Selection of data sets for QSARs: analyses of Tetrahymena toxicity from aromatic compounds

Source Publication

SAR and QSAR in Environmental Research

The aim of this investigation was to develop a strategy for formulating a valid, ecotoxicologically based QSAR while minimizing the number of toxicological data points required. Two chemical selection approaches, distance-based optimality and K Nearest Neighbor (KNN), were used to examine how the number of compounds used in the training and testing phases of QSAR development (i.e., diversity and representivity, respectively) affects the predictivity (i.e., external validation) of the QSAR. Regression-based QSARs for the ecotoxic potency of aromatic compounds (benzenes), measured as population growth impairment of the aquatic ciliate Tetrahymena pyriformis, were developed from descriptors of chemical hydrophobicity and electrophilicity. A ratio of one compound in the training set to three in the test set was applied. The results indicate that, from a known chemical universe (in this case 385 derivatives), robust QSARs of equal quality may be developed from a small number of diverse compounds and validated by a representative test set. As a conservative recommendation, it is suggested that there should be a minimum of 10 observations for each variable in a QSAR.
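The workflow described above can be illustrated with a minimal sketch. It is not the authors' procedure. It assumes a greedy maximin rule as a simple stand-in for distance-based optimality, a random draw in place of the KNN-based test-set selection, and entirely synthetic descriptor and toxicity values. The function name `select_diverse` and all variable names are hypothetical. Only the 385-compound universe, the two descriptors, the 1:3 training-to-test ratio, and the 10-observations-per-variable rule come from the abstract.

```python
import numpy as np

def select_diverse(X, n_select):
    """Greedy maximin selection: repeatedly add the compound that is
    farthest (in standardized descriptor space) from those already chosen.
    This is an assumed stand-in for a distance-based optimality design."""
    X = np.asarray(X, dtype=float)
    # Standardize so each descriptor contributes equally to distances
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Deterministic start: the compound farthest from the centroid
    chosen = [int(np.argmax(np.linalg.norm(Z, axis=1)))]
    min_dist = np.linalg.norm(Z - Z[chosen[0]], axis=1)
    while len(chosen) < n_select:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        # Track each compound's distance to its nearest chosen compound
        min_dist = np.minimum(min_dist, np.linalg.norm(Z - Z[nxt], axis=1))
    return np.array(chosen)

rng = np.random.default_rng(0)
n_compounds = 385  # size of the chemical universe stated in the abstract

# SYNTHETIC placeholders for hydrophobicity and electrophilicity descriptors
descriptors = rng.normal(size=(n_compounds, 2))
# SYNTHETIC toxicity values; the coefficients here are invented for illustration
toxicity = (0.6 * descriptors[:, 0] - 0.4 * descriptors[:, 1]
            + rng.normal(scale=0.2, size=n_compounds))

n_train = 20  # 10 observations per variable for a two-descriptor model
train_idx = select_diverse(descriptors, n_train)

# 1:3 training-to-test ratio; random draw here, not the KNN approach
rest = np.setdiff1d(np.arange(n_compounds), train_idx)
test_idx = rng.choice(rest, size=3 * n_train, replace=False)

# Ordinary least squares with an intercept column
A_train = np.column_stack([descriptors[train_idx], np.ones(n_train)])
coef, *_ = np.linalg.lstsq(A_train, toxicity[train_idx], rcond=None)

# External validation on the held-out test set
A_test = np.column_stack([descriptors[test_idx], np.ones(len(test_idx))])
pred = A_test @ coef
y_test = toxicity[test_idx]
q2_ext = 1.0 - np.sum((y_test - pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)

print("coefficients (descriptor 1, descriptor 2, intercept):", coef)
print("external q2:", q2_ext)
```

Standardizing the descriptors before computing distances keeps either one from dominating the diversity criterion. With real data, the test set would be chosen for representativeness rather than at random.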
