In recent years, many natural products have been purified and shown to have cancer chemopreventive activity in the laboratory, as exemplified by Camptothecin, Vinblastine, Embelin and Paclitaxel (Dai et al., 2011; Goldwasser et al., 1995; Lynch et al., 2012; Of Trialists, 2011). These agents from natural sources have contributed significantly to the successful treatment of melanoma, leukemia, breast cancer and many other cancers. In addition, a growing number of new derivatives based on the structures of natural products have become promising candidates for antitumor drugs through laboratory design, synthesis and screening (Chen et al., 2006; Rodríguez-Berna et al., 2014; Silvestri, 2013). However, experimental methods for finding natural product lead structures are expensive and time-consuming. Therefore, the use of computational methods based on structure-activity relationships (SAR) has been intensively investigated.
Traditional approaches to exploring SAR rely on synthetic chemists producing a range of analogues based on the basic skeleton of a lead structure and then searching empirically for structural properties predictive of antitumor activity (Cao et al., 2013; Dong et al., 2012; Liu et al., 2015; Zhang et al., 2007). These SAR studies tried to predict responses in a single cell line or a single tissue type using only structural data. Although much progress has been made, the problem of predicting response to natural products is far from solved.
In the present study, we developed a machine learning approach that, for the first time, predicts the response of cancer cell lines to natural products from both the gene expression of the cell lines (genomic information) and the chemical descriptors of the natural products (chemical structure). Empirical results show that our method performs well when predicting the sensitivity of hundreds of cancer cell lines to natural products in the test set and in case study analyses, and indicate that both structural properties and gene expression signatures are important determinants of the antitumor activity of natural products. Taken together, this study outlines a first approach to predicting drug response for natural products and to generating novel natural product candidates for further study.
Materials and Methods
To develop robust predictors of response to natural products, we collected and annotated a large published preclinical dataset, the Genomics of Drug Sensitivity in Cancer (GDSC) (Garnett et al., 2012). This dataset includes drug sensitivity data for 138 drugs across almost 700 cell lines. By retrieving the drug information from the PubChem database (http://pubchem.ncbi.nlm.nih.gov), we identified 17 of these 138 drugs as natural products or their derivatives (Table 1). These natural products were screened in GDSC across 279–565 cell lines per drug (mean = 495 cell lines per drug), representing 8,420 cancer cell line-drug interactions. The publicly available drug sensitivity data (drug IC50 values) for all 17 natural products were downloaded from GDSC (http://www.cancerrxgene.org).
|Dataset|Natural product|Number of cancer cell line-natural product interactions|
Of these 17 natural products, 13 were randomly chosen for model building, representing 6,450 cancer cell line-natural product interactions (training set, Table S1). The remaining 4 natural products, with 1,970 cancer cell line-natural product interactions, were used as the test set (Table S2).
An independent test set (case studies) was extracted from the literature to further assess the performance of our proposed method. By searching the anticancer herbs database of systems pharmacology (CancerHSP) (Tao et al., 2015) and natural product-related studies in PubMed (http://www.ncbi.nih.gov/pubmed), we obtained two antitumor natural products (Curcumin and Resveratrol), which have been proven effective in inhibiting proliferation and inducing apoptosis in various kinds of cancer cell lines. Curcumin was screened on 16 cancer cell lines derived from 5 cancer types, and Resveratrol on 13 cancer cell lines derived from 6 cancer types. After removing cell lines for which we could not find the corresponding gene expression information in GDSC, we obtained 7 and 8 cancer cell line-natural product interactions for Curcumin and Resveratrol, respectively (Table 2).
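The split described above is made at the level of whole natural products rather than individual interactions, so no test compound contributes samples to training. A minimal Python sketch of such a split is given here; the compound identifiers and the random seed are placeholders, not the authors' actual selection.

```python
import random

def split_by_compound(compounds, n_train, seed=0):
    """Randomly assign whole compounds to a training or a test set."""
    rng = random.Random(seed)
    shuffled = list(compounds)
    rng.shuffle(shuffled)
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])

# Placeholder identifiers standing in for the 17 natural products.
compounds = [f"NP{i:02d}" for i in range(1, 18)]
train, test = split_by_compound(compounds, n_train=13)
print(len(train), len(test))
```

Every interaction of a compound then follows that compound into the training or the test set.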
The GDSC gene expression microarray data were derived directly from the work of Geeleher, Cox & Huang (2014). Subsequent analyses were restricted to 12,026 annotated genes with Entrez Gene ID.
The chemical features of the natural products were generated with PaDEL software (Yap, 2011) from the simplified molecular-input line entry system (SMILES) (Weininger, 1988). The SMILES files for the natural products were collected manually from the PubChem database (http://pubchem.ncbi.nlm.nih.gov). Initially, we obtained 1,444 1-D and 2-D descriptors of the natural products directly from PaDEL. Chemical features with the same value across all natural products were then eliminated, leaving 1,114 chemical features for this study.
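Removing descriptors that take the same value for every compound can be sketched as follows. This is an illustrative Python version, not the authors' code; the descriptor names `d1` to `d4` and all values are made up.

```python
def drop_constant_features(matrix, names):
    """Keep only the columns that take more than one distinct value."""
    keep = [j for j in range(len(names))
            if len({row[j] for row in matrix}) > 1]
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [names[j] for j in keep]

# Toy descriptor table: 3 compounds x 4 descriptors (placeholder values).
names = ["d1", "d2", "d3", "d4"]
matrix = [
    [20, 2, 1.5, 0],
    [31, 2, 3.2, 0],
    [27, 2, 0.8, 0],
]
filtered, kept = drop_constant_features(matrix, names)
print(kept)  # d2 and d4 are constant across compounds and are dropped
```

A constant descriptor cannot help distinguish one compound from another, which is why such columns can be discarded before modeling.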
Results and Discussion
Strategy for prediction of cancer cell sensitivity to natural products
Our goal was to use gene expression and in vitro drug sensitivity data derived from cell lines, together with chemical properties, to predict the response of cell lines to natural products. The conceptual framework for predicting cancer cell sensitivity to natural products is shown in Fig. 1. In the first step, cell lines in GDSC were clustered into two groups (Sensitive and Resistant) or three groups (Sensitive, Resistant and Intermediate) according to their sensitivities (drug IC50 values) to a given drug, using the K-Means algorithm in WEKA (Hall et al., 2009). Here K was set to 2 or 3, meaning that the cancer cell lines were divided into 2 or 3 groups. Samples in the Sensitive and Resistant groups were used to build the machine learning models. Then, the performance of the J48 (Decision Tree), SVM (Support Vector Machine), Random Forest and Rotation Forest (Rodriguez, Kuncheva & Alonso, 2006) models was comprehensively evaluated. After this step, we used genomic features from the gene expression data and chemical features to construct the prediction model, with the optimal number of features selected by t-test using R scripts (Gentleman et al., 2011).
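The clustering step can be illustrated with a small one-dimensional K-Means written in plain Python. This is a sketch only: the study used WEKA's K-Means, whereas the deterministic initialization, the function names and the IC50 values below are assumptions made for illustration. Clusters are named by their centers, with the lowest IC50 cluster taken as Sensitive and the highest as Resistant.

```python
def kmeans_1d(values, k, n_iter=100):
    """Lloyd's algorithm on scalar values; returns (labels, centers)."""
    ordered = sorted(values)
    # Deterministic start (an assumption): spread centers over the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def label_groups(values, k=3):
    """Name clusters by center: lowest IC50 is Sensitive, highest is Resistant."""
    labels, centers = kmeans_1d(values, k)
    order = sorted(range(k), key=lambda c: centers[c])
    names = {order[0]: "Sensitive", order[-1]: "Resistant"}
    return [names.get(lab, "Intermediate") for lab in labels]

# Made-up log IC50 values for one drug across nine cell lines.
log_ic50 = [-3.1, -2.8, -2.9, 0.1, 0.3, -0.2, 3.0, 3.4, 2.7]
groups = label_groups(log_ic50, k=3)
# Only Sensitive and Resistant cell lines are kept for model building.
kept = [(v, g) for v, g in zip(log_ic50, groups) if g != "Intermediate"]
```

With K = 3, the middle cluster acts as a buffer, so the two classes passed to the classifiers are more clearly separated than with K = 2.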
Determination of number of cancer cell lines clusters
To find the optimal number of cancer cell line clusters for the K-Means algorithm, the prediction performance for different numbers of clusters (K) was evaluated by 10-fold cross validation (training set) and on the test set using Rotation Forest models. As can be seen in Fig. 2, the AUC (area under the receiver operating characteristic curve) is higher for K = 3 than for K = 2 when the number of features is set to 50. A similar pattern was observed when the number of features was set to 100 or 500 (Figs. S1 and S2, respectively). We therefore chose K = 3, meaning that the cancer cell lines in GDSC were clustered into three groups (Sensitive, Resistant and Intermediate), and only cell lines in the Sensitive and Resistant groups were used in the subsequent analyses.
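For readers unfamiliar with the AUC used in this comparison, the binary case can be computed directly from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The Python sketch below illustrates that definition with made-up labels and scores; it is not WEKA's implementation, and WEKA's class-weighted AUC may differ in detail.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = Sensitive, 0 = Resistant; scores are predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(y_true, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.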
Assessment of feature importance
In the feature selection step, 10-fold cross validation on the training set was used to determine the optimal number of features. For every model except SVM, the AUC first increased and then decreased as the number of selected features grew (Fig. 3). As a result, the top 1,000 features were chosen as the optimal feature set for further analyses.
The top 1,000 features included 468 genes (genomic features), of which 59 are cancer-related genes (oncogenes or tumor suppressor genes). Oncogenes were obtained from the Cancer Gene Census database (Futreal et al., 2004), and tumor suppressor genes from the TSGene database (Zhao, Sun & Zhao, 2013). We carried out a permutation test as follows. We randomly sampled 468 genes from the full set of 12,026 genes 1,000 times; the mean number of overlapping cancer-related genes was only 36.2, and the maximum over the 1,000 samples was 54, which is still less than 59. Because none of the 1,000 random samples reached 59, the empirical P-value is below 0.001. The genomic features (genes) we selected are therefore enriched for genes related to tumorigenesis.
The top 1,000 features also contained 532 chemical descriptors of natural products. The systematic machine learning-based integration of several data sources, including chemical structure and genomic information, can provide better discriminative power than models that use only a single data source. This may present a simple and promising strategy for predicting the antitumor activity of unknown natural products using pharmacology data and machine learning approaches.
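Ranking features by a two-sample t statistic between the Sensitive and Resistant groups and keeping the top ones can be sketched as follows. The study performed this step with R scripts; the Python below is an illustrative stand-in that assumes Welch's unequal-variance form of the test (the source does not say which variant was used), and the data are made up.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples (unequal variances allowed)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def rank_features(X, y, top_k):
    """Return the indices of the top_k features with the largest |t|."""
    scored = []
    for j in range(len(X[0])):
        sens = [row[j] for row, lab in zip(X, y) if lab == "Sensitive"]
        res = [row[j] for row, lab in zip(X, y) if lab == "Resistant"]
        scored.append((abs(welch_t(sens, res)), j))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scored[:top_k]]

# Toy data: feature 0 separates the classes, feature 1 does not, feature 2 weakly.
X = [
    [5.0, 1.0, 2.0],
    [5.2, 0.9, 2.4],
    [4.9, 1.1, 2.1],
    [1.0, 1.0, 1.8],
    [1.1, 1.1, 1.9],
    [0.8, 0.9, 1.6],
]
y = ["Sensitive"] * 3 + ["Resistant"] * 3
top = rank_features(X, y, top_k=2)
```

Repeating the cross validation for several values of `top_k` and comparing the resulting AUCs is one way to arrive at a cutoff such as the 1,000 features chosen here.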
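The permutation test can be sketched in Python as shown here. The sizes 12,026, 468 and 59 come from the text, but the total number of annotated cancer-related genes is not stated; the value 930 is a hypothetical figure chosen only so that the null mean falls near the reported 36.2. The add-one correction in the P-value is also a choice made for this sketch, so that the estimate is never exactly zero.

```python
import random

def permutation_enrichment(n_background, n_selected, n_annotated, observed,
                           n_perm=1000, seed=0):
    """Empirical P-value for drawing >= `observed` annotated genes by chance."""
    rng = random.Random(seed)
    background = range(n_background)
    # Which gene IDs are annotated is arbitrary under random sampling.
    annotated = set(range(n_annotated))
    overlaps = []
    for _ in range(n_perm):
        sample = rng.sample(background, n_selected)
        overlaps.append(sum(1 for g in sample if g in annotated))
    n_extreme = sum(1 for o in overlaps if o >= observed)
    # Add-one correction: the smallest reportable value is 1 / (n_perm + 1).
    p_value = (n_extreme + 1) / (n_perm + 1)
    return p_value, sum(overlaps) / n_perm, max(overlaps)

# n_annotated=930 is hypothetical; the other numbers are taken from the text.
p_value, null_mean, null_max = permutation_enrichment(
    n_background=12026, n_selected=468, n_annotated=930, observed=59)
```

With 1,000 permutations, the resolution of the test is limited to roughly 0.001, which is why a result with no extreme samples is reported as a bound rather than as zero.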
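One simple way to integrate the two data sources is to describe each cell line-natural product interaction by the cell line's gene expression values followed by the compound's chemical descriptors. The Python sketch below assumes plain concatenation, which the text does not spell out; all names and values are placeholders.

```python
def build_samples(expression, descriptors, interactions):
    """Concatenate cell-line and compound features for each labeled pair."""
    X, y = [], []
    for cell_line, compound, label in interactions:
        X.append(expression[cell_line] + descriptors[compound])
        y.append(label)
    return X, y

# Toy inputs: 2 genes per cell line, 3 descriptors per compound (placeholders).
expression = {"CL1": [7.1, 3.2], "CL2": [6.4, 8.8]}
descriptors = {"NP_A": [1.0, 0.0, 12.5], "NP_B": [0.0, 1.0, 9.3]}
interactions = [
    ("CL1", "NP_A", "Sensitive"),
    ("CL2", "NP_A", "Resistant"),
    ("CL1", "NP_B", "Resistant"),
]
X, y = build_samples(expression, descriptors, interactions)
```

Under this layout, the same cell line paired with two compounds shares its genomic columns and differs only in the chemical columns, which lets a single model be trained across many compounds.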
Comparison of different machine learning methods
To identify the machine learning technique best suited to predicting cancer cell sensitivity to natural products, we comprehensively evaluated the performance of SVM (LibSVM), Decision Tree (J48), Random Forest and Rotation Forest classifiers. All of these algorithms were run in the Weka package with the default parameter configuration. Rotation Forest, which proved to be a relatively stable machine learning method in our previous work (Xia, Han & Huang, 2010), also performed best under 10-fold cross validation in this study (AUC = 0.87, Fig. 3). A consistent trend was observed in the test set (Fig. 4), where the AUCs for Camptothecin, Epothilone B, Paclitaxel and Shikonin are 0.88, 0.89, 0.79 and 0.81, respectively.
The detailed classifier assessment results for 10-fold cross validation (training set) and the test set are shown in Table 3. The numbers of cancer cell line-natural product interactions for Camptothecin, Epothilone B, Paclitaxel and Shikonin are 321, 303, 168 and 244, respectively. The performance of each model is measured by five metrics: Precision, Recall, F-Measure, AUC and Accuracy (Fawcett, 2006). Precision, Recall and F-Measure are calculated for each class, whereas AUC and Accuracy are automatically weighted in WEKA across all classes. As shown in Table 3, all 4 methods obtained good results under 10-fold cross validation (training set) and on the test set.
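The per-class metrics named above follow from counts of true positives, false positives and false negatives. The Python sketch below shows the standard definitions on made-up labels ("S" for Sensitive, "R" for Resistant); it is not WEKA's code and does not reproduce any value from Table 3.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall and F-measure with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Made-up labels for eight cell lines.
y_true = ["S", "S", "S", "S", "R", "R", "R", "R"]
y_pred = ["S", "S", "S", "R", "R", "R", "S", "S"]
prec_s, rec_s, f_s = class_metrics(y_true, y_pred, positive="S")
acc = accuracy(y_true, y_pred)
```

Reporting the metrics for each class separately shows whether a model favors one group, which a single accuracy figure can hide.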
Table 3. Classifier assessment results, with panels including (A) Cross validation and (C) Epothilone B, and classifier rows including Support vector machines.
To further illustrate the effectiveness of our approach for detecting cancer cell sensitivity to natural products, we present two additional natural product examples. By searching the CancerHSP database (Tao et al., 2015) and natural product-related studies in the PubMed database (http://www.ncbi.nih.gov/pubmed), we obtained 2 natural products screened on 29 cancer cell lines: Curcumin (Bush et al., 2001; Choudhuri et al., 2002; Khor et al., 2006; Radhakrishna Pillai et al., 2004; Wang et al., 2006) and Resveratrol (Chen et al., 2004; Clément et al., 1998; Ding & Adrian, 2002; Hsieh & Wu, 1999; Lu & Serrero, 1999; Niles et al., 2003; Whyte et al., 2007). Both have been proven effective in the prevention and treatment of various cancers, including melanoma, lung cancer and ovarian cancer (Tao et al., 2015). After eliminating cancer cell lines for which we could not find the corresponding gene expression information in GDSC, we obtained 7 and 8 cancer cell line-natural product interactions for Curcumin and Resveratrol, respectively. The prediction results for these two natural products are shown in Table 2.
Case study 1: curcumin
Curcumin, a phenolic compound from the rhizome of the plant Curcuma longa, induces apoptosis in tumor cells via p53-dependent or p53-independent pathways. We predicted the responses of 7 cell lines reported to be sensitive to Curcumin: 4 melanoma cell lines, 1 lung cancer cell line, 1 breast cancer cell line and 1 pancreatic cancer cell line (Table S5). Of these 7 responders, 6 were correctly classified by our model (Table 2). The only misclassified cell line is Sk-mel-5, a melanoma cell line containing wild-type p53. Because the other 3 melanoma cell lines in this study contain mutant p53 (Bush et al., 2001), this difference may explain why our method did not give the correct result for the Sk-mel-5 cell line.
Case study 2: resveratrol
Resveratrol, a plant polyphenol found in grapes and a variety of human foods, is reported to have protective effects against various cancers. Its mechanisms of action include induction of apoptosis via different pathways and an antiestrogenic effect. Responses of 8 cell lines to Resveratrol were predicted in this study: 2 melanoma cell lines, 1 lung cancer cell line, 3 breast cancer cell lines, 1 pancreatic cancer cell line and 1 prostate cancer cell line (Table S5). SK-MEL-28, one of the two human melanoma cell lines used here, was predicted to be sensitive. The other melanoma cell line, A375, differs from SK-MEL-28 in being amelanotic. In addition, Resveratrol induced phosphorylation of ERK1/2, which can promote the expression of genes associated with proliferation and differentiation, in A375 cells but not in SK-MEL-28 cells. Whether these differences contribute to the incorrect prediction of the A375 response to Resveratrol remains to be determined. The breast and prostate cell lines used here were all classified correctly. Altogether, 5 of the 8 cancer cell line-natural product interactions were correctly predicted by our model (Table 2).
In this study, we investigated the inherent determinants of the antitumor activity of natural products. For this purpose, we developed a machine learning method to predict responses to natural products across a panel of cancer cell lines based on both gene expression data and the chemical properties of the natural products. Our results show that it is possible to enrich for natural product responders using gene expression and chemical descriptors, by applying models generated from a large panel of cancer cell lines. The performance of our approach was first evaluated using 10-fold cross validation (training set) and the test set, and then further validated by modeling two additional natural products (case study analyses). The experimental results show that our method can effectively predict the response of cancer cell lines to natural products.
Although our final best model is based on both the gene expression signatures of cancer cell lines and the chemical properties of the compounds, novel features that better describe natural product sensitivity can easily be incorporated into our prediction system to further improve its performance. In future work, we will add other genomic features, such as mutation information, to the prediction model. Besides genomic information, epigenetic and protein-level information also plays a very important role in the mechanisms of natural product response and should therefore be incorporated into our prediction system. It should also be noted that the current study focused on “natural product sensitivity in cancer.” In the future, we will consider extending our model to sensitivity prediction for non-natural products. Finally, we will offer an online web interface through which our approach can be used to computationally predict natural product sensitivity.
Comparison between the case K = 2 and K = 3 when feature number = 100
Bar chart showing that K = 3 (blue) yields a higher AUC than K = 2 when the number of features is set to 100. Cluster3, the case K = 3; Cluster2, the case K = 2; CV, cross validation; Camp, Camptothecin; Epot, Epothilone B; Pacl, Paclitaxel; Shik, Shikonin; AUC, area under the curve.
Comparison between the case K = 2 and K = 3 when feature number = 500
Bar chart showing that K = 3 (blue) yields a higher AUC than K = 2 when the number of features is set to 500. Cluster3, the case K = 3; Cluster2, the case K = 2; CV, cross validation; Camp, Camptothecin; Epot, Epothilone B; Pacl, Paclitaxel; Shik, Shikonin; AUC, area under the curve.