RNA sequencing data
RNA sequencing data were obtained from the Genotype Tissue Expression (GTEx) Project16,17 and The Cancer Genome Atlas (TCGA). As batch differences between GTEx and TCGA submissions are well-documented, we used a common RNA-sequencing analysis pipeline to minimize batch effects18. Specifically, all raw reads were aligned against hg19 with STAR, quality control was performed with mRIN19 (samples with mRIN < −0.11 were excluded), quantification with featureCounts20, and batch effect correction with SVAseq21. In total, 10,116 patient samples were used, with 17,993 genes included based on commonality across datasets (Supplementary Table 1). Dimensionality reduction was performed with the scikit-learn StandardScaler and principal component analysis (PCA), and the first 2000 principal components were used for model input. As a benchmark, the top 1000 features selected by Random Forest and all 17,993 features (no PCA) were included in a separate run of the same models.
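The standardize-then-project step above can be sketched as follows. This is a minimal NumPy illustration of StandardScaler followed by PCA via singular value decomposition; the matrix sizes, variable names, and random data are illustrative stand-ins, not the study's actual pipeline code.

```python
import numpy as np

def standardize_and_pca(X, n_components):
    """Center and scale each gene (column), then project samples onto the
    top principal components via SVD. Mirrors the StandardScaler + PCA
    combination described in the text; shapes here are toy-sized."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0            # guard against constant genes
    Z = (X - mu) / sigma
    # SVD of the standardized matrix; rows of Vt are the PC axes
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T     # samples projected onto top PCs

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))          # toy stand-in for 10,116 x 17,993
pcs = standardize_and_pca(X, n_components=5)
```

In the study the projection retains 2000 components; here 5 are kept only to keep the toy example small.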
Deep learning model
Our deep-learning model consists of two models executed in tandem: the first is a multi-task model that classifies the tissue type (non-neoplastic, neoplastic or peri-neoplastic) and the tissue of origin. The subsequent subtyping model is executed only if the sample's tissue of origin has subtyping data available.
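The two-stage gating described above can be sketched in a few lines. The model callables, tissue labels, and the SUBTYPED_TISSUES set below are hypothetical placeholders used only to show the control flow; they are not the study's code or its actual label set.

```python
# Illustrative: tissues for which subtype labels exist (placeholder values)
SUBTYPED_TISSUES = {"breast", "lung", "kidney"}

def classify_sample(sample, multitask_model, subtype_model):
    """Run the multi-task model first; run the subtype model only when
    the predicted tissue of origin has subtyping data available."""
    tissue_type, tissue_origin = multitask_model(sample)
    result = {"type": tissue_type, "origin": tissue_origin, "subtype": None}
    if tissue_origin in SUBTYPED_TISSUES:      # stage-2 gate
        result["subtype"] = subtype_model(sample)
    return result

# Toy stand-in models returning fixed predictions
out = classify_sample(
    sample=None,
    multitask_model=lambda s: ("neoplastic", "breast"),
    subtype_model=lambda s: "luminal_A",
)
```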
Based on prior work in deep learning on transcriptomic data and on model tuning, the encoders for both models consist of 7 fully connected, feed-forward neural network layers (FFNN, Fig. 1B,C). The 5 hidden layers progressively reduce the dimensionality of the input transcriptomic data. Each hidden layer applies a Rectified Linear Unit (ReLU) activation function to its output. ReLU was selected over Sigmoid or Tanh because it avoids vanishing gradients and induces sparse activations, resulting in faster learning and quicker convergence22. Hidden layers 3 through 5 are followed by dropout layers to reduce overfitting. The output layer consists of task heads, each a layer with a Softmax activation function that maps its input to a dimension equal to the number of classes for its task. Specifically, for the multi-task model, the first output head represents the tissue type (non-neoplastic, neoplastic or normal peri-neoplastic, 3 classes) and the second output head represents the tissue of origin (14 classes). Similarly, in the neoplastic subtype model, the output head represents the cancer subtype (11 classes). The Softmax activation forces each output head to produce a probability distribution over its classes. All models were trained for 500 epochs.
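The shared-encoder-with-task-heads idea can be illustrated with a bare NumPy forward pass: ReLU hidden layers feeding two Softmax heads of 3 and 14 classes. The layer widths, random weights, and function names below are illustrative assumptions; the study's trained parameters and framework code are not reproduced, and dropout (a training-time operation) is omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def multitask_forward(x, hidden_weights, head_weights):
    """Shared encoder (ReLU after each hidden layer) followed by one
    Softmax head per task, each sized to its own class count."""
    h = x
    for W in hidden_weights:
        h = relu(h @ W)
    return [softmax(h @ W_head) for W_head in head_weights]

rng = np.random.default_rng(0)
dims = [2000, 512, 128, 32]                          # toy encoder widths
hidden = [rng.normal(scale=0.05, size=(a, b)) for a, b in zip(dims, dims[1:])]
heads = [rng.normal(scale=0.05, size=(32, 3)),       # tissue type head
         rng.normal(scale=0.05, size=(32, 14))]      # tissue-of-origin head
type_probs, origin_probs = multitask_forward(rng.normal(size=(1, 2000)), hidden, heads)
```

Because each head ends in Softmax, its output rows sum to 1 and can be read directly as class probabilities.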
Bayesian hyperparameter tuning
We performed Bayesian hyperparameter optimization with the hyperopt package23, using minimization of the cross-entropy loss over 25 epochs as the optimization objective. For each FFNN, the search space was the Cartesian product of the learning rate, batch size, dropout rate, number of units, optimizer, and activation function (Fig. 1A). Instead of arbitrarily fixing discrete values for the learning rate, batch size, and number of units, we sampled these ranges using the randint function. The optimal hyperparameters were then selected after 100 evaluations (Fig. 1A–C).
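The structure of such a search can be sketched with plain Python. Note this is a random-search stand-in for illustration only: hyperopt's TPE algorithm samples the space adaptively rather than uniformly, and the ranges, the 25-epoch training step, and the objective below are invented placeholders, not the study's settings.

```python
import random

random.seed(0)

def sample_config():
    """Draw one point from a search space over the six hyperparameters
    named in the text; all ranges here are illustrative."""
    return {
        "learning_rate": 10 ** -random.randint(2, 5),   # randint-style range
        "batch_size": random.randint(16, 256),
        "dropout": random.choice([0.1, 0.3, 0.5]),
        "units": random.randint(32, 1024),
        "optimizer": random.choice(["adam", "sgd"]),
        "activation": random.choice(["relu", "tanh"]),
    }

def objective(cfg):
    # Stand-in for "train 25 epochs, return validation cross-entropy";
    # a real objective would train the FFNN with this configuration.
    return abs(cfg["dropout"] - 0.3) + abs(cfg["batch_size"] - 128) / 1000

# 100 evaluations, keep the configuration with the lowest objective
best = min((sample_config() for _ in range(100)), key=objective)
```

With hyperopt itself, the loop would be replaced by `fmin(objective, space, algo=tpe.suggest, max_evals=100)` over an `hp`-defined space.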
Benchmarking against other machine learning approaches
We compared the balanced accuracy of our proposed deep learning classifiers against other machine learning algorithms in the scikit-learn package24, including the Decision Tree (DT), Extra Trees (ET), Support Vector Machine (SVM), Stochastic Gradient Descent (SGD), and K-Nearest Neighbors (KNN) classifiers. In these models, all 17,993 features were used as inputs, and a 70:15:15 ratio was used for train/validation/test splits.
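The two quantitative ingredients of this comparison, the 70:15:15 split and balanced accuracy (the mean of per-class recall, which is what scikit-learn's `balanced_accuracy_score` computes), can be sketched in pure Python. The shuffling seed and toy index range are illustrative assumptions.

```python
import random
from collections import defaultdict

def split_70_15_15(indices, seed=0):
    """Shuffle sample indices and cut them into 70:15:15
    train/validation/test partitions (proportions from the text)."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    n = len(idx)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: each class contributes equally,
    which is why it is preferred over plain accuracy on imbalanced data."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += (t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

train, val, test = split_70_15_15(range(100))
```

For example, predicting the majority class on labels `[0, 0, 0, 1]` gives 75% plain accuracy but only 50% balanced accuracy, since the minority class has zero recall.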