Finding Hidden Insights in Data Using Non-Linear Support Vector Machine and Statistical Methods
My name is Suhas Vittal, and I am a rising senior at Basis Scottsdale. Throughout my internship, I worked with biological datasets to extract new information, often using statistical and machine learning methods (1). The internship was split into three parts. In the first part, I worked with data on transcript assays, lipid concentrations, protein expression levels, and miRNA assays. The original objective of this project was to “find novel insights”: a pretty vague objective (as new research often is). The approach I used most often was a non-linear Support Vector Machine (SVM) (2), with which I tested relational strength through the coefficient of determination (R²).
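To make the metric concrete, here is a minimal sketch of how a coefficient of determination can be computed from observed and predicted values. It is not the SVM pipeline from the internship: the function name `r_squared` and all of the numbers are made up for illustration, and the predictions stand in for whatever a fitted model would output.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of a baseline that always predicts the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical values, not taken from the internship data.
observed = [1.0, 2.0, 3.0, 4.0]
good_fit = [1.1, 1.9, 3.2, 3.8]  # predictions close to the observations
poor_fit = [4.0, 3.0, 2.0, 1.0]  # predictions that run opposite to the data

score_good = r_squared(observed, good_fit)
score_poor = r_squared(observed, poor_fit)
print(score_good, score_poor)
```

The first score lands close to 1, while the second is negative because those predictions are worse than simply guessing the mean every time.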
It may be surprising that the coefficient of determination, R², can be less than 0. For two-dimensional datasets fit with a simple linear regression, it equals the square of the correlation coefficient, which can never be negative; that is where the “squared” in the name comes from. In general, though, R² compares a model’s errors against those of a baseline that always predicts the mean, so a model that does worse than that baseline scores below 0. For some datasets, particularly when I compared miRNA assays to transcript assays, we often obtained weak determination values. This conflicted with the understanding that miRNA strands inhibit and act on certain mRNA transcripts, which motivated us to drop this dataset and move on to my next project.

The second project focused on finding new relationships in a dataset containing information on young adults. Features included their diet, allergies, parents’ diseases, and protein expression levels. Since we were interested in phenotypic relationships to biological objects, I generally compared protein expression levels (a continuous variable) to phenotypic variables like diet (a discrete variable), so most of my analyses focused on classification algorithms. Most of them featured the Multilayer Perceptron (MLP) model, which is often associated with deep learning. An MLP has an input layer of neurons, some number of hidden layers, and an output layer; deep learning refers to networks with many hidden layers, in some architectures 100 or more. The largest issue was the number of responses to analyze. Using an MLP enabled a shotgun approach, in which we compared protein expression levels with multiple response variables at once, which mitigated that problem.
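As a rough illustration of the MLP structure described above, the sketch below pushes one made-up vector of protein expression levels through an untrained network with several outputs, one per response variable. The layer sizes, the random weights, and the names `make_layer` and `forward` are all assumptions for illustration; this is not the model used in the project, and no training is shown.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(layers, inputs):
    """Pass inputs through every layer, applying a sigmoid after each one."""
    activations = inputs
    for weights, biases in layers:
        activations = [
            1.0 / (1.0 + math.exp(-(sum(w * a for w, a in zip(row, activations)) + b)))
            for row, b in zip(weights, biases)
        ]
    return activations


rng = random.Random(0)
# Hypothetical sizes: 5 protein levels in, two hidden layers of 8 neurons,
# and 3 outputs, one per response variable examined at the same time.
layer_sizes = [5, 8, 8, 3]
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

protein_levels = [0.2, 1.5, 0.7, 0.0, 2.1]  # made-up input, not real data
outputs = forward(layers, protein_levels)
print(outputs)
```

Because every output shares the same hidden layers, a single pass scores all responses together, which is what makes the shotgun comparison cheap.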
However, the shotgun approach created a dilemma: if protein X was related to responses A and B, I could not tell which one it was related to, only that it was related to the two together. Nevertheless, there were interesting results. sEGFR (soluble epidermal growth factor receptor) was related to diet, which is significant because sEGFR has been connected to epithelial ovarian cancer, and G-CSF showed a larger distribution spread than other proteins (3).

The last project focused on applying the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm to the prior dataset in order to visualize relationships, specifically the relationship between protein expression levels and a participant’s perceived risk for diabetes. Originally, we tried to develop a distance hierarchy, which is a tree representation of relationships, but we then turned to a naïve Euclidean distance method and a more experimental method based on the Data Fusion by Matrix Factorization (DFMF) algorithm; if DFMF had been successful in creating clusters, we could potentially have applied t-SNE to a graph database (4). DFMF did not really live up to the hype, but it did get better as more data was added, so future work could apply DFMF to larger graph databases to look for clustering. The naïve method worked well: it compressed the data into a 3-dimensional embedding, yielding two curves and a surprising mid-point group, which I labeled the “at risk” group. Future work on t-SNE is planned: we will run HDBSCAN on the t-SNE output and then run an ANOVA test on the proteins in each of the three groups to see which ones are potentially important.
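The planned ANOVA step can be sketched as a one-way F-statistic computed for a single protein across the groups found in the t-SNE output. The cluster names and expression values below are invented for illustration, and the function `one_way_anova_f` is a hypothetical helper rather than anything from the project. A real analysis would also need a p-value from the F distribution, which a statistics library such as SciPy (`scipy.stats.f_oneway`) can provide.

```python
def one_way_anova_f(groups):
    """F-statistic for a one-way ANOVA across a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group variability: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group variability: spread of values around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical expression levels of one protein in three clusters.
cluster_a = [1.0, 2.0, 3.0]
cluster_b = [2.0, 3.0, 4.0]
cluster_c = [5.0, 6.0, 7.0]

f_stat = one_way_anova_f([cluster_a, cluster_b, cluster_c])
print(f_stat)
```

A large F-statistic means the group means differ by much more than the spread within each group, which would flag that protein as worth a closer look.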
My future is probably hazier than anything I want to do with t-SNE. My major will be computer science, but what I will do with that degree, whether research or industry, is less clear. Right now, three fields that really interest me are Automata Theory, Formal Languages, and Machine Learning.
- Data provided by Immunomonitoring program at A*STAR/SIgN
October 18, 2018