Five Big Data Insights to Keep Yourself Sane
If cleanliness is next to godliness, then I have walked upon the most unholy side streets in the otherwise-wonderful world of Data. I am not referring to small datasets sitting in lonely places on the Internet. I am referring to the most prized and established public-domain Big Data repositories that one could find in the realm of bioinformatics. You might know their names. Pharos. PubChem. Hetionet. The Cancer Genome Atlas (TCGA). All of these bring something unique to our understanding of disease, especially cancer. TCGA describes the genetics of thirty-three different forms of cancer [1]. PubChem houses over 300 million chemicals and drugs, with additional data on how these chemicals interact with living tissue [2]. Hetionet portrays the connections between different datasets [3]. Pharos is the window into the “druggable genome,” or how drugs associate with genes and proteins [4].
In the spirit of my first and second summers (yes, this is my third summer with Systems Imagination!), I embarked on a team project to extract, transform, and load the aforementioned datasets, plus several others. Along the way to the finished product (the nature of which I’ll clarify later), I learned something new from every dataset, especially the ones I spent more time with. In particular, I learned that it’s all too easy to go down rabbit holes, especially with the complexity that Big Data inevitably entails. Thus, here are my five Big Data insights, which may help you avoid the same frustrations.
1. Big Data is trendy...up to a point.
Going to a nerdy school, I couldn’t count how many times we talked about “artificial intelligence” or “Big Data” over lunch. Any student with the right math and coding background dove headfirst into these topics, and many who couldn’t became engrossed in mass-media articles about the latest technologies.
What nobody talked about, but what I soon came to realize, was that Big Data is a hot topic up until you get into data wrangling: the gritty, unwieldy, and often tedious process of turning raw data into clean data.
Yes, that kind of wrangling.
Analysis and wrangling are inevitably intertwined. Only ten percent of my entire time at Systems Imagination was devoted to data analysis. The other ninety percent was purely data wrangling. That’s not because I couldn’t analyze data. As I quickly realized, leveraging powerful tools like neural networks and graph traversals only worked once I had turned the raw data into something incredibly clean. Don’t feel frustrated if less than twenty percent of your project involves the trendy stuff. Without the gritty stuff, your tools will throw up confusing error messages, trivial correlations, or (worst of all) junk output disguised as novel insights. Which leads me to my next point…
2. Garbage in, garbage out.
Or, in the words of the “father of the computer,” Charles Babbage, “On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.” Computers cannot think for themselves. Thus, it is never the computer’s fault, and always the humans’ fault, if the output is terrible. I say “humans” with a plural because it’s not always the coder’s fault. Your IT department could have messed up the last software update. Or another user could have inadvertently deleted your most precious data. Or somebody could have done something fishy to the data source.
The first two possibilities appear more likely, but of the three I listed, only the third was a clear and present danger to my endeavors. I recall that another Systems Imagination team had to contend with imputed data. This means that missing data points were filled with a number, in this case, the average of the non-missing data. Such data would produce nonsense correlations from most tools. Upon realizing this, their plans for analysis went bust.
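To make the imputation problem concrete, here is a minimal sketch in Python. It is not the other team's data or code: the values are synthetic, and `pearson` and `mean_impute` are hypothetical helper names written for this illustration. It shows one way mean-filling distorts results: the filled-in points all sit at the average, so they shrink the spread of the data and change the correlation a tool reports.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def mean_impute(values):
    """Replace every None with the average of the non-missing values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


# Synthetic data (an assumption for illustration): y closely tracks x.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(1000)]
y = [xi + rng.gauss(0, 0.3) for xi in x]

# Pretend 60% of the y measurements went missing, then were mean-filled.
y_missing = [None if rng.random() < 0.6 else yi for yi in y]
y_imputed = mean_impute(y_missing)

r_true = pearson(x, y)
r_imputed = pearson(x, y_imputed)
print(f"correlation on complete data: {r_true:.2f}")
print(f"correlation after mean fill:  {r_imputed:.2f}")
```

The second number comes out lower than the first even though the underlying relationship never changed. The danger is that nothing in the filled-in column announces that it was imputed.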
A blatantly nonsense correlation. It’s rarely this easy to spot useless outputs [6].
Before I talk about my next example, here’s a fun fact about how computers read text files. Every character is stored as a numeric code. Some characters are special: NEWLINE tells the computer that a new line has started. TAB indicates a tab, sometimes used to separate a file’s row into distinct columns. NULL (character code 0) is treated by many programs as the end of the data.
As we realized, some of TCGA’s data had both “NULL” and “TAB” values in all the wrong places. Our computers were very confused: “Why is there a TAB here? Does that mean this row had seven columns when the last row had five?” Or worse, “why is there a NULL here, when the file does not end here?” Assigning blame wouldn’t have helped much, even though the list of suspects here is relatively small. The reality was that the data had soaked up problems somewhere, and it was up to us to fix it. Which we did.
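A check like the following would flag both kinds of trouble before a parser trips over them. This is a minimal sketch rather than the code we actually ran: `find_suspect_rows` is a hypothetical helper, and the sample rows are made up rather than taken from TCGA.

```python
# The special characters are ordinary numeric codes under the hood.
TAB_CODE = ord("\t")      # 9
NEWLINE_CODE = ord("\n")  # 10
NUL_CODE = ord("\x00")    # 0


def find_suspect_rows(lines, delimiter="\t"):
    """Return (line_number, reason) pairs for rows that look damaged.

    A row is suspect if it contains a NUL character or if its column
    count differs from that of the first row.
    """
    problems = []
    expected = None
    for number, line in enumerate(lines, start=1):
        row = line.rstrip("\n")
        if "\x00" in row:
            problems.append((number, "NUL character"))
        width = row.count(delimiter) + 1
        if expected is None:
            expected = width  # the first row sets the expected width
        elif width != expected:
            problems.append((number, f"{width} columns, expected {expected}"))
    return problems


# Made-up rows: line 3 has a stray TAB, line 4 has a stray NUL.
sample = [
    "gene\tsample\tvalue\n",
    "TP53\tA1\t0.5\n",
    "BRCA1\tA2\t\t0.7\n",
    "EGFR\tA3\x00\t0.9\n",
]

for number, reason in find_suspect_rows(sample):
    print(f"line {number}: {reason}")
```

Running a scan like this over every file first turns a confusing parser crash into a short list of line numbers to inspect.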
3. Good data + bad formatting = bad data.
A little backstory about the PubChem dataset. PubChem contains, as I said earlier, data on over 300 million chemicals. Some of these chemicals might perform critical functions that would contribute to, say, a novel treatment for triple-negative breast cancer. I call that excellent data. Unfortunately, the sound data had a lousy format: the structure-data file (SDF). SDF may be familiar to chemists, as it is a prime choice to store data on chemical structures. Strangely, most of the data in PubChem’s SDF files were not chemical structures. Here’s a snippet of some data from everyone’s favorite drug: caffeine.
As an aside, I generally resisted the allure of caffeine throughout my third summer with SII.
These additional values were the only ones I needed to parse, not the chemical structure data. Here they were, in the same file, with a structure that is not remotely tabular. Bye-bye SQL.
The solution is ugly but necessary: custom code. With the full power of one Windows desktop, my Python code would have taken two months. Instead, I used eight desktops, reducing the runtime to (“just”) two weeks. By the way, most datasets aren’t nearly this cruel. TCGA was nicely formatted in tables. One usually can’t control the file format that a data source hands out, but there’s always a solution, even if it involves writing custom code.
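For a flavor of what such custom code can look like, here is a minimal sketch of pulling only the extra values out of an SDF file while skipping the structure block. It is not my original parser. It assumes the usual SDF layout, in which each value follows a header line of the form `> <FIELD_NAME>`, a blank line ends the value, and `$$$$` ends the record. The sample record is hand-written for illustration, and `parse_sdf_data_items` is a hypothetical name.

```python
import re

# Header lines for data items look like "> <FIELD_NAME>".
HEADER = re.compile(r"^>.*?<([^>]+)>")
RECORD_END = "$$$$"


def parse_sdf_data_items(lines):
    """Yield one dict of data items per record, ignoring structure lines."""
    record, field, values = {}, None, []

    def close_field():
        # Store the field collected so far (if any) and reset the buffer.
        nonlocal field, values
        if field is not None:
            record[field] = "\n".join(values)
        field, values = None, []

    for raw in lines:
        line = raw.rstrip("\r\n")
        header = HEADER.match(line)
        if line == RECORD_END:
            close_field()
            yield record
            record = {}
        elif header:
            close_field()
            field = header.group(1)
        elif field is not None:
            if line:
                values.append(line)
            else:
                close_field()  # a blank line ends the current value
        # Anything else belongs to the structure block and is skipped.


# Hand-written sample record; the structure block is only a placeholder.
sample = """\
2519
  (structure block omitted in this sketch)

M  END
> <PUBCHEM_COMPOUND_CID>
2519

> <PUBCHEM_MOLECULAR_FORMULA>
C8H10N4O2

$$$$
"""

records = list(parse_sdf_data_items(sample.splitlines()))
print(records)
```

Because the function is a generator, it handles one record at a time and never needs the whole file in memory, which matters when the files are this large. Each resulting dict is finally something that fits in a table row.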
4. Open your data files.
I have two reasons for suggesting this. The first is less optimistic: your code could have made a subtle but devastating mistake. Across the span of one week, we discovered such errors in three different files. Naturally, this week got very busy. First, a sizable proportion of the raw data had “gone missing” in one data source. After half an hour of wondering where the data went, and an extra two hours of debugging our parser, we resolved to rewrite the code completely. The next day, I was surprised to find three copies of another dataset, complete with column headers and cleaned-up data values, in one file. I had re-run the code without changing file paths, causing the same file to append to itself. Twice.
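The append-to-itself mistake has a cheap guard. The sketch below is an assumption about how one might write output files, not the code we used: it opens the file in Python's exclusive-creation mode (`"x"`), which refuses to touch a file that already exists. The helper names, file name, and rows are all made up for illustration.

```python
import csv
import os
import tempfile


def write_table(path, header, rows, overwrite=False):
    """Write a tab-separated table, refusing to reuse an existing file.

    Mode "x" raises FileExistsError if the path already exists, so a
    forgotten file path fails loudly instead of silently growing the file.
    """
    mode = "w" if overwrite else "x"
    with open(path, mode, newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


def count_header_rows(path, header):
    """Count how many times the header row appears in the file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return sum(1 for row in reader if row == list(header))


# Made-up output file in a temporary directory.
workdir = tempfile.mkdtemp()
out_path = os.path.join(workdir, "genes_clean.tsv")
header = ["gene", "chromosome"]
rows = [["TERT", "5"], ["APC", "5"]]

write_table(out_path, header, rows)
try:
    write_table(out_path, header, rows)  # the accidental re-run
    second_run_blocked = False
except FileExistsError:
    second_run_blocked = True

print("second run blocked:", second_run_blocked)
print("header rows in file:", count_header_rows(out_path, header))
```

Counting header rows is also a quick way to audit files that already exist: more than one header in a single file is a strong hint that something was appended by accident.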
How coders feel about logic errors.
Just two days later, while trying to load PubChem data into a database, a team member realized that the output data had been corrupted. If not for a backup that I saved, I would have spent two weeks rebuilding this file.
The second reason is...
5. Because you never know what you may find!
Even though Big Data can be a tricky endeavor, you can learn to love it. I love it through and through.
All of this work went toward an integrated hypergraph database that represents the genetic, pharmacologic, and phenotypic nature of thirty-three different types of cancer. With over two terabytes of data, represented as 500 million data points with one billion interconnections in the Neo4j database engine, word on the street says that this database may be one of the largest integrated biological databases ever. After playing with our beta-release database (with “only” one million data points), we at Systems Imagination are sure that, with some smart tools and tricks (including, but certainly not limited to, artificial intelligence), the insights will come flooding in.
It has been both rewarding and humbling to know that I have journeyed with Big Data and SII so intimately. In a few days, I will embark on a new journey: the first year of my college experience.
I will leave you with some pretty pictures of the interconnections that one can find in the complete database. If you find yourself savoring these images, like I do, then maybe Big Data is in your future.
A collection of 100 genes (green) located inside chromosome 5 (gray). Take note of the faintly visible lines between genes (left side).
A network of genes (green) and their interactions with gene ontologies.
- About The Cancer Genome Atlas. The National Institutes of Health. URL: https://cancergenome.nih.gov/abouttcga
- Want to learn more about TCGA? Check out their lay-audience infographic at this link: https://cancergenome.nih.gov/PublishedContent/Images/images/tcga-infographic-article.__v100154743.png
- PubChem. National Center for Biotechnology Information. URL: https://pubchemdocs.ncbi.nlm.nih.gov/about
- Hetionet. URL: https://het.io/
- Pharos. The National Institutes of Health. URL: https://pharos.nih.gov/idg/index
- Charles Babbage Quotes. Goodreads. URL: https://www.goodreads.com/author/quotes/538773.Charles_Babbage
October 8, 2018