Big Data Series Blog 1

Opt-in to add your information to our expanding data sets. This information is critical for scientists to advance research for our fellow man. Sound familiar?  
Requests like these are built into wearable devices, exercise apps, and kits that promote deep dives into family genealogy. Of course, most people want to do the right thing, and if their information can contribute to the greater good, why not participate? Their willingness to contribute generally pivots on the assumption of anonymity. Companies anticipate this concern and have cultivated a uniform response: the probability of a privacy breach in their big data sets is low because their de-identification techniques meet the highest compliance standards under current U.S. regulations (1).
The looming question remains: is that a high enough threshold to maintain perpetual anonymity in a world of rapidly advancing artificial intelligence? Short answer: no.
It is well established that reidentification can occur with social media data (2) and genetic data (3). These areas are prone to reidentification because of data sparsity, meaning "a large number of characteristics for each individual, which leads to a diversity of combinations in such a way that any particular combination of the data is identifying" (4). In other words, either large amounts of data are maintained per individual or unique sequences can be identified, and both create an easy map back to the original person (5).
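To see why sparsity is identifying, consider how quickly combinations of ordinary attributes become unique. The sketch below uses a tiny, entirely synthetic set of records (the attribute choices are illustrative assumptions, not drawn from any real dataset) and measures what fraction of records are singled out by a given combination of attributes:

```python
from collections import Counter

# Hypothetical, synthetic records: (ZIP prefix, birth year, sex).
# None of these values come from a real dataset.
records = [
    ("021", 1985, "F"),
    ("021", 1985, "F"),  # a duplicate: not identifying on its own
    ("021", 1990, "M"),
    ("100", 1985, "F"),
    ("100", 1990, "M"),
    ("945", 1972, "F"),
]

def unique_fraction(rows):
    """Fraction of rows whose attribute combination appears exactly once."""
    counts = Counter(rows)
    return sum(1 for r in rows if counts[r] == 1) / len(rows)

# One attribute alone singles out few people; the combination of all
# three singles out most of them.
print(unique_fraction([(z,) for z, _, _ in records]))  # ZIP prefix only -> 1/6
print(unique_fraction(records))                        # full combination -> 4/6
```

The more characteristics a dataset records per person, the larger the share of people whose particular combination appears only once, which is exactly the sparsity effect the quoted definition describes.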
Historically, other kinds of health information, such as wearable data, have posed a different challenge because those data sets exhibit high variability (6). Pointing to this variability, coupled with lower data sparsity, companies have touted sharing health information as an essentially no-risk scenario; this is simply false.
Machine learning is advancing rapidly and will be able to reidentify information faster and with fewer data points (7). We must recognize the reality of these technological advancements when assessing reidentification risk.
Multiple studies have evaluated the feasibility of using machine learning to reidentify individuals in datasets with varying degrees of data sparsity. For example, in a 2018 study published in JAMA Network Open, researchers tested the accuracy of random forest and linear SVM algorithms at reidentifying individuals from datasets consisting of demographics from the National Health and Nutrition Examination Survey (2003-2004 and 2005-2006) paired with 20-minute aggregated physical activity data (8).
Each algorithm was run on both survey cycles' demographics paired with the 20-minute aggregated physical activity data. The random forest algorithm successfully reidentified 93.8% of adults and 85.5% of children (2003-2004), and 94.9% of adults and 87.5% of children (2005-2006). The linear SVM algorithm successfully reidentified 85.6% of adults and 69.8% of children (2003-2004), and 84.8% of adults and 67.2% of children (2005-2006). These high success rates undercut the low-risk mantra proclaimed by companies reliant on big data.
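The basic mechanics of this kind of attack can be framed as ordinary classification: train a model on activity records whose owners are known, then ask it to name the owner of a supposedly anonymous record. The sketch below is a minimal illustration on synthetic data, not the study's actual pipeline; the profile construction, noise level, and feature layout (72 twenty-minute windows per day) are all assumptions made for the example.

```python
# Minimal re-identification sketch: synthetic data only, loosely modeled
# on the idea of matching aggregated activity records back to individuals.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_people, n_windows = 50, 72  # 72 twenty-minute windows = one day

# Each synthetic person gets a characteristic daily activity profile;
# a "record" is a noisy observation of that profile on some day.
profiles = rng.normal(size=(n_people, n_windows))

def observe(day_noise=0.3):
    return profiles + rng.normal(scale=day_noise, size=profiles.shape)

X_train = np.vstack([observe() for _ in range(5)])  # 5 days of known records
y_train = np.tile(np.arange(n_people), 5)           # known identities
X_test, y_test = observe(), np.arange(n_people)     # one "anonymous" day

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of records reidentified
print(f"reidentified {accuracy:.0%} of synthetic individuals")
```

Even this toy setup tends to match most records back to their owners, because each person's day-to-day activity pattern is far more similar to itself than to anyone else's, which is the same regularity the study's classifiers exploited at scale.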
Eventually, the United States will need to reckon with an inevitable cultural crossroads. Do we redefine what is considered taboo regarding medical history? Or do we remain steadfast in our taboos and develop stronger encryption along with amended or new policies that significantly minimize the potential for a data privacy breach? Time will tell.

(1) This blog does not focus on issues with current policies, such as HIPAA. A few interesting articles are cited here: Scarola E, Shah A. Erosion of Anonymity: Mitigating the Risk of Re-Identification of De-identified Health Data. Health Law Advisor – Thought Leaders on Laws and Regulations Affecting Health Care and Life Sciences. February 28, 2019. Accessed September 21, 2020; United States Department of Health & Human Services. Standards for Privacy of Individually Identifiable Health Information. July 26, 2013. Accessed September 21, 2020; Malin B, Benitez K, Masys D. Never too old for anonymity: a statistical standard for demographic data sharing via the HIPAA privacy rule. J Am Med Inform Assoc. 2011;18:3-10.
(2) Narayanan A, Shmatikov V. De-anonymizing social networks. In: SP '08 Proceedings of the 2008 IEEE Symposium on Security and Privacy. Washington, DC: IEEE Computer Society; 2009:173-187.
(3) Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. Identifying personal genomes by surname inference. Science. 2013;339(6117):321-324. doi:10.1126/science.1229566.
(4) Na L, Yang C, Lo C, Zhao F, Fukuoka Y, Aswani A. Feasibility of Reidentifying Individuals in Large National Physical Activity Data Sets From Which Protected Health Information Has Been Removed With Use of Machine Learning. JAMA Netw Open. 2018;1(8):e186040. doi:10.1001/jamanetworkopen.2018.6040.
(5) Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. Identifying personal genomes by surname inference. Science. 2013;339(6117):321-324. doi:10.1126/science.1229566.
(6) Na L, Yang C, Lo C, Zhao F, Fukuoka Y, Aswani A. Feasibility of Reidentifying Individuals in Large National Physical Activity Data Sets From Which Protected Health Information Has Been Removed With Use of Machine Learning. JAMA Netw Open. 2018;1(8):e186040. doi:10.1001/jamanetworkopen.2018.6040.
(7) Rocher L, Hendrickx JM, de Montjoye YA. Estimating the success of re-identifications in incomplete datasets using generative models. Nat Commun. 2019;10:3069.
(8) The study discusses other measures of aggregated physical activity, but its primary conclusions revolve around 20-minute aggregated physical activity.

Posted on October 5, 2020





