On Big Data & Not Being Evil

By John Speakman, Senior Director, Research Information Technology, NYU Langone Medical Center


In 2013, researchers from the Massachusetts Institute of Technology described in the journal Science how they had reidentified individuals in a “deidentified” genomic dataset published by the National Institutes of Health (NIH), using only publicly accessible data on the Internet. This and other well-publicized instances suggest that “deidentified” data may never truly be so. Currently, NIH grant solicitations require applicants to attest that data will be “fully deidentified.” But the power of combining datasets (Big Data) has gotten ahead of the rules.

The first wave of Big Data hit biomedicine in 2003 with the first sequencing of the human genome, which had taken a decade to complete and cost over a billion dollars. Farsighted individuals, however, were already warning of a “tsunami of data” bearing down upon ill-equipped infrastructures. The cost of sequencing has dropped exponentially in recent years, outpacing Moore’s law. In January 2014, the instrument manufacturer Illumina announced a new sequencer that it claimed could sequence human genomes at a cost of $1,000 each, with each genome arriving as about 250 gigabytes of raw data, storage and analysis not included. Pundits argued over the semantics of this claim, but as those of us in academic medical centers had known for some time, the tsunami is upon us. Any sequencer assumes that the user already has a robust IT infrastructure of storage, high-performance computing resources, and network bandwidth. Storage in particular is critical: discarding data once analyzed is not an option, because funding agencies and medical journals require researchers to make it available on request. Yet indefinitely allocating petabytes of storage to rarely accessed data is not a palatable option either.

Biomedicine is not waiting for us to find a resolution; it is racing ahead. Organizations such as the Institute for Systems Genetics at NYU Langone Medical Center are establishing biology production lines with the potential to generate petabyte-scale volumes of new data annually. Furthermore, healthcare is preparing for whole-genome sequencing, previously a research activity, to become a routine part of patient care. As a result, genome sequence data will be part of every patient record within the next few years.