In 2003, the Human Genome Project culminated in the successful sequencing of the more than 3 billion base pairs making up a single human genome, costing an international consortium of researchers 13 years and $3 billion to complete.
Today, similar sequencing can happen in weeks for about $4,000. But soon, science will realize the hallowed “$1,000 genome,” the symbolic marker of entry into the era of personalized medicine, in which people have their DNA sequenced to help tailor medical care to their specific needs.
Such great progress comes at a cost, however – the sheer amount of space required to store and process all that information. The amount of raw data required to obtain a single human genome can consume many terabytes of storage. There is now so much data that medical science is having a hard time retaining it, much less processing it.
But a team led by Stanford electrical engineers has compressed a completely sequenced human genome to just 2.5 megabytes – small enough to attach to an email. The engineers used what is known as reference-based compression, relying on a human genome sequence that is already known and available. Their compression has improved on the previous record by 37 percent. The genome the team compressed was that of James Watson, who co-discovered the structure of DNA more than 60 years ago.
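The idea behind reference-based compression can be illustrated with a minimal Python sketch. It is not the Stanford team's algorithm: it assumes the new genome differs from the reference only by single-base substitutions (real genomes also contain insertions and deletions), and the function names and toy sequences are hypothetical.

```python
def diff_against_reference(reference: str, target: str) -> list[tuple[int, str]]:
    """Record only the positions where the target differs from the reference.

    Assumption for this sketch: both sequences have the same length,
    so every difference is a single-base substitution.
    """
    if len(reference) != len(target):
        raise ValueError("this sketch handles substitutions only")
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]


def rebuild_from_reference(reference: str, edits: list[tuple[int, str]]) -> str:
    """Recover the target exactly by applying the stored edits to the reference."""
    bases = list(reference)
    for position, base in edits:
        bases[position] = base
    return "".join(bases)


# Toy sequences; a real genome has about 3 billion base pairs.
reference = "ACGTACGTACGTACGT"
target = "ACGTACCTACGTACGA"

edits = diff_against_reference(reference, target)
restored = rebuild_from_reference(reference, edits)
print(edits)
print(restored == target)
```

Because any two human genomes are nearly identical, the list of edits is far shorter than the sequence itself, and anyone who already holds the reference can rebuild the original without loss.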
“On the surface, this might not seem like a problem for electrical engineers,” said Tsachy Weissman, an associate professor of Electrical Engineering. “But our work in information theory is guiding the development of new and improved ways to model and compress the incredibly voluminous genomic data the world is amassing.” In addition to Weissman, the team included Golan Yona, a senior research engineer in Electrical Engineering, and Dmitri Pavlichin, a post-doctoral scholar in Applied Physics and Electrical Engineering.
Genomic data compression is necessary for efficient storage, of course, but also for swiftly transferring and communicating data for various post-sequencing applications and analysis that will divine from the genetic information what diseases a person might be suffering from, be susceptible to or be in the process of developing. The analysis also helps determine what therapies and medications might be best suited to a particular person at a particular time. These are the promises of personalized medicine that effective genomic data compression would enable.
The need to retain as much detail as possible of the raw measurements is particularly acute with genomic data. Imagine discarding an important mutation or, worse, introducing a non-existent one into a patient’s DNA, both of which might adversely influence any number of crucial medical decisions. This would seem to suggest that the only acceptable compression mode is “lossless,” in which all of the data is perfectly retained in the decompression, in contrast to “lossy” compression, in which some of the data is lost or distorted in the decompression.
In music and video compression, for instance, lossy compression systems are able to achieve considerable data size reductions by discarding parts of the signals to which the human ear or eye is not sensitive. Such loss of information is more than offset by the convenience of being able to carry or stream entire libraries of music to virtually any location in seconds.
Traditionally, the tradeoff in data compression has been between high-quality-but-large files and smaller-but-distorted files. Weissman and his group have recently been working toward disrupting this tradeoff: shrinking file sizes while maintaining or even boosting the data integrity.
“With genomic data, accuracy really matters. But, then again, so does storage and processing time, so we are searching for a solution,” Weissman said.
To achieve even more dramatic levels of compression, Weissman, Idoia Ochoa, a graduate student in the department of Electrical Engineering, and Mikel Hernaez, a postdoctoral fellow in the same department, are pursuing collaborations with researchers from the medical school focused on compressing what are known as the “quality scores” that accompany DNA sequences. Today’s high-speed genetic sequencers provide readouts consisting of segments of a genome’s 3 billion base pairs and include corresponding sequences assessing the reliability of these reads. A quality score is assigned to every base pair in every read, conveying the likelihood that it is correct.
Quality scores are useful for boosting the reliability of the genome assembly process and are particularly important for genotyping, the process of determining differences in the genetic make-up of an individual relative to that of a reference sequence. But storing quality scores significantly increases the file size, often taking the majority of the pre-compression disk space of the raw genome data. Compression of the quality scores can significantly reduce file size as well as speed the transmission, processing and analysis of the data.
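The article does not say how quality scores are encoded. One widely used convention, assumed here purely for illustration, is the Phred scale, in which a score Q corresponds to an error probability of 10^(-Q/10), stored as one ASCII character per base with an offset of 33. The short Python sketch below decodes a made-up quality string under that assumption; the function names are hypothetical.

```python
def decode_quality_string(quality_line: str, offset: int = 33) -> list[int]:
    """Turn one character per base into an integer score.

    Assumption: the common "Phred+33" ASCII encoding; the article does not
    specify which encoding the sequencers in question use.
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(score: int) -> float:
    """Phred convention: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# A made-up quality string for a five-base read.
scores = decode_quality_string("II5+!")
for score in scores:
    print(score, error_probability(score))
```

Since every base in every read carries its own score, the scores occupy at least as many characters as the bases themselves, and they are far less regular, which is one way to see why they can dominate file size.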
In recording quality scores, DNA sequencers introduce all sorts of imperfections that are collectively considered “noise.” Different sequencers have different noise characteristics. Weissman and his team are developing theory and algorithms for processing the quality scores in a way that reduces the noise and at the same time results in significant compression. Counterintuitive as it might sound at first, they are using lossy compression as a mechanism not only for considerable reduction in storage requirements, but also for enhancing the integrity of the data.
“But, in fact, it is quite intuitive,” Weissman said. “Lossy compression, when done right, forces the compressor to discard the part of the signal which is hardest to compress, namely, the noise.”
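A rough sense of how lossy compression shrinks quality scores can be had from the Python sketch below. It is not the method Weissman's team is developing: it uses plain uniform quantization (rounding each score to the midpoint of a fixed-width bin), synthetic random scores rather than real sequencer output, and the general-purpose zlib compressor. It illustrates only the storage saving from discarding fine detail, not the noise reduction described above.

```python
import random
import zlib


def quantize(scores: list[int], bin_width: int = 8) -> list[int]:
    """Map each score to the midpoint of its bin (uniform quantization)."""
    return [(q // bin_width) * bin_width + bin_width // 2 for q in scores]


def compressed_size(scores: list[int]) -> int:
    """Bytes needed after lossless compression with zlib at its highest level."""
    return len(zlib.compress(bytes(scores), 9))


# Synthetic stand-in for real quality scores (an assumption of this sketch).
rng = random.Random(0)  # fixed seed so the run is repeatable
raw_scores = [rng.randint(20, 40) for _ in range(10_000)]
coarse_scores = quantize(raw_scores)

raw_bytes = compressed_size(raw_scores)
coarse_bytes = compressed_size(coarse_scores)
print(raw_bytes, coarse_bytes)
```

With a bin width of 8, no score moves by more than 4, yet the coarser sequence has far fewer distinct values and so compresses into noticeably fewer bytes.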
There is still much more to be done, and there is potential for significant improvement in lossless compression of genomic data. DNA of individuals in a population, when considered collectively, is in some ways similar to a video file. Each individual’s genome is akin to a frame of a movie. A new frame can be represented very succinctly using previous frames as references.
Fortunately for the human race, there are growing databases of known genomes. One of these, known as the 1000 Genomes Project, has led to a database containing the full genomes of more than 1,000 people. By using the database as a reference, Weissman and collaborator Tom Courtade, an assistant professor of Electrical Engineering at the University of California, Berkeley, are working with students from both institutions to achieve significant further lossless compression. They’ve already reduced the size of the file needed to represent a new human genome that is not in the database by another order of magnitude, to about 200 kilobytes.
“At this point we feel like we might be approaching the fundamental limits on compression of an individual’s genome given a reference database consisting of genomes of many other individuals,” Weissman said. “Whether or not we are truly approaching this limit is a question we are tackling. The answer will be exciting either way.”