Making whole exome and whole genome sequencing data files really, really, ridiculously smaller.

Recently I’ve been doing some genomics work, using whole genome sequencing (WGS) and whole exome sequencing (WES) data provided by the Alzheimer’s Disease Sequencing Project (ADSP). The goal of this project is to identify DNA variants contributing to early and late onset dementia, and to isolate potentially neuroprotective rare variants and single nucleotide polymorphisms.

I’ve done a bit of genomics analysis in the past, using pre-processed datasets. However, this is my first real foray into the thick of genomics big data, and I’ve made a few observations I’d like to share, including how we shrank a 70 GB dataset into a 900 MB text file without losing any information. I will get to those in a minute. First, I just want to say that more people need to jump onto the Alzheimer’s genomics bandwagon, which is currently passing through the wild west. It is nowhere near the state of cancer genomics. You may be thinking: ‘Rightly so. Cancer is a much bigger problem’; and it probably is. But to summarize the AD problem:

  • If you live long enough, you will get Alzheimer’s

On the plus side, there now exists a pretty massive AD sequencing dataset. It includes the genomes of over 5000 Alzheimer’s disease (AD) patients and 8000 age-matched (or older) controls. The data is housed at The National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS). To get this data one needs an eRA Commons login, a dbGaP repository key (files are encrypted), and the SRA Toolkit to retrieve and decrypt files from dbGaP. The data comes in two parts:

  • Genotype Data: raw sequencing read and mapping BAM files stored in SRA format, along with QC genotypes in PLINK and variant call format (VCF).
  • Phenotype Data: phenotypic information from consented study subjects and family members, and pedigree records of genealogical relationships.

Based on the phenotype data, we identified a handful of individuals over 90 years old with known risk factors for AD, who somehow escaped developing any symptoms of dementia. For example, on chromosome 19 there is a gene called APOE; the ε4 allele of this gene is linked to late-onset AD. APOE codes for apolipoprotein E, a cholesterol transporter that helps clear amyloid deposits in the brain. Reduced function of this gene results in beta-amyloid plaque build-up, a marker highly correlated with AD. The ε4 allele is a well-documented risk factor for AD, and the incidence of AD is 10-fold higher in patients with two copies of the variant, APOE 4/4, than in those with a single copy. Indeed, this is reflected in the ADSP dataset. However, there is a small cohort of individuals over 90 years old who had the APOE 4/4 genotype with no psychological or physical symptoms, confirmed by interview and autopsy.

  • What protected these people with the APOE 4/4 genotype from getting Alzheimer’s Disease?

Perhaps there’s a common genetic thread. Next step: acquire gene sequencing data from ADSP…

I just need to download a mere 190 GB of text data. That’s absurd. Mind you, these are just the files for single nucleotide polymorphisms and rare variants. I downloaded the smallest one to find out what on earth is going on in there.


There is a variant in every row and a person in every column – yeep. It doesn’t make sense to format the data this way, given rare variants are rare. Also, I don’t care when someone doesn’t have a variant (0/0) or has a low quality read (./.) at a particular location (those calls might matter for some back-alley pileup statistics, but not here). I only want to know if a person did have a variant at a particular location.
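To get a feel for how sparse a matrix like this is, one can tally the genotype calls before converting anything. Below is a rough Python sketch, not the script used here; it assumes tab-separated VCF rows with sample columns starting at column 10 and the genotype as the first colon-separated subfield. The function name `tally_genotypes` and the toy rows (positions, alleles, calls) are made up for illustration.

```python
from collections import Counter

def tally_genotypes(lines):
    """Count how often each genotype call appears across all samples."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        for call in fields[9:]:  # sample columns start at column 10
            counts[call.split(":", 1)[0]] += 1  # keep only the GT subfield
    return counts

# Made-up rows: two sites, four samples each.
toy_rows = [
    "19\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t./.\t0/0",
    "19\t2000\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\t0/0\t0/0\t0/0",
]

counts = tally_genotypes(toy_rows)
# Calls that are neither homozygous reference nor missing.
informative = sum(n for gt, n in counts.items() if gt not in ("0/0", "./."))
print(counts)
print(informative / sum(counts.values()))
```

In this toy input only one call out of eight is an actual variant; every other cell is a 0/0 or a ./. that takes up space in the file.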

Let’s fire up Perl and put those instances in another file. I think the best way to format this data is to put everyone who had a variant at a given location on the same row.
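The conversion was done in Perl; what follows is only a minimal Python sketch of the same idea, under some assumptions. It assumes a standard tab-separated VCF with a `#CHROM` header line, treats 0/0 and ./. (and their phased forms) as calls to drop, and keeps only the chromosome, position, reference and alternate alleles for each site. The function name `sparsify_vcf`, the `sample:genotype` output layout, and the example rows are hypothetical, chosen just to illustrate the format.

```python
import io

# Calls treated as "nothing to record here" (assumption based on the text).
SKIPPED_CALLS = {"0/0", "./.", "0|0", ".|."}

def sparsify_vcf(lines):
    """Yield one row per variant, listing only the samples that carry it."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information headers
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs start at column 10
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        carriers = []
        for sample, call in zip(samples, fields[9:]):
            genotype = call.split(":", 1)[0]  # GT is the first subfield
            if genotype not in SKIPPED_CALLS:
                carriers.append(f"{sample}:{genotype}")
        if carriers:  # sites with no carriers are left out in this sketch
            yield "\t".join([chrom, pos, ref, alt] + carriers)

# Made-up example: three sites, three samples.
example = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n"
    "19\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t./.\n"
    "19\t2000\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\t0/0\t0/0\n"
    "19\t3000\t.\tG\tA\t.\tPASS\t.\tGT\t1/1\t0/0\t0/1\n"
)

sparse_rows = list(sparsify_vcf(example))
for row in sparse_rows:
    print(row)
```

Each output row is as long as the number of carriers rather than the number of people in the study, which is where the space saving comes from when most cells are 0/0. Note that this sketch discards the ID, QUAL, FILTER and INFO columns, so those would need to be carried along if they matter downstream.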

Let’s see if this lossless conversion saves any space…

ALL_SPARSE_FILES_c1.tar.gz 19-Jan-2017 23:27 2.9G
ALL_SPARSE_FILES_c2.tar.gz 19-Jan-2017 20:50 701M
ALL_SPARSE_FILES_c3.tar.gz 19-Jan-2017 20:55 280M
ALL_SPARSE_FILES_c4.tar.gz 19-Jan-2017 20:56 99M
ALL_SPARSE_FILES_c5.tar.gz 19-Jan-2017 20:57 613M
ALL_SPARSE_FILES_c6.tar.gz 19-Jan-2017 21:01 334M

Wow, that 70 GB file is now 2.9 GB.

More to come…