Thoughts on Special Relativity

If I pursue a beam of light with the velocity c (velocity of light in a vacuum), I should observe such a beam of light as a spatially oscillatory electromagnetic field at rest. However, there seems to be no such thing, whether on the basis of experience or according to Maxwell’s equations. From the very beginning it appeared to me intuitively clear that, judged from the standpoint of such an observer, everything would have to happen according to the same laws as for an observer who, relative to the earth, was at rest. For how, otherwise, should the first observer know, i.e., be able to determine, that he is in a state of fast uniform motion? One sees that in this paradox the germ of the special relativity theory is already contained. Today everyone knows, of course, that all attempts to clarify this paradox satisfactorily were condemned to failure as long as the axiom of the absolute character of time, viz., of simultaneity, unrecognizedly was anchored in the unconscious. Clearly to recognize this axiom and its arbitrary character really implies already the solution to the problem.

…thought 16-year-old Albert Einstein as he scribbled it down in his notebook.

Making whole exome and whole genome sequencing data files really, really, ridiculously smaller.

Recently I’ve been doing some genomics work, using whole genome sequencing (WGS) and whole exome sequencing (WES) data provided by the Alzheimer’s Disease Sequencing Project (ADSP). The goal of this project is to identify DNA variants contributing to early- and late-onset dementia, and to isolate potentially neuroprotective rare variants and single nucleotide polymorphisms.

I’ve done a bit of genomics analysis in the past, working from pre-processed datasets. However, this is my first real foray into the thick of genomics big data, and I’ve made a few observations I’d like to share, including how we shrunk a 70 GB dataset into a 900 MB text file without losing any information. I will get to those in a minute. First, I just want to say that more people need to jump onto the Alzheimer’s genomics bandwagon, which is currently passing through the wild west. It is nowhere near the state of cancer genomics. You may be thinking, ‘rightly so; cancer is a much bigger problem’, and it probably is. But to summarize the AD problem:

  • If you live long enough, you will get Alzheimer’s

On the plus side, there now exists a pretty massive AD sequencing dataset. It includes the genomes of over 5000 Alzheimer’s disease (AD) patients and 8000 age-matched (or older) controls. The data is housed at the National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS). To get this data one needs an eRA Commons login, a dbGaP repository key (files are encrypted), and the SRA Toolkit to retrieve and decrypt files from dbGaP. The dataset includes:

  • Genotype Data: raw sequencing reads and mapped BAM files stored in SRA format, along with quality-controlled genotypes in PLINK and variant call format (VCF) files.
  • Phenotype Data: phenotypic information from consented study subjects and family members, and pedigree records of genealogical relationships.

Based on the phenotype data, we identified a handful of individuals over 90 years old with known risk factors for AD who somehow escaped developing any symptoms of dementia. For example, on chromosome 19 there is a gene called APOE; the ε4 allele of this gene is linked to late-onset AD. APOE codes for apolipoprotein E, a cholesterol transporter that also helps aggregate and clear amyloid deposits in the brain. Reduced function of this protein results in beta-amyloid plaque build-up, a marker highly correlated with AD. The ε4 allele is a well-documented risk factor for AD, and the incidence of AD is 10-fold higher in people carrying two copies of the allele (APOE 4/4) than in those carrying a single copy. Indeed, this is reflected in the ADSP dataset. However, there is a small cohort of individuals over 90 years old who carried the APOE 4/4 genotype yet showed no psychological or physical symptoms, confirmed by interview and autopsy.

  • What protected these people with the APOE 4/4 genotype from getting Alzheimer’s disease?

Perhaps there’s a common genetic thread. Next step: acquire gene sequencing data from ADSP…

I just need to download a mere 190 GB of text data. That’s absurd. Mind you, these are just the files for single nucleotide polymorphisms and rare variants. I downloaded the smallest one to find out what on earth is going on in there.


There is a variant in every row and a person in every column – yeep. It doesn’t make sense to format the data this way, given that rare variants are rare. Also, I don’t care when someone doesn’t have a variant (0/0) or has a low-quality read (./.) at a particular location (except maybe for some back-alley pileup statistics). I only want to know if a person did have a variant at a particular location.
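
For context, the body of one of these VCF files looks roughly like this; a handful of site columns, then one genotype column per person. The positions, sample IDs, and genotype values below are made up for illustration:

#CHROM  POS     ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  SAMPLE_0001  SAMPLE_0002  SAMPLE_0003
19      123456  .   T    C    50    PASS    .     GT:DP   0/0:31       0/1:28       ./.:2
19      123789  .   G    A    48    PASS    .     GT:DP   0/0:25       0/0:30       1/1:27

With more than ten thousand genotype columns per row, and rare variants by definition present in only a handful of people, nearly every cell is a 0/0 or a ./. entry.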

Let’s fire up perl and put those instances in another file. I think the best way to format this data is to put everyone who had a variant on the same row.
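
Something along these lines would do it. This is just a minimal sketch of the idea, assuming the VCF has already been decrypted and decompressed to plain text; the script name, the sample=genotype notation, and the output layout are placeholders rather than the exact code we ran.

#!/usr/bin/perl
# sparse_vcf.pl -- hypothetical sketch: keep only the carriers of each variant
# Usage: perl sparse_vcf.pl chr19.vcf > chr19.sparse.txt
use strict;
use warnings;

my @samples;    # sample IDs, taken from the #CHROM header line

while (my $line = <>) {
    chomp $line;

    # Header lines: remember the sample column names, skip everything else.
    if ($line =~ /^#/) {
        if ($line =~ /^#CHROM/) {
            my @cols = split /\t/, $line;
            @samples = @cols[9 .. $#cols];    # sample IDs start at the 10th column
        }
        next;
    }

    my @cols = split /\t/, $line;
    my ($chrom, $pos, $ref, $alt) = @cols[0, 1, 3, 4];

    # Collect only the people who actually carry an ALT allele.
    my @carriers;
    for my $i (9 .. $#cols) {
        my ($gt) = split /:/, $cols[$i];    # genotype is the first FORMAT field
        next unless $gt =~ /[1-9]/;         # skip 0/0, ./. and other non-carriers
        push @carriers, "$samples[$i - 9]=$gt";
    }

    # One line per variant: location, alleles, and everyone who has the variant.
    print join("\t", $chrom, $pos, $ref, $alt, @carriers), "\n" if @carriers;
}

Each output line holds the variant’s location and alleles followed only by the people who carry it, so the millions of uninformative 0/0 and ./. cells never make it to disk.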

Let’s see if this lossless conversion saves any space…

ALL_SPARSE_FILES_c1.tar.gz 19-Jan-2017 23:27 2.9G
ALL_SPARSE_FILES_c2.tar.gz 19-Jan-2017 20:50 701M
ALL_SPARSE_FILES_c3.tar.gz 19-Jan-2017 20:55 280M
ALL_SPARSE_FILES_c4.tar.gz 19-Jan-2017 20:56 99M
ALL_SPARSE_FILES_c5.tar.gz 19-Jan-2017 20:57 613M
ALL_SPARSE_FILES_c6.tar.gz 19-Jan-2017 21:01 334M

Wow, that 70 GB file is now 2.9 GB.

More to come…

Do you even design?

In The Design of Everyday Things, Don Norman focuses on how people ‘do things’: how they interact with tools and technology, and how they evaluate their actions. He also highlights the role of the designer as someone who must consider the psychology of human-device interactions. Don suggests that a well-designed product is intuitive to use (i.e., without ever having used such a device, a person will have some intuition as to how it functions) and easy to evaluate (i.e., manipulations performed on/with the device result in predictable outcomes). Accordingly, he states that the role of the designer is to help people bridge the “gulfs” between execution and evaluation (execution: where they try to figure out how the device operates; evaluation: where they try to figure out what happened).

Let’s say you’re a designer tasked with sending a fire-producing device back in time, for use by prehistoric man. You’ve studied the following contemporary devices used to produce fire:

(image: contemporary fire-producing devices)

Which of these devices do you send back in time? What improvements/changes would you make to one or more of these items?

What design elements of these items will help bridge the Gulf of Execution?

(N.B.: The Gulf of Execution reflects the degree of difficulty in understanding how a device functions. We bridge the Gulf of Execution through the use of signifiers, constraints, mappings, and a conceptual model.)

What design elements of these items will help bridge the Gulf of Evaluation?

(N.B.: The Gulf of Evaluation reflects the amount of effort that the person must make to interpret the physical state of the device and to determine how well the expectations and intentions have been met. We bridge the Gulf of Evaluation through the use of feedback and a conceptual model.)

The caveman has just stumbled upon your cache of futuristic-looking items; considering Don’s 7-Stages of Action and 3-Stages of Processing…

(image: the Gulfs of Execution and Evaluation)

…Do you expect a problematic ‘bottleneck’ or ‘gulf’ arising at a particular stage?

Finally, which one would you prefer using? If there is a reasonable expectation that someone will be around who is knowledgeable about the device, does that change anything from a design perspective? If something is incredibly easy to use, but only after explicit instruction, is it still a bad design?

(post by Brad Monk)