Favorite News Sources Across the Political Aisle

What are the most liberal-leaning and conservative-leaning news outlets?

Sometime last presidential election season I had this very thought. All kinds of dirt was being thrown around about both candidates; however, lots of it was coming from news sources I had never heard of. I probably still wouldn’t have if Twitter and Reddit didn’t exist, providing these outlandish stories a platform for mass exposure (and mass outrage).

So I could never really tell if what I was reading was from a legit source, completely spun, or flat-out fake. For example, I would see a headline like:

FBI Arrests Hillary on Corruption Charges

Linking to a news outlet calling themselves “The Discovery Examiner Guardian”, or something. I thought, well… if the DEG is like the NYT, Hil-dawg is probably in deep shit. If the DEG is like Breitbart then I’m 99% sure the opposite is true.

So who the F are these guys? Like, in general.

So I googled: What are the most liberal-leaning and conservative-leaning news outlets? To my dismay I found nothing satisfying. No ranking lists curated by experts, no data-driven politico-meter, nothing really. Just a bunch of anecdotes from internet people complaining that so-and-so news is, like, totally biased.

I guess it makes sense given that any corporation attempting to appear as “the news” is trying to woo as many people as possible into believing they are thee most credible straight-shootin’, just-the-facts-you-decide, fair-and-balanced, no-underlying-agenda organization around. So, as nice as it would be, of course Fox News isn’t going to post on their homepage something like, “We are an 8/10 on the left/right political spectrum”.

So I had an idea… Reddit created this mess; let’s see if they can help fix it.

Using the Reddit API (via PRAW), I wrote a Python script to identify the favorite news sources of two subreddit communities on opposite ends of the political aisle.

This bot scraped the URL from each of the top daily submissions to the main pro-Trump and anti-Trump subreddit communities, essentially determining these subreddits’ favorite news outlets. Nota bene: the validity of these data as a litmus test for liberal-leaning and conservative-leaning news rests on the assumption that people generally prefer to post and upvote stories that align with and support their personal world view.
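The counting step can be sketched roughly as follows. This is not the script from the gist, just a minimal illustration: the function name `top_domains` and the sample URLs are made up, and it assumes you already have a list of submission URLs (for example, the `url` attribute of each submission PRAW returns).

```python
from collections import Counter
from urllib.parse import urlparse


def top_domains(urls, n=10):
    """Tally how often each domain appears in a list of submission URLs."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if not host:
            continue  # skip entries with no host (e.g. malformed links)
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one outlet
        counts[host] += 1
    return counts.most_common(n)


# Hypothetical sample; real input would be the URLs pulled from each subreddit.
sample_urls = [
    "https://www.breitbart.com/politics/story-1",
    "https://www.breitbart.com/politics/story-2",
    "https://www.nytimes.com/2017/01/01/us/story.html",
]
print(top_domains(sample_urls))
```

Run once per subreddit, the two resulting lists are what the panels below compare.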

Without further ado…

UPPER PANEL: pro-Trump subreddit The_Donald
LOWER PANEL: anti-Trump subreddit EnoughTrumpSpam

I cross-posted this project on Reddit’s Data is Beautiful, where a Googler, Felipe Hoffa, saw my post and took it to the next level. Using Data Studio he expanded my original idea to all of Reddit, and made it interactive. It’s something really worth playing around with for a few minutes. So go check it out!

You can grab the code I used from this gist (you really don’t want it though, it’s awful)

Thoughts on Special Relativity

If I pursue a beam of light with the velocity c (velocity of light in a vacuum), I should observe such a beam of light as a spatially oscillatory electromagnetic field at rest. However, there seems to be no such thing, whether on the basis of experience or according to Maxwell’s equations. From the very beginning it appeared to me intuitively clear that, judged from the standpoint of such an observer, everything would have to happen according to the same laws as for an observer who, relative to the earth, was at rest. For how, otherwise, should the first observer know, i.e., be able to determine, that he is in a state of fast uniform motion? One sees that in this paradox the germ of the special relativity theory is already contained. Today everyone knows, of course, that all attempts to clarify this paradox satisfactorily were condemned to failure as long as the axiom of the absolute character of time, viz., of simultaneity, unrecognizedly was anchored in the unconscious. Clearly to recognize this axiom and its arbitrary character really implies already the solution to the problem.

…thought 16-year-old Albert Einstein as he scribbled it down in his notebook.

Making whole exome and whole genome sequencing data files really, really, ridiculously smaller.

Recently I’ve been doing some genomics work, using whole genome sequencing (WGS) and whole exome sequencing (WES) data provided by the Alzheimer’s Disease Sequencing Project (ADSP). The goal of this project is to identify DNA variants contributing to early and late onset dementia, and to isolate potentially neuroprotective rare variants and single nucleotide polymorphisms.

I’ve done a bit of genomics analysis in the past, using pre-processed datasets. However, this is my first real foray into the thick of genomics big data, and I’ve made a few observations I’d like to share, including how we shrank a 70 GB dataset into a 900 MB text file, without losing any information. I will get to those in a minute. First, I just want to say that more people need to jump onto the Alzheimer’s genomics bandwagon, which is currently passing through the wild west. It is nowhere near the state of cancer genomics. You may be thinking – ‘rightly so. cancer is a much bigger problem’; and it probably is. But to summarize the AD problem:

  • If you live long enough, you will get Alzheimer’s

On the plus side, there now exists a pretty massive AD sequencing dataset. It includes the genomes of over 5000 Alzheimer’s disease (AD) patients and 8000 age-matched (or older) controls. The data is housed at The National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS). To get this data one needs an eRA Commons login, a dbGaP repository key (files are encrypted), and the SRA Toolkit to retrieve and decrypt files from dbGaP.

  • Genotype Data: raw sequencing read and mapping BAM files stored in SRA format, along with QC genotypes in PLINK and variant call format (VCF).
  • Phenotype Data: phenotypic information from consented study subjects and family members, and pedigree records of genealogical relationships.

Based on the phenotype data, we identified a handful of individuals over 90 years old with known risk factors for AD, who somehow escaped developing any symptoms of dementia. For example, on chromosome 19 there is a gene called APOE; the ε4 allele of this gene is linked to late-onset AD. APOE codes for apolipoprotein E, a cholesterol transporter that aggregates and clears amyloid deposits in the brain. Reduced function of this gene results in beta-amyloid plaque build-up, a marker highly correlated with AD. The ε4 allele is a well-documented risk factor for AD, and the incidence of AD is 10-fold higher in patients with the double variant, APOE 4/4, than in those with the single variant. Indeed, this is reflected in the ADSP dataset. However, there is a small cohort of individuals over 90 years old who had the APOE 4/4 genotype with no psychological or physical symptoms, confirmed by interview and autopsy.

  • What protected these people with the APOE 4/4 genotype from getting Alzheimer’s Disease?

Perhaps there’s a common genetic thread. Next step: acquire gene sequencing data from ADSP…

I just need to download a mere 190 GB of text data. That’s absurd. Mind you, these are just the files for single nucleotide polymorphisms and rare variants. I downloaded the smallest one to find out what on earth is going on in there.


There is a variant in every row and a person in every column – yeep. It doesn’t make sense to format the data this way, given rare variants are rare. Also, I don’t care when someone doesn’t have a variant (0/0) or a low quality read (./.) at a particular location (maybe for some back-alley pileup statistics). I only want to know if a person did have a variant at a particular location.

Let’s fire up Perl and put those instances in another file. I think the best way to format this data is to put everyone who had a variant on the same row.
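The idea can be sketched like this. It is an illustration in Python rather than the Perl actually used, and it assumes a tab-delimited, VCF-style layout: a `#CHROM` header row with nine fixed columns followed by one column per person, and genotype calls such as `0/1` leading each cell. The function name `to_sparse`, the `sample=genotype` output format, and the example rows are all made up for the sketch.

```python
FIXED_COLS = 9  # assumed VCF layout: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

# Calls to drop: homozygous reference and missing / low-quality reads.
SKIP = {"0/0", "0|0", "./.", ".|.", "."}


def to_sparse(lines):
    """Yield one row per variant, listing only the people who carry it."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[FIXED_COLS:]  # person IDs from the header row
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        carriers = []
        for sample, call in zip(samples, fields[FIXED_COLS:]):
            genotype = call.split(":", 1)[0]  # genotype is the first subfield
            if genotype not in SKIP:
                carriers.append(f"{sample}={genotype}")
        if carriers:  # variants nobody carries produce no row at all
            yield "\t".join([chrom, pos, ref, alt] + carriers)


# Made-up example: three people, two variant rows.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "19\t1000\t.\tT\tC\t.\tPASS\t.\tGT\t0/0\t0/1\t./.",
    "19\t2000\t.\tG\tA\t.\tPASS\t.\tGT\t0/0\t0/0\t0/0",
]
for row in to_sparse(example):
    print(row)
```

In the example only the first variant survives, with S2 as its single carrier; the second row vanishes because nobody has the variant. Since rare variants are rare, most cells in the wide format are `0/0`, which is where the space savings would come from.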

Let’s see if this lossless conversion saves any space…

ALL_SPARSE_FILES_c1.tar.gz 19-Jan-2017 23:27 2.9G
ALL_SPARSE_FILES_c2.tar.gz 19-Jan-2017 20:50 701M
ALL_SPARSE_FILES_c3.tar.gz 19-Jan-2017 20:55 280M
ALL_SPARSE_FILES_c4.tar.gz 19-Jan-2017 20:56 99M
ALL_SPARSE_FILES_c5.tar.gz 19-Jan-2017 20:57 613M
ALL_SPARSE_FILES_c6.tar.gz 19-Jan-2017 21:01 334M

Wow, that 70 GB file is now 2.9 GB.

More to come…