has become one of the biggest social forums on the web, which is impressive considering its user interface has a very 1990s look and feel. Given Reddit’s popularity despite feeling like an “old school” forum (and since I’m an active redditor myself) I wanted to figure out their secret ingredient, so I took a peek at their source code (which is publicly available on github).
Their algorithms seem straightforward to understand and implement, but for now I’ll focus on the post ranking algorithms (not the comment ranking algorithms; although Reddit’s comment ranking algorithm is quite interesting and the idea guy behind it is Randall Munroe, author of xkcd).
Post ranking code
Reddit is open sourced and the code is freely available. Reddit is implemented in Python and their code is located here. Their sorting algorithms are implemented in Pyrex, which is a language to write Python C extensions. They have used Pyrex for speed reasons. I have rewritten their Pyrex implementation into pure Python since it’s easier to read.
The default story algorithm called the hot ranking is implemented like this:
from datetime import datetime, timedelta
from math import log
epoch = datetime(1970, 1, 1)
"""Returns the number of seconds from the epoch to date."""
td = date - epoch
return td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000)
def score(ups, downs):
return ups - downs
def hot(ups, downs, date):
"""The hot formula. Should match the equivalent function in postgres."""
s = score(ups, downs)
order = log(max(abs(s), 1), 10)
sign = 1 if s > 0 else -1 if s < 0 else 0
seconds = epoch_seconds(date) - 1134028003
return round(order + sign * seconds / 45000, 7)
In mathematical notation the hot algorithm looks like this (I have this from SEOmoz, but I doubt they are the author of this):
Effects of submission time
Following things can be said about submission time related to story ranking:
Submission time has a big impact on the ranking and the algorithm will rank newer stories higher than older
The score won’t decrease as time goes by, but newer stories will get a higher score than older. This is a different approach than the Hacker News’s algorithm which decreases the score as time goes by
Here is a visualization of the score for a story that has same amount of up and downvotes, but different submission time:
The logarithm scale
Reddit’s hot ranking uses the logarithm function to weight the first votes higher than the rest. Generally this applies:
The first 10 upvotes have the same weight as the next 100 upvotes which have the same weight as the next 1000 etc...
Here is a visualization:
Without using the logarithm scale the score would look like this:
Effects of downvotes
Reddit is one of the few sites that has downvotes. As you can read in the code a story’s “score” is defined to be:
up_votes - down_votes
The meaning of this can be visualized like this:
This has a big impact for stories that get a lot of upvotes and downvotes (e.g. controversial stories) as they will get a lower ranking than stories that just get upvotes. This could explain why kittens (and other non-controversial stories) rank so high :)
Conclusion of Reddit’s story ranking
Submission time is a very important parameter, generally newer stories will rank higher than older
The first 10 upvotes count as high as the next 100. E.g. a story that has 10 upvotes and a story that has 50 upvotes will have a similar ranking
Controversial stories that get similar amounts of upvotes and downvotes will get a low ranking compared to stories that mainly get upvotes