Code-golf is the art of writing the shortest program in a given language that
implements a given algorithm. It started in the '90s in the Perl community
and spread to other languages; there are now languages dedicated to
code-golfing and StackExchange has a Q&A website for it.
4clojure is a well-known website to learn Clojure
through exercises of increasing difficulty, but it has a lesser-known code-golf
challenge which you can enable by clicking on “Leagues” in the top menu.
If you check the code-golf checkbox, you then get a score on each problem that
is the number of non-whitespace characters of your solution; the smaller the
better.
The first thing you’ll note when code-golfing is that the reader syntax for
anonymous functions is a lot shorter than using fn:
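For example (a made-up one), a function that adds 2 to its argument:

```clojure
(fn [x] (+ x 2)) ; 12 non-whitespace characters
#(+ % 2)         ; 6 non-whitespace characters
```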
Unfortunately you can't have a reader-syntax function inside another
reader-syntax one, so you often have to restructure the code to avoid anonymous
functions altogether.
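For instance, nesting them like this is rejected by the reader (a made-up example):

```clojure
;; Won't compile: nested #() literals are ambiguous (% could belong to either).
#(map #(inc %) %)
```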
for is a very powerful tool for that, because it allows you to do the
equivalent of map, and a lot more, with no function:
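For example (made-up), doubling each number of a range:

```clojure
(map #(* 2 %) (range 5))     ; with an anonymous function
(for [x (range 5)] (* 2 x))  ; same result, no function literal
```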
On some problems it can even be shorter than using map + filter:
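For instance (again made-up), keeping the squares of the elements of a collection `c` that are greater than 2:

```clojure
(map #(* % %) (filter #(> % 2) c))  ; map + filter
(for [x c :when (> x 2)] (* x x))   ; slightly shorter
```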
Some core functions are equivalent in some contexts, so the shorter one can
replace the longer one:
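A couple of made-up examples:

```clojure
(into [] c)   ; equivalent to the shorter:
(vec c)

(not (= a b)) ; equivalent to the shorter:
(not= a b)
```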
When you must use a long function name in multiple places, it might be shorter
to bind that function to a one-letter symbol with let:
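For instance (made-up), using frequencies twice to compare two collections:

```clojure
(= (frequencies a) (frequencies b))   ; 31 non-whitespace characters
(let [f frequencies] (= (f a) (f b))) ; 30, and the gap grows with each use
```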
Other tricks
Use indexed access on vectors:
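Vectors are functions of their indices (made-up example):

```clojure
(nth v 2) ; 7 non-whitespace characters
(v 2)     ; 4 non-whitespace characters
```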
Use set literals as functions:
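Sets are functions too; they return their argument if it's a member, nil otherwise. For example (made-up), keeping only the vowels of a string:

```clojure
(filter #{\a \e \i \o \u} "code-golf")
;; => (\o \e \o)
```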
Invert conditions to use shorter functions:
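For example (made-up), inverting the predicate lets the shorter every? replace not-any?:

```clojure
(not-any? even? c)  ; same as the shorter:
(every? odd? c)
```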
Inlined code is sometimes shorter:
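For instance (made-up), inlining a value that's used only once removes the let entirely:

```clojure
(let [s (sort c)] (first s))
(first (sort c))
```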
Use 1 instead of :else/:default in cond:
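Any truthy value works as the catch-all test (made-up example):

```clojure
(cond
  (neg? x) \-
  (pos? x) \+
  1        \0)
```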
Use maps instead of ifs for conditions on equality (this one really makes
the code harder to read):
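For example (made-up), a map lookup with a default value replaces an equality test:

```clojure
(if (= x :a) 1 2) ; vs the shorter but more cryptic:
({:a 1} x 2)
```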
The first step was to find a decent API for the Internet Archive. It
supports Memento, an HTTP-based protocol defined in RFC 7089 in 2013.
Using the memento_client wrapper, we can get the closest snapshot
of a website at a given date with the following Python code:
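Here is a sketch of what that code can look like, assuming the MementoClient API pointed at the Internet Archive's timegate (the target URL and the two-week delta are just examples):

```python
from datetime import datetime, timedelta
from memento_client import MementoClient

# Use the Internet Archive's timegate directly.
mc = MementoClient(timegate_uri="https://web.archive.org/web/",
                   check_native_timegate=False)

# Ask for the snapshot closest to two weeks ago.
date = datetime.now() - timedelta(weeks=2)
info = mc.get_memento_info("https://stackoverflow.com/", date)
closest = info["mementos"]["closest"]

print(closest["uri"][0])    # URL of the closest snapshot
print(closest["datetime"])  # when it was actually crawled
```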
Don’t forget to install the memento_client lib:
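```sh
pip install memento_client
```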
Note that this gives us the closest snapshot, so it might not be from exactly two
weeks ago.
We can use this code in a loop with an increasing time delta in order to get
snapshots at different times. But we don't only want to get the URLs; we want
to take a screenshot of each one.
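A sketch of that loop (the five-week step and the helper's shape are assumptions):

```python
from datetime import datetime, timedelta
from memento_client import MementoClient

mc = MementoClient(timegate_uri="https://web.archive.org/web/",
                   check_native_timegate=False)

def snapshot_urls(url, count=10, step=timedelta(weeks=5)):
    """Closest-snapshot URLs of `url`, going back `step` at a time."""
    urls, date = [], datetime.now()
    for _ in range(count):
        info = mc.get_memento_info(url, date)
        urls.append(info["mementos"]["closest"]["uri"][0])
        date -= step
    return urls
```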
The easiest way to programmatically take a screenshot of a webpage is probably
to use Selenium. I used Chrome through ChromeDriver; you can either download
it from the ChromeDriver website or run the following command
if you’re on a Mac with Homebrew:
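Depending on your Homebrew version, the command is one of the following (chromedriver is distributed as a cask):

```sh
brew cask install chromedriver    # older Homebrew
brew install --cask chromedriver  # newer Homebrew
```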
We also need to install Selenium for Python:
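```sh
pip install selenium
```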
The code is pretty short:
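Here is a minimal sketch of it, assuming StackOverflow's homepage as the target (the exact original code may differ):

```python
from selenium import webdriver

url = "https://web.archive.org/web/20181119211854/https://stackoverflow.com/"

driver = webdriver.Chrome()   # opens a Chrome window
driver.get(url)               # navigate and wait for the page to load
driver.save_screenshot("stackoverflow_20181119211854.png")
driver.quit()
```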
If you run the code above, you should see a Chrome window open, navigate to the
URL by itself, then close once the page is fully loaded. You now have a screenshot
of this page in stackoverflow_20181119211854.png! However, you’ll quickly
notice the screenshot includes the Wayback Machine’s header over the top of the
website:
This is handy when browsing through snapshots by hand, but not so much when we
access them from Python.
Fortunately, we can get a header-less URL by changing it a bit: appending id_
to the end of the date gives the page exactly as it was when the bot crawled
it. However, this means it links to CSS and JS files that may not exist
anymore. Using im_ instead gives an archived page that has been slightly
modified to replace those links with their archived versions.
Page with header and rewritten links:
https://web.archive.org/web/20181119211854/...
Original page, as it was when crawled:
https://web.archive.org/web/20181119211854id_/...
Original page with rewritten links:
https://web.archive.org/web/20181119211854im_/...
Re-running the code using the modified URL gives us the correct screenshot:
Combining the two bits of code, we can take screenshots of a URL at different
points in time. You may want to check the images once it's done to remove
inconsistencies. For example, the archived snapshots of Google’s homepage
aren’t all in the same language.
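A sketch of the combined loop, reusing the snapshot_urls helper sketched earlier and an example filename scheme:

```python
import re
from selenium import webdriver

driver = webdriver.Chrome()
for url in snapshot_urls("https://stackoverflow.com/", count=20):
    # Insert the im_ modifier right after the 14-digit timestamp
    # to get the page without the Wayback Machine header.
    url = re.sub(r"(/web/\d{14})/", r"\1im_/", url)
    timestamp = re.search(r"/web/(\d{14})", url).group(1)
    driver.get(url)
    driver.save_screenshot("stackoverflow_%s.png" % timestamp)
driver.quit()
```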
Once we have all the images, we can generate a gif using ImageMagick:
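For instance (the filenames are just an example):

```sh
convert -delay 50 -loop 1 stackoverflow_*.png stackoverflow.gif
```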
I used the following parameters:
-delay 50: show each frame for 0.5s. The number is in 100ths of a second.
-loop 1: loop only once over all the frames. The default is an infinite
loop, which doesn't really make sense here.
You may want to play with the -delay parameter depending on how many images
you have as well as how often the website changes.
I also made a version with Google (~10MB) at 5 frames per second,
with -delay 20. I used the same interval between snapshots
as for the StackOverflow gif: at least 5 weeks between each screenshot. You
can see which year the screenshot is from by looking at the bottom of each
image.
It started with a question: “Are movies getting longer and longer?”. Spoiler:
Not really, except maybe in the last couple of years.
I used Wikidata's online query service to export all movies, then kept those
with both a publication date and a duration. This gave me a large JSON file,
which I processed with Python in order to extract a few numbers for each year:
min, max, median, and the first and third quartiles.
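A sketch of that processing step, assuming the export has been reshaped into a list of {year, duration} records (file and field names are assumptions):

```python
import json
import statistics

with open("movies.json") as f:
    movies = json.load(f)  # e.g. [{"year": 1950, "duration": 90}, ...]

durations_by_year = {}
for movie in movies:
    durations_by_year.setdefault(movie["year"], []).append(movie["duration"])

stats = {}
for year, durations in sorted(durations_by_year.items()):
    if len(durations) < 2:  # not enough data to compute quartiles
        continue
    q1, median, q3 = statistics.quantiles(durations, n=4)
    stats[year] = {
        "min": min(durations),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(durations),
    }

with open("stats.json", "w") as f:
    json.dump(stats, f)
```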
The result fits in a small JSON file, which I then used to build a
D3 chart using a few lines of JS. I used colorbrewer2 to find a
colorblind-safe color palette.
You can see the result as well as the JS code on Observable.
To avoid outliers such as “Modern Times Forever” (240 hours) or
“The Burning of the Red Lotus Temple”, I used the interquartile
range (IQR) to limit the size of the bars: any movie whose duration is
below Q1-1.5×IQR or above Q3+1.5×IQR (where Q1 is the first quartile and
Q3 the third one) is not shown.
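As a small sketch (a hypothetical helper, not the actual Observable code), the bounds for one year would be:

```python
def iqr_bounds(q1, q3):
    """Movies outside [low, high] are treated as outliers and not shown."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr
```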
Results
As one can see on the graph, the median duration quickly rises from 50 to 95
minutes from the 1920s to the 1960s, then doesn’t move much except in the last
two years.
Limitations
The first obvious limitation is the data: Wikidata has 200k+ movies but only
73k have both a publication date and a duration. It’s not complete enough to
let me filter by movie type, e.g. feature films vs. others.
IMDb lists 5.3M titles (most of which are TV episodes), but there’s no way
to export them all.
In the end, there’s no way to know how representative Wikidata’s movies dataset
is. It does give a hint, but this graph is not a definitive answer to the
original question.