It started with a question: “Are movies getting longer and longer?”. Spoiler:
Not really, except maybe in the last couple of years.
I used Wikidata’s online query service to export all movies, then
filtered those with both a publication date and a duration. This gave me a
large JSON file, which I processed with Python to extract a few
numbers for each year: minimum, maximum, median, and first and third quartiles.
The result fits in a small JSON file, which I then used to build a
D3 chart with a few lines of JS. I used colorbrewer2 to find a
colorblind-safe color palette.
You can see the result as well as the JS code on Observable.
To avoid outliers such as “Modern Times Forever” (240 hours) or
“The Burning of the Red Lotus Temple”, I used the interquartile
range (IQR) to limit the size of the bars: any movie whose duration is
below Q1-1.5×IQR or above Q3+1.5×IQR (where Q1 is the first quartile and
Q3 the third one) is not shown.
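As a rough illustration of that filtering rule, here is a minimal Python sketch. The quartile method (`statistics.quantiles` with `method="inclusive"`), the function names and the sample durations are assumptions made for the example; the post doesn't say how the quartiles were actually computed.

```python
import statistics


def iqr_bounds(durations):
    """Return the (low, high) limits Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _median, q3 = statistics.quantiles(durations, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def without_outliers(durations):
    """Keep only the durations that fall inside the IQR-based limits."""
    low, high = iqr_bounds(durations)
    return [d for d in durations if low <= d <= high]


# Made-up durations (in minutes) for a single year; 14400 minutes is 240 hours.
sample = [80, 85, 90, 95, 100, 105, 110, 14400]
print(iqr_bounds(sample))        # (62.5, 132.5)
print(without_outliers(sample))  # [80, 85, 90, 95, 100, 105, 110]
```

The 240-hour entry barely moves the quartiles, which is why this rule is more robust than a cutoff based on the mean.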
Results
As one can see on the graph, the median duration quickly rises from 50 to 95
minutes from the 1920s to the 1960s, then doesn’t move much except in the last
two years.
Limitations
The first obvious limitation is the data: Wikidata has 200k+ movies but only
73k have both a publication date and a duration. It’s not complete enough to
let me filter by movie type; e.g. feature film vs. others.
IMDb lists 5.3M titles (most of which are TV episodes), but there’s no way
to export them all.
In the end, there’s no way to know how representative Wikidata’s movies dataset
is. It does give a hint, but this graph is not a definitive answer to the
original question.
Mattermost is a self-hosted, open-source alternative to Slack. We
use it at work, but for some reason link previews don’t work. Before diving into
Mattermost’s internals I wanted to see if I could write a quick workaround
using the fact that Mattermost does show an image if you post a link ending
with .png or .jpg.
The Current Situation
When you post an image link, Mattermost makes a request to show it in the
application. It detects those images using a regexp rather than by, e.g.,
sending a HEAD request to get the content type. If you have an image URL that
doesn’t end with a common extension, Mattermost won’t show it.
Mattermost doesn’t serve you a preview of the image; instead it gives you an
img element with the original URL. That means every single person reading the
channel will request the image from its original location. Slack, on the other
hand, fetches images, caches them, and serves them from its own domain,
https://slack-imgs.com. Slack uses a custom user-agent for its requests so you
know where they come from.
User-Agent: Slackbot-LinkExpanding 1.0 (…)
Mattermost, on the other hand, can’t use a custom user-agent because the
request is done by your browser. The only thing distinguishing Mattermost’s
request for a preview from any other request is that it asks for an image:
Accept: image/webp,image/*,*/*;q=0.8
The header above is Chrome telling the Web server it can deal with WebP
images, then images in any format, then anything; in that order. Note
it explicitly says it accepts WebP images because some
browsers don’t support the format.
Unfortunately not all browsers are explicit. Firefox has sent Accept: */* since
Firefox 47, and so did IE8 and earlier versions. In those cases we can’t
really do anything besides writing complicated rules based on the user-agent
and other headers.
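To make the reading of that header concrete, here is a small Python sketch that parses an Accept value into media types with their q weights and checks whether the most preferred one is an image type. The function names are made up for the example, and using the header order as a tie-breaker between equal weights is an assumption of this sketch rather than something HTTP guarantees.

```python
def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, most preferred first."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        entries.append((media_type, q))
    # sorted() is stable, so entries with the same q keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def prefers_image(header):
    """Return True if the most preferred media type is an image type."""
    entries = parse_accept(header)
    return bool(entries) and entries[0][0].startswith("image/")


print(parse_accept("image/webp,image/*,*/*;q=0.8"))
print(prefers_image("image/webp,image/*,*/*;q=0.8"))  # True
print(prefers_image("*/*"))                           # False
```

A request sending only `*/*` is indistinguishable from a regular page load with this check, which is the limitation described above.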
Proxying Requests
If we can tell a request from Mattermost asking for an image preview apart
from one made by a “normal” user, we can serve them different content: a
link preview as an image to Mattermost, and the normal content to the user.
All we have to do is make some sort of intelligent proxy. Using Flask we
can make something like this:
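from flask import Flask, redirect, request

app = Flask(__name__)

@app.route("/<path:path>/p.png")
def main(path):
    # Anything that doesn't explicitly ask for an image goes to the original URL.
    if not request.headers.get("Accept", "").startswith("image/"):
        return redirect(path)
    return "Hello, I'm an image"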
This is a small web application that takes any route of the form
/<url>/p.png and either redirects you to that URL if your Accept header
doesn’t start with image/, or serves you a Hello, I'm an image page.
All we have to do now is to return an actual image in lieu of that placeholder
text. I used Requests and Beautiful Soup to fetch and parse
webpages, and Pillow to generate images.
When one requests http://example.com/http://...some...url.../p.png, the
app fetches that URL, parses it to extract its title and an excerpt, writes
them on a blank image, and serves it.
Extracting the title is as easy as grabbing the title HTML tag. If
available, I also try the og:title and twitter:title meta tags. If none
of those are available, I fall back on the first h1 or h2 tag.
Getting a usable excerpt is not too hard; here again I search for common
meta tags: description, og:description, twitter:description. If I can’t
get any of them I take the first p that looks long enough.
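The fallback chain for the title and the excerpt could look roughly like the sketch below. To keep it dependency-free it uses Python's built-in `html.parser` instead of Beautiful Soup, so it is not the code behind the prototype. The class and function names, the exact priority between the title tag and the meta tags, and the 40-character threshold for "long enough" are all assumptions.

```python
from html.parser import HTMLParser


class PreviewParser(HTMLParser):
    """Collect meta tags and the text of title, h1, h2 and p elements."""

    TEXT_TAGS = {"title", "h1", "h2", "p"}

    def __init__(self):
        super().__init__()
        self.meta = {}    # meta name/property -> content
        self.texts = []   # (tag, text) pairs in document order
        self._tag = None  # text tag currently being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                self.meta.setdefault(key.lower(), attrs["content"].strip())
        elif tag in self.TEXT_TAGS and self._tag is None:
            self._tag, self._buffer = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.texts.append((tag, text))
            self._tag = None


def extract_preview(html, min_excerpt_length=40):
    """Return a (title, excerpt) pair; either can be None if nothing is found."""
    parser = PreviewParser()
    parser.feed(html)

    def first_text(tags, min_length=0):
        return next((text for tag, text in parser.texts
                     if tag in tags and len(text) >= min_length), None)

    def first_meta(keys):
        return next((parser.meta[key] for key in keys if key in parser.meta), None)

    title = (first_text({"title"})
             or first_meta(("og:title", "twitter:title"))
             or first_text({"h1", "h2"}))
    excerpt = (first_meta(("description", "og:description", "twitter:description"))
               or first_text({"p"}, min_excerpt_length))
    return title, excerpt


page = "<h2>Fallback</h2><p>tiny</p><p>This paragraph is long enough to be used as an excerpt.</p>"
print(extract_preview(page))
```

Real pages are messier than this (unclosed p tags, empty meta tags, odd encodings), which is where a more forgiving parser such as Beautiful Soup helps.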
The Pillow library makes it easy to write text on an image. I replaced the
(ugly) default font with Alegreya. The tricky part is mostly to fit the
text on the image. I used a combination of Draw#textsize
to get the dimensions of some text as if it were written on the image and
Python’s textwrap module to cut the title and excerpt so that
they wouldn’t cross the right side of the image.
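One way to combine a text-measuring function with textwrap is sketched below. The `measure` parameter is a stand-in for whatever returns the pixel width of a string: `Draw#textsize` in the Pillow versions available at the time, or, as far as I know, `ImageDraw.textlength` in recent ones. The function name and the fake 10-pixels-per-character measure are assumptions made so the example runs without Pillow.

```python
import textwrap


def wrap_to_width(text, max_width, measure):
    """Wrap text so that every line measures at most max_width pixels.

    `measure` is any callable returning the pixel width of a string.
    """
    if not text:
        return []
    # Try the widest lines first and shrink until every line fits.
    for chars in range(len(text), 0, -1):
        lines = textwrap.wrap(text, width=chars)
        if all(measure(line) <= max_width for line in lines):
            return lines
    # Nothing fits, not even single characters: return the narrowest wrapping.
    return textwrap.wrap(text, width=1)


def fake_measure(text):
    """Pretend every character is 10 pixels wide (a real font is not monospace)."""
    return 10 * len(text)


print(wrap_to_width("hello wonderful world", 100, fake_measure))
# ['hello', 'wonderful', 'world']
```

Trying every width is quadratic in the text length, which is fine for a title and a short excerpt but would deserve a binary search for longer texts.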
I used fixed dimensions for all images (400×70) and kept a small padding along
their sides. Previews with small or missing excerpts get some unused white
space at the bottom; this could be fixed by pre-computing the final size of the
image before creating it.
Google.com’s generated preview
GitHub repos have a very small excerpt
On most websites the preview works fine. We could tweak the text size as well
as add a favicon or an extracted image.
This post’s generated preview
Conclusion
In the end it took a couple of hours to have a working prototype. Most of that
time was spent dealing with encoding issues and trying various webpages to find
edge cases.
The result is acceptable; it has the issue of not being accessible at all but
that’s still better than nothing.
You’ve probably read this instruction quite a few times in online tutorials
about installing command-line tools. What is this PATH for? Why should we
add directories “to” it? This is the subject of this post.
How the Shell Finds Executables
When you type something like ls, your shell has to find out what this ls
program you’re trying to run is. Typical shells only have a handful of
predefined commands like cd; most of the commands you use every day are
standalone programs.
In order to find that ls program, your shell looks in a few directories.
Those directories are stored in an environment variable called PATH; you can
look up its value in most shells using the following command:
echo $PATH
You can see it contains a list of paths separated by colons. Finding a program
in these is just a matter of checking each directory to see if it contains an
executable with the given name. This lookup is implemented by the which
program. The algorithm is pretty simple and goes like this:
cmd = "ls"
for directory in $PATH.split(":") {
    candidate = $directory/$cmd ;
    if candidate exists and is executable {
        return candidate ;
    }
}
This is pseudo-code, but there are implementations in various languages:
C, Go, Ruby, Dart, etc.
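For comparison, the same lookup can be written as a short, runnable Python sketch. The function names are made up for the example, and skipping empty PATH entries is a simplification (some shells treat an empty entry as the current directory). Python's standard library also ships `shutil.which`, which is the better choice in real code.

```python
import os


def find_executables(cmd, path=None):
    """Return every match for cmd in the PATH directories, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    matches = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # simplification: ignore empty entries
        candidate = os.path.join(directory, cmd)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            matches.append(candidate)
    return matches


def which(cmd, path=None):
    """Return the first match for cmd, or None if there is none."""
    matches = find_executables(cmd, path)
    return matches[0] if matches else None


# The result depends on the machine and on the current PATH.
print(which("ls"))
```

Passing `path` explicitly makes it easy to see how reordering the directories changes which executable wins.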
As you can see, the order of these directories matters because the first
matching candidate is used. That means if you have a custom ls program in
/my/path, putting /my/path at the beginning of the PATH variable will
cause ls to always refer to your version instead of e.g. /bin/ls, because
/bin appears after /my/path.
You can perform this lookup using the which command. Add -a to get all
matching executables in the order in which they’re found. This is what
which python gives on my machine:
$ which python
/usr/local/bin/python
$ which -a python
/usr/local/bin/python
/usr/bin/python
You can see I have two python executables, but the shell picks the one in
/usr/local/bin instead of /usr/bin because it appears first in my PATH.
You can bypass this lookup by giving a path to your executable. This is why
you run executables in the current directory by prefixing them with ./:
./my-program
This tells the shell you want to run the program called my-program in the
current directory. It won’t search in the PATH for it. It also works with
absolute paths. The following command runs my python in /usr/bin regardless
of what’s in my PATH variable:
/usr/bin/python
For performance reasons a shell like Bash won’t look executables up all the
time. It’ll cache the information for the current session and will hence do
this lookup only once per command. This is why you must reload your shell to
have your PATH modifications taken into account. You can force Bash to clear
its current cache with the hash builtin:
hash -r
Now that we know how our shell finds executables, let’s see how this PATH
variable is populated.
Where Does That PATH Come From?
This part depends on both your shell and your operating system. Bash reads
/etc/profile when it starts. It contains some setup instructions, including
initial values for the PATH variable. On macOS, it executes
/usr/libexec/path_helper, which in turn looks in /etc/paths for the initial
paths.
The file looks like this on my machine:
$ cat /etc/paths
/usr/bin
/bin
/usr/sbin
/sbin
The actual code to set the PATH variable (or any variable for that matter) in
Bash looks like this, here with the directories listed above:
export PATH="/usr/bin:/bin:/usr/sbin:/sbin"
Technically Bash doesn’t need you to export the PATH variable to use it
itself, but exporting matters when, for example, a program you use executes
another program; in this case the former must be able to find the latter
using the correct PATH.
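The point about child processes can be checked from Python, where `os.environ` plays the role of the exported variables. This is only an illustrative sketch: the `child_path` helper and the `/my/path` directory are made up, and it starts a second Python interpreter through `sys.executable`.

```python
import os
import subprocess
import sys


def child_path(env):
    """Run a child Python process with `env` and return the PATH it sees."""
    code = "import os; print(os.environ.get('PATH', ''))"
    result = subprocess.run([sys.executable, "-c", code], env=env,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Copy the current environment and put a made-up directory first in PATH.
env = dict(os.environ, PATH="/my/path" + os.pathsep + os.environ.get("PATH", ""))
print(child_path(env).split(os.pathsep)[0])  # /my/path
```

The child only sees what was placed in its environment, just like a program started from a shell only sees exported variables.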
How Do We Modify It?
Each shell has its own file in the user’s home directory for per-user setup
scripts. For Bash, it’s ~/.bash_profile, which often sources ~/.bashrc.
You can use this file to override the default PATH. It’s loaded when a
session starts, meaning you have to either reload your shell or
re-source this file after modifying it.
We saw in the previous section how to set the PATH variable, but most of the
time we don’t want to set the whole list of directories by hand; we only want
to modify its value to add our own directories.
We won’t dive into the details in this post, but Bash has a syntax to get the
value of a variable by prefixing its name with a dollar symbol:
echo myvar # prints "myvar"
echo $myvar # prints "foo", i.e. myvar's value
Bash also supports string interpolation using double quotes: you can include
the value of a variable in a double-quoted string by just writing $ followed
by its name:
echo "hello I'm $myvar" # prints "hello I'm foo"
We use this feature to append or prepend directories to the PATH variable:
prepending means setting the PATH’s value to that directory followed by a
colon followed by the previous PATH’s value:
PATH="/my/directory:$PATH"
You usually don’t need to re-mark this variable as exported but using export
at the beginning of the command doesn’t hurt.
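The prepend operation is plain string manipulation, which the following Python sketch spells out. Unlike the Bash one-liner it also drops an earlier occurrence of the directory so that re-running it doesn't keep growing the value; that deduplication, the function name and the sample value are additions for the example, not something Bash does for you.

```python
def prepend_to_path(directory, path_value, sep=":"):
    """Return path_value with directory moved (or added) to the front."""
    entries = [entry for entry in path_value.split(sep)
               if entry and entry != directory]
    return sep.join([directory] + entries)


print(prepend_to_path("/my/directory", "/usr/bin:/bin"))
# /my/directory:/usr/bin:/bin
print(prepend_to_path("/bin", "/usr/bin:/bin"))
# /bin:/usr/bin
```

Appending works the same way with the new directory placed at the end of the list instead.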
Wrapping Things Up
Modifying the PATH is not something we do very often because most tools are
installed in standard locations—already in our PATH. Most package managers
install executables in their own location and need the user to modify their
PATH. Homebrew, for example, installs them under /usr/local/bin and
/usr/local/sbin by default. If those are not already in the PATH, one needs
to add them:
# In e.g. ~/.bash_profile
export PATH="/usr/local/bin:/usr/local/sbin:$PATH"
This means the shell will first look in these directories for executables. It
allows one to “override” existing tools with more up-to-date ones.