Baptiste Fontaine’s Blog

Another blog

In 2023 I wrote only one article on this blog, but a few more on the tech blog of the company I currently work for, Bixoto.

This explains why the present blog might look abandoned. It’s not; it’s just that over the years I’ve reduced the time I spend on side projects, and nowadays 95% of the technical work I do on a computer happens during work hours.

Per-project gsutil service accounts

When using any Google Cloud client library you can specify a service account with GOOGLE_APPLICATION_CREDENTIALS, but unfortunately that doesn’t work when using gsutil in shell scripts. The documentation suggests using gcloud auth activate-service-account, but that “activates” the service account for all gsutil invocations, and it doesn’t work if you installed a standalone version of gsutil, without gcloud.

I wanted one service account per project so that each project has access to the relevant resources only. The solution I found is to use a Boto file: an INI-like file format used for AWS configuration, which gsutil also supports. You can tell gsutil where to find such a file with BOTO_CONFIG, or give it a list of paths to look in with BOTO_PATH (an example of the latter is shown at the end of this section).

In a simple project where the main code is a shell script, the setup would look like this:

$ ls -a
.boto
script.sh
service-account.json

In .boto:

[Credentials]
gs_service_key_file=/app/service-account.json

In script.sh:

#!/usr/bin/env bash
set -e
export BOTO_CONFIG=/app/.boto
gsutil ...

This is a bit cumbersome compared to GOOGLE_APPLICATION_CREDENTIALS but it works well.
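Alternatively, BOTO_PATH takes a list of config paths (colon-separated on Unix); gsutil reads them in order and later files override earlier ones, which is handy for layering a per-project file on top of a shared one. A minimal sketch, where /etc/boto.cfg and the bucket name are placeholders:

#!/usr/bin/env bash
set -e
# Shared defaults first, per-project credentials last.
export BOTO_PATH=/etc/boto.cfg:/app/.boto
gsutil ls gs://example-bucket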

Unique Strings and Unicode

For the past three years I’ve been running a Twitter bot called @lemondeadit, a French equivalent of @NYT_first_said, which tweets new words as they appear in the French newspaper Le Monde.

The account was created in June 2019, but the first commit dates back to 2018. One day I might blog a bit about how it works under the hood, but today I’d like to talk about a small bug that drove me crazy for a while.

The bot had been running stably for a couple of years without me touching the code at all. But recently, I noticed that it would sometimes tweet very common words as if they were new, like “économie” or “opéra”. I deleted these tweets when I saw them, but I didn’t understand why it happened.

I initially attributed the bug to an issue with the search engine: when the bot sees a word it has never seen before, it uses Le Monde’s internal search engine to verify that the word indeed appears in a single article. Around the end of 2019, Le Monde ditched their previous search engine, which was slow but exact, in favor of a Qwant-based one which is fast but inexact: it doesn’t respect your query (it may or may not autocorrect it, with no way to prevent it from doing so), it’s not accurate (an empty result may or may not mean there is no match), and it’s not stable (run the same query twice and you can get different results, sometimes off by an order of magnitude). In French, we would say this is “de la merde” (“crap”).

Anyway, I thought the internal search engine might be returning a single result where it should return several, but this doesn’t make sense: the bot doesn’t check every single word with the search engine, only those it believes are unique. And since it has already scraped virtually all articles from the newspaper, there’s no way it would consider “économie” a new word.

Today, I decided to tackle the issue.

First, I searched the database for a recent example and got “vaccinées” (the feminine plural form of “vaccinated”). This is a very old word: it first appeared in Le Monde in 1946 and has shown up more than 1.7k times since then. The word was even present twice in the database, despite a UNIQUE constraint on the field:

   word
-----------
 vaccinées
 vaccinées

I inspected the strings and noticed they were encoded differently, although both were valid UTF-8. The difference was in the “é” character: one string uses the single Unicode character LATIN SMALL LETTER E WITH ACUTE, while the other uses a combination of LATIN SMALL LETTER E and COMBINING ACUTE ACCENT. Both render exactly the same on screen, which is what makes this issue so hard for a human to detect.

>>> "é" == "é"
False
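You can make the difference visible by asking Python to name each code point. A quick REPL session, spelling out the two forms with explicit escape sequences:

>>> import unicodedata
>>> s1, s2 = "\u00e9", "e\u0301"
>>> s1 == s2
False
>>> [unicodedata.name(c) for c in s1]
['LATIN SMALL LETTER E WITH ACUTE']
>>> [unicodedata.name(c) for c in s2]
['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']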

The fix was to normalize the text before extracting its words. I did it in Python using unicodedata.normalize:

import unicodedata

text = unicodedata.normalize("NFKC", text)

There are multiple normalization forms, and I chose NFKC: it first decomposes characters using compatibility equivalence, then recomposes them canonically, so the precomposed and combined versions of a character end up as the same string.
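After normalization, the two variants of “é” compare equal; a quick sanity check in the REPL:

>>> import unicodedata
>>> unicodedata.normalize("NFKC", "e\u0301") == "\u00e9"
True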

After fixing the code, I needed to repair the already-parsed words in the database. In PostgreSQL, we can use SIMILAR TO to search for strings matching a certain character range: in my case, the “Combining Diacritical Marks” block, aka U+0300 to U+036F:

SELECT word FROM word WHERE word SIMILAR TO E'%[\u0300-\u036F]%';

I use Peewee to interact with Postgres from Python. While it doesn’t support SIMILAR TO out of the box, it’s simple to use a custom expression:

from peewee import Expression

Word.select().where(Expression(Word.word, "SIMILAR TO", "%[\u0300-\u036F]%"))

Unfortunately this feature is not available in SQLite, the database I use to run my integration tests, and so I had to adapt the code a little bit: first use SqliteExtDatabase instead of SqliteDatabase to get support for REGEXP, and then use the .regexp operator:

# SQLite
Word.select().where(Word.word.regexp("[\u0300-\u036F]"))
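For completeness, the test database setup is just a class swap. A minimal sketch, assuming a local test.db file (regexp_function=True tells peewee to back REGEXP with Python’s re module):

from playhouse.sqlite_ext import SqliteExtDatabase

# Registers a REGEXP implementation so Word.word.regexp(...) works.
db = SqliteExtDatabase("test.db", regexp_function=True)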

I was then able to run a quick function to normalize the ~1k words affected by the issue.
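A minimal sketch of what that function can look like (Word is the peewee model from the snippets above; collisions with the UNIQUE constraint, when the normalized form already exists in the table, are not handled here):

import unicodedata
from peewee import Expression

def normalize_words():
    # Find the rows containing combining diacritical marks...
    query = Word.select().where(
        Expression(Word.word, "SIMILAR TO", "%[\u0300-\u036F]%"))
    # ...and rewrite them in composed (NFKC) form.
    for word in query:
        word.word = unicodedata.normalize("NFKC", word.word)
        word.save()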

To conclude: this seemed like a very weird issue at first, but in the end it allowed me to learn a few things about Unicode, Postgres, and Python’s fantastic standard library.