Baptiste Fontaine’s Blog

Do You Speak Tar?

xkcd comic #1168

For a lot of people, the GNU tar command’s options seem obscure and hard to use. The most common ones exist only in a short form and always appear grouped in the same order and often without a leading hyphen, e.g. tar xzvf archive.tgz and not tar -v -z -x -f archive.tgz. Additionally, tar doesn’t work without any “option”.

These options, or rather these commands, can be seen as a (small) language that you can learn to speak, write or read. Each command has its own meaning, which sometimes depends on which other commands are used with it.

The Grammar

tar’s sentences start with a verb. There’s no subject, because you’re giving an order to tar. This verb is followed by zero or more modifiers that give more context to the action. The last part is the object(s) on which the action is made. Spaces are not needed between tar’s words because they all consist of one letter.

Actions

The two most common actions are “create” (c) and “extract” (x). The first one is used to create an archive from some files; and the second is used to extract that archive in order to get back those files.

All tar implementations support one more action: “list” (t) to list an archive’s content without extracting it. Some implementations support two variants of “create”: “append” (r) and “update” (u). The former appends files to an existing archive; the latter updates files in the archive for which a more recent version exists.

Unfortunately we now know all tar actions but can’t do much without knowing how to apply them to an object. Let’s dive into objects and we’ll see the modifiers later.

Objects

tar has a very limited set of objects: archives. Each tar command operates on one archive, which is given by f (for “file”) followed by its path.

Files added or extracted from archives are simply given as extra arguments to these commands without needing any special word.

We’re now ready to write our first meaningful sentences.

“Hey tar, please create an archive file foo.tar with file1 and file2” is written as tar cf foo.tar file1 file2.

“Extract archive file foo.tar” is written as tar xf foo.tar. “List archive file foo.tar” is written as tar tf foo.tar. You get the idea.

Note that actions like “extract” or “list” accept additional arguments for the file patterns you want to extract/list. Say you have a big archive from which you only want to extract one important.txt file. Just give this information to tar and it’ll kindly extract it for you:

tar xf big-archive.tar important.txt

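You might wonder what this “file” word is for if we always need it. Well, we can remove it. But if we do so, our tar command doesn’t have any object left, so it’ll look at something else: STDIN or STDOUT. (That is the usual default; some builds fall back to a tape device or to the TAPE environment variable instead.)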

Actions that read archives operate on STDIN if you don’t give them a file object:

cat big-archive.tar | tar x important.txt

You can also be explicit by giving - to f:

cat big-archive.tar | tar xf - important.txt

The “create” action will output the archive on STDOUT if you don’t give it a name (or use f -). You still need to give it the name of the files to put in that archive:

tar c file1 file2 > archive.tar
# Same, but more explicit
tar cf - file1 file2 > archive.tar

Note that you can’t extract an archive to STDOUT without a modifier. tar operates on files, not on data streams. By default it doesn’t compress its content so creating a tar archive for one file doesn’t make much sense.

Now that we know how to write basic sentences, let’s add some modifiers to them.

Modifiers

In my experience the most used modifiers are v and z. The first one is the “verbose” flag and makes tar more chatty. When creating or extracting an archive it’ll print each file’s name as it’s (un)archiving it. When listing an archive it’ll print more info about each file. Compare both outputs below:

$ tar tf archive.tar
file1
file2
$ tar tvf archive.tar
-rw-r--r--  0 baptiste wheel   31425 18 sep 14:51 file1
-rw-r--r--  0 baptiste wheel   18410 18 sep 14:51 file2

The v modifier can be combined with any other one mentioned below.

z will tell tar to use Gzip to (de)compress the archive. Nowadays tar-ing with no compression is rare, and Gzip is ubiquitous. Just add z to your modifiers and tar will create compressed archives and extract them. The convention is to use .tar.gz or .tgz for such archives:

tar czf archive.tar.gz file1 file2
# Later…
tar xzf archive.tar.gz

Other common modifiers include j, which works exactly like z but (de)compresses using Bzip2 instead of Gzip. Such archives usually end with .tar.bz2 or .tbz2. It’s not named b or B because those were already taken when this modifier was introduced.

Similarly to j, one can use its capital friend, J. This one compresses using xz instead of Gzip or Bzip2. These archives use the extensions .tar.xz or .txz.

Note you can also (de)compress archives by yourself if you don’t remember these modifiers:

tar cf myarchive.tar file1 file2
gzip myarchive.tar
# Later…
gunzip myarchive.tar.gz
tar xf myarchive.tar

There are a dozen other modifiers you can find in the manpage, but let’s mention two more: O and k. You may remember that I wrote earlier that you can’t extract an archive to STDOUT without a modifier. Well, that modifier is called O:

tar xOf myarchive.tar

Using O when extracting will print the content of the archive on STDOUT. This is the same output as you would get by calling cat on all the files in it.

The last modifier I wanted to mention is k, which tells tar not to overwrite existing files when extracting an archive. That is, if you already have a file called important.txt in your directory and you un-tar an archive using the k modifier, you can be sure it won’t overwrite your existing important.txt file.


I hope this post helped you get a better understanding of tar commands, and showed that they’re not that complicated. I put a few (valid) commands below, just so you can check that you understand what they’re doing:

tar xf foo.tar
tar cvzf bar.tar a b c
tar tvf foo.tar important.txt
tar cz file1 file2 > somefile
tar xzOf somefile
tar cJf hey.txz you andyou
tar xkjvf onemore.tbz2

Syntax Quiz #1: Ruby’s mysterious percent suite

This is the first post of a series I’m starting about syntax quirks in various languages. I’ll divide each post into two parts: the first one states the question (mainly “Is this valid? What does it do? How does it work?”); the second one gives an answer.

In Ruby, any sequence of 3+4n (where n≥0) percent signs (%) is valid; can you guess why?

Here are the first members of this sequence:

%%%             # n=0
%%%%%%%         # n=1
%%%%%%%%%%%     # n=2
%%%%%%%%%%%%%%% # n=3

I’m using Ruby 2.3.0 but I was able to reproduce this behavior on the almost-10-year-old Ruby 1.8.5, so you should be fine with any version.

You can stop here and try to solve this problem or skip below for an answer.


The answer to the problem lies in two things: string literals and string formatting.

You might know you can use %q() to create a string; which can be handy if you have both single and double quotes and don’t want to escape them:

my_str = %q(it's a "valid" string)

This method doesn’t support interpolation with #{} but its uppercased friend does:

my_str = %Q(it's still #{42 - 41} "valid" string)

The equivalent also exists to create arrays of strings with %w() and %W(), regular expressions with %r, as well as %i to create arrays of symbols starting in Ruby 2.0.0:

names = %w(alice bob charlie) # => ["alice", "bob", "charlie"]
names.each { |name| puts "Hello #{name}" }

my_syms = %i(my symbols) # => [:my, :symbols]

puts "yep" if %q(my string) =~ %r(my.+regexp)

You can also use [], {} or <> instead of parentheses:

%w{foo bar}             # => ["foo", "bar"]
%q[my string goes here] # => "my string goes here"
%i<a b c>               # => [:a, :b, :c]

Ruby lets you use a percent sign alone as an alias of %Q:

%(foo bar) == %<foo bar> # => true
%{foo bar} == "foo bar"  # => true

But wait; there’s more! You can also use most non-alphanumeric characters like | (%|my string|), ^ (%w^x y z^), or… %:

%w%my array%        # => ["my", "array"]
%q%my string%       # => "my string"
%%my other string%  # => "my other string"

This means that %||, %^^ or %%% can be used to denote an empty string (don’t do that in real programs, please). It answers the problem for the case n=0: %%% is an empty string; the first percent sign indicates it’s a literal string, and the following two are respectively the beginning and end delimiters.

The second part of our answer is string formatting.

If you have ever written a Python program you know it supports string formatting à la sprintf with %:

print "this is %s, I'm %d years-old" % ("Python", 25)

Well, Ruby supports the same method, called String#%:

puts "this is %s, I'm %d years-old" % ["Ruby", 21]

In both languages you can drop the array/tuple if you have only one argument:

print "I'm %s" % "Python"
print "I'm %d" % 25
puts "I'm %s" % "Ruby"
puts "I'm %d" % 21

Both will raise an exception if you don’t have enough arguments, but only Python will do so if you have too many of them:

# Python
print "I'm %s" % ("Python", "Ruby")
# => TypeError: not all arguments converted during string formatting
# Ruby
puts "I'm %s" % ["Ruby", "Python"]
# prints "I'm Ruby"

This means that while "" % "" is syntactically valid in both languages, only Ruby runs it without error: Python raises an exception saying that the argument (the string on the right) is not used.

If we combine this knowledge with what we have above with literal strings we now know we can write the following in Ruby:

%%% % %%% # equivalent to "" % ""

The last key is that it works without spaces and can be chained:

""  % ""  % ""  # => ""
%%% % %%% % %%% # => ""
%%%%%%%%%%%     # => ""

The 3+4n refers to the way the expression is constructed: the first three percent signs are an empty string, and each following group of four is the formatting operator followed by another empty string.


Want more of these? Here are a few other valid Ruby expressions using strings and percent signs (one per line); guess how they’re parsed and evaluated:

%*%%*%%*%%*
%_/\|/\_/\|/\__
%_.-''-._%%_.-''-._%%_.-''-_%%.etc.
%/\/\/\/\/\______/
%# <- is it really valid? :-#

A JavaScript Modules Manager That Fits in a Tweet

ES6 does provide modules; but unless you’re using Babel you’ll have to rely on third-party libraries such as RequireJS until all major browsers support them.

I use D3 every day to visualize data about ego networks and have a small (400-500 SLOC) JavaScript codebase I need to keep organized. In the context I work in I must keep things simple as I won’t always be there to maintain the code I’m writing today.

How simple could a module implementation possibly be? It should at least be able to register modules and require a module inside another, much like Python’s import. It should also handle issues like circular dependencies (e.g. foo requires bar which requires foo) and undeclared modules. Modules should be lazily loaded, i.e. only when they are required; and requiring the same module twice shouldn’t execute it twice.

Well, here is one:

p={f:{},m:{},r:function(a,b){p.f[a]=b},g:function(a){if(!p.m[a]){if(p.m[a]<1|!p.f[a])throw"p:"+a;p.m[a]=0;p.m[a]=p.f[a](p)}return p.m[a]}};

It’s 136 bytes long, or 139 if you count the variable assignment. At this size you can’t expect long function names, but here is a usage example:

// register a "main" module. A module consists of a name and a
// function that takes an object used to require other modules.
p.r("main", function(r) {
    // get the "num" module and store it in a `num` variable
    var num = r.g("num");

    // use it to print something
    console.log(num.add(20, 22));
});

// register a "num" module
p.r("num", function(r) {
    // a module can export bindings by returning an object
    return {
        add: function(a, b) { return a+b; },
    };
});

// call the "main" module
p.g("main");

This code will print 42 in the console. It only uses two modules but the implementation works with an arbitrary number of modules. A module can depend on any number of other modules that can be declared in an arbitrary order.
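The laziness and run-once requirements stated earlier can be checked with a small sketch. It uses a readable copy of the loader (the same logic as the one-liner); the `runs` array is a hypothetical helper added only to record which module functions actually execute.

```javascript
// Readable copy of the loader from this post (same logic as the one-liner).
const p = {
    f: {}, m: {},
    r: function(name, callback) { p.f[name] = callback; },
    g: function(name) {
        if (!p.m[name]) {
            if (p.m[name] < 1 | !p.f[name])
                throw "p:" + name;
            p.m[name] = 0;
            p.m[name] = p.f[name](p);
        }
        return p.m[name];
    },
};

// Hypothetical helper for this demo: records each module function that runs.
const runs = [];

p.r("main", function(r) {
    runs.push("main");
    return { answer: r.g("num").add(20, 22) };
});

p.r("num", function(r) {
    runs.push("num");
    return { add: function(a, b) { return a + b; } };
});

// Registering executes nothing: modules are lazy.
const runsAfterRegister = runs.length;

// Requiring "main" runs "main", which in turn requires and runs "num".
const first = p.g("main");

// Requiring again returns the stored exports without running anything.
const second = p.g("main");
p.g("num");

console.log(runsAfterRegister, runs, first === second, first.answer);
```

Nothing runs at registration time, each module function runs exactly once ("main" first, then "num"), and both calls to `p.g("main")` return the very same object.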

Consider this example:

p.r("m1", function(r) { r.g("m2"); });
p.r("m2", function(r) { r.g("m3"); });
p.r("m3", function(r) { r.g("m1"); });

p.g("m1");

m1 depends on m2 which depends on m3 which itself depends on m1. The implementation won’t die in an endless loop leading to a stack overflow but will fail as soon as it detects the loop:

p:m1

Admittedly this error message doesn’t give us much information, but we have to be thrifty in order to fit under 140 characters. The prefix p: tells you the error comes from p, and the part after is the faulty module. It can either be a wrong name (the module doesn’t exist) or a circular dependency.
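To make the two failure modes concrete, here is a minimal sketch that triggers both and captures what is thrown. The `thrownBy` helper and the module name "nope" are made up for this illustration; the loader is a readable copy of the one-liner.

```javascript
// Readable copy of the loader from this post (same logic as the one-liner).
const p = {
    f: {}, m: {},
    r: function(name, callback) { p.f[name] = callback; },
    g: function(name) {
        if (!p.m[name]) {
            if (p.m[name] < 1 | !p.f[name])
                throw "p:" + name;
            p.m[name] = 0;
            p.m[name] = p.f[name](p);
        }
        return p.m[name];
    },
};

// Hypothetical helper: returns whatever p.g throws, or null if it succeeds.
function thrownBy(name) {
    try {
        p.g(name);
    } catch (e) {
        return e;
    }
    return null;
}

p.r("m1", function(r) { r.g("m2"); });
p.r("m2", function(r) { r.g("m3"); });
p.r("m3", function(r) { r.g("m1"); });

// Circular dependency: m1 -> m2 -> m3 -> m1.
const circular = thrownBy("m1");

// Unknown module: "nope" was never registered.
const unknown = thrownBy("nope");

console.log(circular, unknown);
```

Both cases throw a plain string with the same shape, so the message alone doesn’t tell you which of the two problems occurred.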

Walk-through

Note: don’t use this at home. This is just an experiment; I eventually used Browserify for my project.

We need an object to map modules to their functions; we’ll populate it on calls to register. We need another object to store the result of their function call; i.e. what they export. I added a third object to “lock” a module while it’s executed in order to detect circular dependencies.

We’ll have something like this:

var p = {
    _fn: {},   // the functions
    _m: {},    // the modules’ exported values
    _lock: {}, // the locks

    register: function(name, callback) {
        // add the function in the object
        p._fn[name] = callback;
    },

    get: function(name) {
        // if we have a value for this module let’s return it.
        // Note that we should use `.hasOwnProperty` here
        // because this’ll fail if the module returns a falsy
        // value. This is not really important for this problem.
        if (p._m[name]) {
            return p._m[name];
        }

        // if it’s locked that’s because we’re already getting
        // it; so there’s a recursive requirement
        if (p._lock[name]) {
            throw "Recursive requirement: '" + name + "'";
        }

        // if we don’t have any function for this we can’t
        // execute it and get its value. See also the
        // remark about `.hasOwnProperty` above.
        if (!p._fn[name]) {
            throw "Unknown module '" + name + "'";
        }

        // we lock the module so we can detect circular
        // requirements.
        p._lock[name] = true;

        try {
            // execute the module's function and pass
            // ourselves to it so it can require other
            // modules with p.get.
            p._m[name] = p._fn[name](p);
        } finally {
            // ensure we *always* remove the lock.
            delete p._lock[name];
        }

        // return the result
        return p._m[name];
    },
};

This works and is pretty short; but that won’t fit in a Tweet ;)
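As an aside, the `.hasOwnProperty` remark in the comments can be illustrated with a short sketch. Both factories below are hypothetical, stripped-down loaders written for this comparison only (no locks, no unknown-module check): one uses the truthiness test from the code above, the other an own-property test.

```javascript
// Stripped-down loader using the truthiness test (`if (m[name])`).
function makeTruthyLoader() {
    const fn = {}, m = {};
    return {
        register: function(name, callback) { fn[name] = callback; },
        get: function(name) {
            if (m[name]) {
                return m[name];
            }
            m[name] = fn[name](this);
            return m[name];
        },
    };
}

// Same loader, but it checks whether the key exists instead of its value.
function makeOwnPropertyLoader() {
    const fn = {}, m = {};
    return {
        register: function(name, callback) { fn[name] = callback; },
        get: function(name) {
            if (Object.prototype.hasOwnProperty.call(m, name)) {
                return m[name];
            }
            m[name] = fn[name](this);
            return m[name];
        },
    };
}

// Registers a module that exports a falsy value and requires it twice;
// returns how many times the module function ran.
function countRuns(loader) {
    let runs = 0;
    loader.register("zero", function() { runs += 1; return 0; });
    loader.get("zero");
    loader.get("zero");
    return runs;
}

const truthyRuns = countRuns(makeTruthyLoader());
const ownPropertyRuns = countRuns(makeOwnPropertyLoader());

console.log(truthyRuns, ownPropertyRuns);
```

With the truthiness test the module that exports 0 runs on every `get`; with the own-property test it runs once. The rest of this post keeps the shorter test and simply assumes modules export objects.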

Let’s compact the exceptions into one because those strings take a lot of space:

if (p._lock[name] || !p._fn[name]) {
    throw "Module error: " + name;
}

The error is less explicit but we’ll accept that here.

We try to get as little code as possible then use YUI Compressor to remove the spaces and rename the variables. This means we can still work with (mostly) readable code and let YUI Compressor do the rest for us.

I measure the final code size with the following command:

yuicompressor p.js | wc -c

Right now we have 240 bytes. We need a way to remove 100 bytes. Let’s rename the attributes. _fn becomes f; _m becomes m, _lock becomes l and the public methods are reduced to their first letter. We can also remove the var since p will be global anyway. Let’s also reduce the error message prefix to "p:".

p = {
    f: {}, m: {}, l: {},

    r: function(name, callback) { p.f[name] = callback; },

    g: function(name) {
        if (p.m[name]) {
            return p.m[name];
        }

        if (p.l[name] || !p.f[name]) {
            throw "p:" + name;
        }

        p.l[name] = true;

        try {
            p.m[name] = p.f[name](p);
        } finally {
            delete p.l[name];
        }

        return p.m[name];
    },
};

That’s 186 bytes once compressed. Not bad! Note that we have the same line twice in the g function (previously known as “get”):

return p.m[name];

We can invert the first if condition and fit the whole code in it; combining both returns into one. This is equivalent to transforming this code:

function () {
    if (A) {
        return B;
    }

    // ...
    return B;
}

Into this one:

function () {
    if (!A) {
        // ...
    }

    return B;
}

The first form is usually preferable because it saves one indentation level in the function body. But here return is a keyword the compressor can’t shorten, so writing it only once saves bytes.

Speaking of keywords we can’t compress: how could we remove the delete? All we care about is whether there’s a lock or not, so we can set the value to false instead, at the expense of a little more memory. This saves us only one byte, but since we only care about truthiness we can replace true with 1 and false with 0.

We’re now at 166 bytes and the g function looks like this:

function(name) {
    if (!p.m[name]) {
        if (p.l[name] || !p.f[name]) {
            throw "p:" + name;
        }

        p.l[name] = 1;

        try {
            p.m[name] = p.f[name](p);
        } finally {
            p.l[name] = 0;
        }
    }

    return p.m[name];
}

Now, what if we tried to remove one of the three objects we’re using? We need to keep the functions and the results in separate objects but we might be able to remove the locks object without losing the functionality.

Assuming that modules only return objects, let’s merge m and l. We’ll set p.m[A] to 0 if it’s locked and will then overwrite the lock with the result. p.m[A] then has the following possible values:

  • undefined: the key doesn’t exist; the module hasn’t been required yet
  • 0: the module is currently being executed
  • something else: the module has already been executed; we have its return value

We need to modify our code a little bit for this:

function(name) {
    if (!p.m[name]) {
        if (p.m[name] === 0 || !p.f[name]) {
            throw "p:" + name;
        }

        p.m[name] = 0;
        p.m[name] = p.f[name](p);
    }

    return p.m[name];
}

Note that this allowed us to get rid of the try/finally, which lets us go down to 143 bytes. We can save two more bytes by using < 1 instead of === 0 (undefined < 1 is false, so only the lock value 0 matches).

Replacing || (boolean OR) with | (bitwise OR) saves one more byte and allows us to fit in 140 bytes! We can go further and remove the braces around the inner if since it contains only one statement. We need to do that after the compression because YUI Compressor adds braces if they’re missing.
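Both tricks lean on JavaScript’s type coercion, so here is a small sketch of what the condition evaluates to in each situation. The `shouldThrow` function and the `dummy` callback are names invented for this illustration; they mirror the inner condition of g, with the stored value and the registered function passed in explicitly.

```javascript
// Mirrors the inner condition of `g`: `value` plays the role of p.m[name]
// and `fn` the role of p.f[name]. `<` binds tighter than `|`, so this is
// (value < 1) | (!fn).
function shouldThrow(value, fn) {
    return value < 1 | !fn;
}

// Hypothetical module function, only used as "something registered".
const dummy = function() { return {}; };

// `< 1` matches the lock but not a missing key: undefined converts to NaN,
// and every comparison with NaN is false.
const missingIsLocked = undefined < 1;
const zeroIsLocked = 0 < 1;

// `|` converts both booleans to numbers, so the result is 0 or 1.
const freshModule = shouldThrow(undefined, dummy);      // registered, not run yet
const circularModule = shouldThrow(0, dummy);           // currently running
const unknownModule = shouldThrow(undefined, undefined); // never registered

console.log(missingIsLocked, zeroIsLocked, freshModule, circularModule, unknownModule);
```

The condition yields 0 only for a registered module that hasn’t started running, and 1 for both error cases. Unlike ||, the | operator always evaluates both sides, which is harmless here because neither side has side effects.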

The final code looks like this:

p = {
    f: {}, m: {},

    r: function(name, callback) { p.f[name] = callback; },

    g: function(name) {
        if (!p.m[name]) {
            if (p.m[name] < 1 | !p.f[name])
                throw "p:" + name;

            p.m[name] = 0;
            p.m[name] = p.f[name](p);
        }

        return p.m[name];
    },
};

That’s 139 bytes once compressed! You can see the result at the top of this blog post.
Please add a comment below if you think of any way to reduce this further while preserving all existing features.
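If you don’t have YUI Compressor at hand, the byte counts quoted in this post can be double-checked with a few lines of JavaScript. This is only a sketch: the one-liner is pasted as a string (the `source` name is made up here), and its length is measured directly.

```javascript
// The compressed one-liner from the top of this post, stored as a string.
const source = 'p={f:{},m:{},r:function(a,b){p.f[a]=b},g:function(a){if(!p.m[a]){if(p.m[a]<1|!p.f[a])throw"p:"+a;p.m[a]=0;p.m[a]=p.f[a](p)}return p.m[a]}};';

// Every character is ASCII, so the string length equals the byte count.
const withAssignment = source.length;

// Drop the leading "p=" and the trailing ";" to measure the object literal alone.
const objectOnly = source.slice(2, -1).length;

console.log(withAssignment, objectOnly, withAssignment <= 140);
```

The first number is the size including the assignment, the second the size of the object literal on its own.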

Thank you for reading!