Pristine tarballs and VCS

Joey writes about the difficulty of reconciling the need for pristine tarballs with version control systems. He concludes that it's best to save the tarballs in the repository and, to save space, has written a tool that stores only binary differences. This being another piece of software by Joey, it's probably worth checking out.

But I wonder why we actually bother about pristine tarballs. Sure, our Debian archive scripts require a tarball, but why does that have to be the pristine upstream tarball?

As a Debian maintainer, I use Git to keep track of the software that I package and, if possible, I track the upstream repository directly (Git makes it very easy to track repositories of the other popular version control systems).

When I decide to upload a new upstream version to the Debian archive, I tag the upstream tree, sign the tag, and generate a tarball from it:

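# Tag the upstream tree with a signed tag
git tag -s -m 'preparing release of upstream 1.2.3' upstream/1.2.3
# Export the tagged tree as the orig tarball, one level above the work tree
git archive --prefix=foo-1.2.3/ upstream/1.2.3 | gzip -9 \
  > ../foo_1.2.3.orig.tar.gz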

Then I merge the tag into my Debian branch and prepare a Debian release, which I also tag (debian/1.2.3-1) just before uploading the package to our archive.
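
The merge and the release tag could be wrapped up roughly as follows. This is only a sketch: the branch name `debian`, the function names, and the `DEBIAN_BRANCH` and `TAG_OPTS` variables are placeholders for illustration, and only the tag names follow the scheme described above.

```shell
#!/bin/sh
# Sketch only. Assumptions: the packaging branch is called "debian"
# (override with DEBIAN_BRANCH), and TAG_OPTS defaults to -s (signed tag).

# merge_upstream UPSTREAM_VERSION
# Switch to the packaging branch and merge the tag upstream/UPSTREAM_VERSION.
merge_upstream() {
    branch=${DEBIAN_BRANCH:-debian}
    git checkout "$branch" || return 1
    git merge -m "merge upstream $1" "upstream/$1"
}

# tag_debian_release UPSTREAM_VERSION DEBIAN_REVISION
# Tag the current state of the packaging branch just before uploading.
tag_debian_release() {
    git tag ${TAG_OPTS:--s} -m "Debian release $1-$2" "debian/$1-$2"
}
```

With these, `merge_upstream 1.2.3` followed (after the packaging work) by `tag_debian_release 1.2.3 1` would leave a tag named debian/1.2.3-1.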

This is very flexible and allows me to release packages based on software snapshots, as I just did with mdadm. Now, a filename of mdadm_2.6.3+200709292116+4450e59.orig.tar.gz pretty clearly states that this is not mdadm-2.6.3.tar.gz, so people would be aware that this tarball is not "upstream-official".
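
A version string of that shape can be derived from the repository itself. The sketch below assumes that the middle part is the commit date as YYYYMMDDHHMM in UTC and that the last part is the abbreviated commit hash; the post does not say how the mdadm string was actually built, and `snapshot_version` is a made-up name.

```shell
#!/bin/sh
# snapshot_version BASE_VERSION [COMMIT]
# Prints BASE+TIMESTAMP+HASH for COMMIT (default: HEAD).
# Assumption: TIMESTAMP is the commit date in UTC, HASH the short commit id.
snapshot_version() {
    base=$1
    commit=${2:-HEAD}
    stamp=$(TZ=UTC git log -1 --date=format-local:%Y%m%d%H%M \
        --format=%cd "$commit") || return 1
    hash=$(git rev-parse --short=7 "$commit") || return 1
    printf '%s+%s+%s\n' "$base" "$stamp" "$hash"
}
```

The output of `snapshot_version 2.6.3` could then be used both in the tag name and in the orig tarball's filename.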

But what about official releases, like mdadm-2.6.3.tar.gz? When preparing the Debian package, it's only slightly more effort for me to download the official tarball and store it as ../mdadm_2.6.3.orig.tar.gz, but why do I need to keep it around after I have uploaded Debian package 2.6.3-1? I know I need the orig tarball to prepare the diff.gz for 2.6.3-2, but for that purpose, I can really just git-archive the appropriate tag. The result won't have the same hash as the official tarball (the tar format stores timestamps and is thus not really suited to this purpose), but it provides the same contents.
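
Whether two tarballs really carry the same contents despite different hashes can be checked by unpacking both and comparing the trees. A minimal sketch, where `same_contents` is a hypothetical helper; it compares file names and file contents only, not permissions, and it expects both tarballs to use the same top-level directory name:

```shell
#!/bin/sh
# same_contents A.tar.gz B.tar.gz
# Succeeds if both tarballs unpack to identical trees, ignoring the
# timestamps and other metadata that make their hashes differ.
same_contents() {
    a=$(mktemp -d) && b=$(mktemp -d) || return 2
    if tar -xzf "$1" -C "$a" && tar -xzf "$2" -C "$b" &&
        diff -r "$a" "$b" >/dev/null
    then
        status=0
    else
        status=1
    fi
    rm -rf "$a" "$b"
    return $status
}
```

For example, `same_contents mdadm-2.6.3.tar.gz ../mdadm_2.6.3.orig.tar.gz` would tell me whether a regenerated orig tarball matches the official one, provided the directory prefix inside both is the same.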

In fact, what I (or someone else) should finally implement is a tool to create the diff.gz directly from the version control data… or even better, we should finally get rid of source packages as I've previously suggested, long before Ubuntu came around to propose and blueprint the concept.
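
As a rough idea of what such a tool might start from: `git diff` can already emit a unified diff between the two tags with directory prefixes of the kind a diff.gz uses. This is a sketch under assumptions, not a finished tool. `vcs_diff_gz` is a made-up name, I have not checked whether dpkg-source accepts the result in all cases, and changes to binary files or file modes are not covered.

```shell
#!/bin/sh
# vcs_diff_gz PACKAGE UPSTREAM_VERSION DEBIAN_REVISION
# Writes a gzip-compressed unified diff between the tags
# upstream/UPSTREAM_VERSION and debian/UPSTREAM_VERSION-DEBIAN_REVISION
# to standard output.
vcs_diff_gz() {
    pkg=$1
    upstream=$2
    revision=$3
    git diff --no-renames \
        --src-prefix="$pkg-$upstream.orig/" \
        --dst-prefix="$pkg-$upstream/" \
        "upstream/$upstream" "debian/$upstream-$revision" -- | gzip -9
}
```

Usage would look like `vcs_diff_gz foo 1.2.3 1 > ../foo_1.2.3-1.diff.gz`.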

I am aware of some people's preference to be able to "independently verify Debian's orig tarball using e.g. a detached GPG signature provided by upstream," but I dare to question the point of this verification. Those who care enough to verify that the Debian-provided orig tarball is pristine are hopefully building their own binary packages instead of trusting mine (another beef I've raised in the past), and then they might just as well download the source directly from upstream. I'd say most people (rightfully) don't do this and just use the Debian-provided content, which can be assumed to be pristine if you trust me and Debian: my GPG signature authenticates uploads to the Debian servers, and APT provides authenticity checks downstream up until the installation on an end-user system.

… and if you trust the version control system. This actually raises an interesting point for me: how feasible would a man-in-the-middle attack between Neil's (the mdadm upstream) repository and mine be, given that I use the plain-text git transport protocol to fetch new commits? If you have any comments on that, please let me know. I'll leave this issue for another discussion or post.

NP: The Beatles: Abbey Road

Update: Matej Cepl pointed me to this thread on the fedora-devel mailing list, where they discuss switching from CVS to Git and talk a lot about their needs and how the end result is much the same as what I am proposing. Thanks!