Joey writes about the difficulty of reconciling the need for pristine tarballs with version control systems. He concludes that it’s best to store the tarballs in the repository, and has written a tool that saves space by storing only binary differences. This being another piece of software by Joey, it’s probably worth checking out.
But I wonder why we actually bother with pristine tarballs. Sure, our Debian archive scripts require a tarball, but why does that have to be the pristine upstream tarball?
As a Debian maintainer, I use Git to keep track of the software that I package and, where possible, I track the upstream repository directly (Git makes it very easy to track repositories of the other popular version control systems).
When I decide to upload a new upstream version to the Debian archive, I tag the upstream tree, sign the tag, and generate a tarball from it:
git tag -s -m'preparing release of upstream 1.2.3' upstream/1.2.3
git archive --prefix=foo-1.2.3/ upstream/1.2.3 | gzip -9 \
> ../foo_1.2.3.orig.tar.gz
Then I merge the tag into my Debian branch and prepare a Debian release, which I also tag (debian/1.2.3-1) just before uploading the package to our archive.
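The merge step can be sketched as a small shell helper. The branch name debian and the function name merge_upstream are assumptions for illustration; only the tag names follow the upstream/1.2.3 and debian/1.2.3-1 pattern used above.

```shell
# Hypothetical helper: merge an upstream tag into the packaging branch.
# $1 is the upstream tag; $2 is the packaging branch (assumed to be "debian").
merge_upstream() {
    upstream_tag=$1
    branch=${2:-debian}
    # The trailing "--" makes git treat the name as a branch even though
    # a debian/ directory usually exists in the working tree as well.
    git checkout "$branch" -- || return 1
    # --no-edit accepts the default merge message without opening an editor.
    git merge --no-edit "$upstream_tag"
}

# Usage (the -s flag signs the tag and needs a GPG key):
#   merge_upstream upstream/1.2.3
#   ... prepare the Debian release ...
#   git tag -s -m 'Debian release 1.2.3-1' debian/1.2.3-1
```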
This is very flexible and allows me to release packages based on software snapshots, as I just did with mdadm. Now, a filename of mdadm_2.6.3+200709292116+4450e59.orig.tar.gz pretty clearly states that this is not mdadm-2.6.3.tar.gz, so people would be aware that this tarball is not “upstream-official”.
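A snapshot version of that shape can be derived from the repository itself. The sketch below assumes that the two suffixes are the commit's timestamp (YYYYMMDDhhmm) and its abbreviated hash, which is what the filename suggests but is not spelled out here; the function name snapshot_version is hypothetical, and the date format option needs a reasonably recent Git.

```shell
# Hypothetical helper: print a snapshot version such as 2.6.3+200709292116+4450e59.
# $1 is the last upstream release the snapshot is based on; $2 is the commit (default HEAD).
snapshot_version() {
    base=$1
    commit=${2:-HEAD}
    # Committer date of the snapshot commit, formatted as YYYYMMDDhhmm.
    stamp=$(git log -1 --format=%cd --date=format:%Y%m%d%H%M "$commit") || return 1
    # Abbreviated commit hash, at least seven characters long.
    hash=$(git rev-parse --short=7 "$commit") || return 1
    printf '%s+%s+%s\n' "$base" "$stamp" "$hash"
}

# Usage:
#   version=$(snapshot_version 2.6.3)
#   git archive --prefix="mdadm-$version/" HEAD | gzip -9 \
#       > "../mdadm_$version.orig.tar.gz"
```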
But what about official releases, like mdadm-2.6.3.tar.gz? When preparing the Debian package, it’s only slightly more effort for me to download the official tarball and store it as ../mdadm_2.6.3.orig.tar.gz, but why do I need to keep it around after I have uploaded Debian package 2.6.3-1? I know I need the orig tarball to prepare the diff.gz for 2.6.3-2, but for that purpose, I can really just git-archive the appropriate tag. The result won’t have the same hash as the official tarball (because the tar format stores timestamps and is thus not really suited to this purpose), but it provides the same contents.
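Whether a regenerated tarball really carries the same contents as the official one can be checked by unpacking both and comparing the trees. This is a minimal sketch; the function name same_tarball_contents is hypothetical, and it assumes both tarballs use the same top-level directory name (which the --prefix option of git archive controls).

```shell
# Hypothetical helper: succeed if two .tar.gz files unpack to identical trees.
same_tarball_contents() {
    a=$(mktemp -d) || return 1
    b=$(mktemp -d) || return 1
    # Unpack each tarball into its own scratch directory.
    tar -xzf "$1" -C "$a" && tar -xzf "$2" -C "$b" || { rm -rf "$a" "$b"; return 1; }
    # diff -r compares file names and contents, but not timestamps or ownership.
    diff -r "$a" "$b"
    status=$?
    rm -rf "$a" "$b"
    return $status
}

# Usage:
#   git archive --prefix=mdadm-2.6.3/ upstream/2.6.3 | gzip -9 > /tmp/regenerated.tar.gz
#   same_tarball_contents mdadm-2.6.3.tar.gz /tmp/regenerated.tar.gz && echo same
```

Note that upstream release tarballs sometimes ship generated files that are not under version control, in which case the comparison will report them as differences.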
In fact, what I (or someone else) should finally implement is a tool to create the diff.gz directly from the version control data… or even better, we should finally get rid of source packages, as I’ve previously suggested, long before Ubuntu came around to propose and blueprint the concept.
I am aware of some people’s preference to be able to “independently verify Debian’s orig tarball using e.g. a detached GPG signature provided by upstream,” but I dare to question the point of this verification. Those who care enough to verify that the Debian-provided orig tarball is pristine are hopefully building their own binary packages instead of trusting mine (another beef I’ve raised in the past), and then they might just as well download the source directly from upstream. I’d say most people (rightfully) don’t do this and just use the Debian-provided content, which can be assumed to be pristine if you trust me and Debian: my GPG signature authenticates uploads to the Debian servers, and APT provides authenticity checks downstream, all the way to the installation on an end-user system.
… and if you trust the version control system. This actually raises an interesting point for me: how feasible would a man-in-the-middle attack between Neil’s (the mdadm upstream) repository and mine be, given that I use the plain-text git transport protocol to fetch new commits? If you have any comments on that, please let me know. I’ll leave this issue for another discussion or post.
NP: The Beatles: Abbey Road
Update: Matej Cepl pointed me to this thread on the fedora-devel mailing list, where they discuss switching from CVS to Git and talk a lot about their needs and how the end result is much the same as what I am proposing. Thanks!

