Archiving web pages

Every now and then I encounter a web page which I'd like to archive for later purposes, e.g. because I can see that it might be related to my research, or simply because I want to keep the information until eternity.

I know of three ways to achieve this, but each of them has drawbacks:

So, dear lazyweb, how do you do it? I am aware that it would be trivial to hack up a little script which uses wget to fetch a page and its dependencies, writes a file with stamp data, and produces a frameset to put this data above the actual page, but before I spend time reinventing the wheel, I'd like to make sure it doesn't already exist.
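Something like the following sketch is what I have in mind, assuming wget is on the PATH; the archive_page function and the page/index.html entry point are only illustrative guesses, not a finished tool:

    #!/usr/bin/env python3
    # Minimal sketch of the idea described above: use wget to fetch a page
    # and its dependencies, write a stamp file recording when and from
    # where the snapshot was taken, and generate a frameset that shows the
    # stamp above the archived page.  Names and layout are illustrative.

    import subprocess
    import sys
    from datetime import datetime, timezone
    from pathlib import Path


    def archive_page(url: str, dest: Path) -> None:
        dest.mkdir(parents=True, exist_ok=True)

        # -p: also fetch page requisites (images, CSS, ...)
        # -k: convert links so the copy works offline
        # -E: append .html where appropriate
        # -nH: no per-host directories; -P: output directory
        subprocess.run(
            ["wget", "-p", "-k", "-E", "-nH", "-P", str(dest / "page"), url],
            check=True,
        )

        # Stamp data: source URL and archival time.
        (dest / "stamp.html").write_text(
            f"<p>Archived {url} on {datetime.now(timezone.utc).isoformat()}</p>\n"
        )

        # Frameset with the stamp above the archived page.  The entry file
        # name is a guess; wget derives it from the URL, so a real script
        # would look it up instead of hard-coding "page/index.html".
        (dest / "index.html").write_text(
            "<html><frameset rows='60,*'>\n"
            "  <frame src='stamp.html'>\n"
            "  <frame src='page/index.html'>\n"
            "</frameset></html>\n"
        )


    if __name__ == "__main__":
        archive_page(sys.argv[1], Path(sys.argv[2]))

The resulting directory could then simply be checked into version control.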

NP: The Flower Kings: Paradox Hotel

Update: Michael Stevens pointed me to furl.net, but I need a solution that I can use offline. Part of the reason I want to archive web pages is that I don't want to come back in a couple of months and find a web page removed, a server dead, or a site like furl.net discontinued.

Jörg Jaspert introduced me to Scrapbook, a Firefox extension I'll have to check out. However, even before looking at it: I really want files or directories on the filesystem, so that I can check them into version control and have them archived and replicated. I don't trust any data to Firefox or any of its extensions, or to anything under ~/.mozilla. This is also the reason why Slogger doesn't cut the mustard for me.

Matej Cepl wants me to use konqueror to archive them. I don't want to install hundreds of megabytes for this feature, nor can I imagine that it would be much different from, or less cumbersome than, Firefox's ability to save pages. I hate the mouse.

Thanks for all your comments, regardless!

Update: Just found this post about archiving stuff like digg/delicious.

Update: Jörg pointed out that Scrapbook can save files outside the Firefox profile, but it's still not what I am looking for: I want a single file, ideally, or a directory.

Marcos Dione points me to his Kreissy browser, which seems to take an interesting approach, but it also depends on KDE, and as with konqueror, I don't want to install a hundred dependencies.

Alfonso Ali mentioned Zotero, which uses an sqlite database and seems interesting, but databases are not suitable for storage in version control systems, so this one is out too.

Andreas Schamanek wrote in with a pointer to the MHT format, which bundles an HTML page and its dependencies into a single plain-text file using MIME. Internet Explorer apparently handles this format already, and UnMHT adds support for other browsers. As Firefox 6 is not yet supported, I tried the Mozilla Archive Format extension instead, which seems to do the same thing and works quite well.
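To illustrate what MHT looks like under the hood: it is essentially a multipart/related MIME message, so Python's standard email library can assemble one. The sketch below is illustrative only; build_mht and the assumption that all dependencies are PNG images are mine, not part of UnMHT or any of the tools above:

    #!/usr/bin/env python3
    # Illustrative only: an MHT file is a multipart/related MIME message in
    # which every part carries one resource of the page, tagged with a
    # Content-Location header so the browser can resolve references.

    from email.mime.image import MIMEImage
    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText


    def build_mht(page_url: str, html: str, images: dict) -> bytes:
        msg = MIMEMultipart("related", type="text/html")
        msg["Subject"] = f"Archive of {page_url}"

        # Root part: the HTML document itself, tagged with its original URL.
        root = MIMEText(html, "html", "utf-8")
        root["Content-Location"] = page_url
        msg.attach(root)

        # Dependencies (assumed to be PNG images here); each part keeps the
        # URL it was originally fetched from.
        for url, data in images.items():
            part = MIMEImage(data, _subtype="png")
            part["Content-Location"] = url
            msg.attach(part)

        return msg.as_bytes()


    if __name__ == "__main__":
        html = '<html><body><img src="logo.png"></body></html>'
        # The image bytes are a placeholder, not a real PNG.
        mht = build_mht("http://example.com/", html,
                        {"http://example.com/logo.png": b"not-a-real-png"})
        with open("example.mht", "wb") as fh:
            fh.write(mht)

Since the result is a single plain-text file, it also fits my requirement of being easy to check into version control.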