Every now and then I encounter a web page which I’d like to archive for later use, e.g. because I can see that it might be related to my research, or simply because I want to keep the information for eternity.
I know of three ways to achieve this, but each of them has drawbacks:
- Using wget to download the page. This does not pull any dependencies (although wget could), nor does it stamp the downloaded file with the original URL and the time it was downloaded.
- Printing the page to a PDF. Apart from being a cumbersome process, requiring at least five mouse clicks and way more attention than this should take (in Firefox at least), printed webpages don’t look the same as on-screen. Even though Firefox stamps printouts with URL and time, this information is truncated if e.g. the URL is too long, and in a printout, you lose all the meta information, as well as links etc.
- Using Firefox’s ‘Save as’ feature. This is again too cumbersome for my taste, but it does manage to properly archive a webpage with all dependencies. However, it drops a file and a folder on your hard disk, which I find rather annoying. And it doesn’t stamp the downloaded files with either URL or time.
So, dear lazyweb, how do you do it? I am aware that it’ll be
trivial to hack up a little script which uses wget to
obtain page and dependencies, create a file with stamp data and
produce a frameset to put this data above the actual file, but
before I spend time implementing the wheel, I’d like to make sure
it doesn’t yet exist.
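For illustration, here is a minimal sketch of what such a script might look like, written in Python around wget. It is not an existing tool. The function names, the file names _stamp.html and _frameset.html, and the way the saved page's file name is guessed from the URL are all assumptions made for this example, and the guess only covers simple URLs without query strings.

```python
import html
import posixpath
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import quote, urlsplit


def build_wget_command(url, dest):
    """Return a wget invocation that fetches a page plus its dependencies."""
    return [
        "wget",
        "--page-requisites",   # images, stylesheets, scripts the page needs
        "--convert-links",     # rewrite links so the copy works offline
        "--adjust-extension",  # append .html to HTML pages that lack it
        "--span-hosts",        # dependencies often live on other hosts
        "--no-directories",    # keep everything flat inside dest
        "--directory-prefix", str(dest),
        url,
    ]


def guess_page_name(url):
    """Heuristic (assumption): the file name wget gives the main page."""
    name = posixpath.basename(urlsplit(url).path) or "index.html"
    if not name.lower().endswith((".html", ".htm")):
        name += ".html"  # mirrors what --adjust-extension does
    return name


def write_stamp(dest, url, when):
    """Write a small HTML file recording the source URL and download time."""
    stamp = Path(dest) / "_stamp.html"
    safe_url = html.escape(url, quote=True)
    stamp.write_text(
        f'<html><body><p>Archived from <a href="{safe_url}">{safe_url}</a> '
        f"at {when.isoformat()}</p></body></html>\n",
        encoding="utf-8",
    )
    return stamp


def write_frameset(dest, page_name):
    """Write a frameset that shows the stamp above the archived page."""
    frameset = Path(dest) / "_frameset.html"
    frameset.write_text(
        '<html><frameset rows="40,*">'
        '<frame src="_stamp.html">'
        f'<frame src="{html.escape(quote(page_name))}">'
        "</frameset></html>\n",
        encoding="utf-8",
    )
    return frameset


def archive(url, dest):
    """Fetch url into the directory dest and add stamp and frameset."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    # wget can exit non-zero when a single dependency fails to download,
    # so the return code is deliberately not treated as fatal here.
    subprocess.run(build_wget_command(url, dest), check=False)
    write_stamp(dest, url, datetime.now(timezone.utc))
    return write_frameset(dest, guess_page_name(url))
```

Calling archive("http://example.org/", "example-archive") would leave a single directory that can be checked into version control, with _frameset.html as the entry point.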
NP: The Flower Kings: Paradox Hotel
Update: Michael Stevens pointed me to furl.net, but I need a solution that I can
use offline. Part of the reason that I want to archive web pages is
because I don’t want to come back in a couple of months and find a
webpage removed, a server dead, or a site like
furl.net discontinued.
Jörg Jaspert introduced
me to Scrapbook, a Firefox
extension I’ll have to check out. However, even before looking at
it: I really want files or directories on the filesystem, so that I
can check them into version control and have them archived and
replicated. I don’t trust any data to Firefox or any of its
extensions, or anything under ~/.mozilla. This is also
the reason why Slogger doesn’t cut the
mustard for me.
Matej Cepl wants me to use konqueror to archive
them. I don’t want to install hundreds of megabytes for this
feature, nor can I imagine that this feature will be much different
or less cumbersome than Firefox’s ability to save pages. I hate the
mouse.
Thanks for all your comments, regardless!
Update: Just found this post about archiving stuff like digg/delicious.
Update: Jörg pointed out that Scrapbook can save files outside the Firefox profile, but it’s still not what I am looking for: I want a single file, ideally, or a directory.
Marcos Dione points me to his Kreissy browser, which seems to
take an interesting approach, but it also depends on KDE and as
with konqueror, I don’t want to install a hundred
dependencies.
Alfonso Ali mentioned Zotero, which uses an
sqlite database and seems interesting, but databases
are not suitable for storage in version control systems, so this
one is out too.

