This tutorial is a spin-off from the Download an Entire Website article, a second part if you will. In case you are new to archival with Wget, read that first for the baby steps of the setup. Recently, I learned a few more things and applied them to download a Discourse site successfully. I had joined a time-limited, membership-only mastermind group whose run has now come to an end, so I wanted to package up the content and keep it for later reference if I ever need it.
I had to use a few additional Wget options to archive a private forum. Since Wget needs to be logged in to the site, a cookie is a must. I'm also adding a random time delay, discarding the robots directive, and acting as Googlebot to download an HTML-only version of Discourse. You might have noticed Discourse is a JS-heavy, colorful platform; your archive will be nothing like that. If the content is what matters to you, though, that is certainly possible to fetch.
Wget options to download Discourse
As always, consult the Wget documentation for more info. These are just the extra options specific to this situation with my recommended values.
--wait=1
Delay every request by a second. It might not seem like much, but compared to how fast Wget usually fires requests and how many files it will download, it's a balanced value. Since you are personally identifiable to the site owner, you want to play nice and not overwhelm the server. A ban, which isn't necessarily human-initiated, can mean losing access to the site, and you can't just juggle VPNs to get a new IP since you likely only have one account.
--random-wait
Make your one-second delay anything between 0.5s and 1.5s. As a result, Wget's activity will appear less robotic. You don't want to look like automated retrieval, do you?
--user-agent="Googlebot/2.1 (+http://www.google.com/bot.html)"
This user agent is vital for downloading Discourse. As it turns out, the platform eases the life of bots by serving a static, crawler-friendly version of the site. You can see why and how this works by checking the source code on GitHub; the crawler detection and the user-agent specs are all there.
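If you'd like to preview what this crawler view looks like before committing to a full mirror, a quick spot check from the terminal works. A minimal sketch (example.com stands in for your forum; if the forum is private, also add the cookie header described below):
wget -qO- --user-agent="Googlebot/2.1 (+http://www.google.com/bot.html)" https://example.com/ | head -n 40
You should see plain, server-rendered HTML rather than the usual JavaScript application shell.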
--execute robots=off
Execute a wgetrc command that acts as a directive to disregard robots blocking. It's primarily useful on a private forum where the robots.txt file blocks search engines. Check the contents of that file for anything meaningful, though: some parts of the site may be hidden from robots for a reason other than privacy.
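To see what the file actually blocks, you can print it straight to the terminal (example.com is a placeholder again):
wget -qO- https://example.com/robots.txt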
--no-cookies
In the old days, browsers used to have a cookies.txt file you could feed to Wget. Not anymore. I'm not using Wget's traditional cookie handling, so this feature is unnecessary; the next option supplies the cookie directly.
--header "Cookie: _t=1a4a5800ba501ebba6078b392003dedb"
The cookie header is the heart of it all. I'm specifying the auth token, so the site treats Wget as if it were logged in. For non-Discourse sites, the cookie name will be something other than _t, and as for the random hex value, it's likely to change or expire eventually. See the next section to get this value.
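Once you have it, it may be worth a sanity check that the cookie actually grants access before kicking off the full mirror. A minimal sketch with a hypothetical members-only topic URL: without a valid _t you should get bounced to the login page or see an error status, while with the cookie the final response should be a 200 OK.
wget --spider --server-response --header "Cookie: _t=1a4a5800ba501ebba6078b392003dedb" https://example.com/t/private-topic/123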
How to find your auth token cookie
- Visit and log in to the Discourse forum in Chrome or a similar browser like Vivaldi.
- Press F12 to access the Developer Tools.
- Go to the Application tab.
- On the left side, under Storage, open Cookies, and the site’s host.
- Find _t in the table and copy the value.
See more about the auth token in the Ruby source files of Discourse by looking for cookies["_t"] in default_current_user_provider_spec.rb or session_controller_spec.rb on GitHub.
If you are downloading a site that is not Discourse, here is how to find out which cookie is responsible for the login state: start deleting them one by one in the browser's cookie table and see which one logs you out.
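A command-line variant of the same experiment, if clicking around the browser gets tedious: send each candidate cookie on its own against a members-only page and compare the status lines. The cookie names and the URL below are made up for illustration.
for c in "sessionid=abc123" "auth_token=def456"; do
  echo "== $c =="
  wget --spider --server-response --header "Cookie: $c" https://example.com/members-only-page 2>&1 | grep "HTTP/"
done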
The Wget command to download a Discourse site
With all that, I put together the following command:
wget --mirror --page-requisites --convert-links --adjust-extension --compression=auto --reject-regex "/search" --no-if-modified-since --no-check-certificate --execute robots=off --random-wait --wait=1 --user-agent="Googlebot/2.1 (+http://www.google.com/bot.html)" --no-cookies --header "Cookie: _t=1a4a5800ba501ebba6078b392003dedb" https://example.com/
As usual, you don't get an overall progress bar, as there is no way to know beforehand how much data is to come. For reference, a site with around 3K threads, 200 users, and 5K media files took:
Total wall clock time: 11h 23m 46s
Downloaded: 24956 files, 4,8G in 2h 19m 4s (600 KB/s)
The wall clock time is this long because of the random delays that pause execution between requests.
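Given a run of that length, it may help to let Wget detach from the terminal and write its output to a log, so a closed terminal window or dropped SSH session doesn't kill it. A small addition to the command above should do it, at least on Linux/macOS (the log file name is just my choice):
--background --output-file=wget.log
You can then follow the progress with tail -f wget.log.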
If you wish to check the results while the download is underway, don't be discouraged by broken CSS. The --convert-links option will process your files after the download is complete (it will try to rewrite the asset URLs to be relative).
How to use your Discourse archive
Your local version is not perfect: if you open the index or a category file, you'll see the thread lists are rather bland. There is no longer any site search, and navigating the thread lists is not very productive, so it's unlikely you'd want to read every thread there is; come up with a new way of discovering what's interesting to you. Your best bet is entering the t folder (threads) and opening a thread judging by its directory name. You could also use the “find in files” feature of software like Notepad++, Total Commander, Sublime Text, or Visual Studio Code.
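On the command line, the same “find in files” idea is a one-liner. A sketch, assuming you run it from the root of the mirror and swap in whatever phrase you're after:
grep -rli "interesting phrase" t/
-r recurses into the thread folders, -l prints only the matching file names, and -i ignores case.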
How to find your threads
I wanted a way to find threads started by me or threads where I contributed. Here is how to isolate those; I only do this on Linux (Ubuntu in a VM). First, just to be safe, make a copy of the t folder (it has one subdirectory for each thread) and open that copy in the terminal.
This command copies the folders of threads that contain my user name (firsh) into a separate folder one level up (../firsh).
mkdir -p ../firsh && grep -ilsr firsh . | sed -r 's|\./([^/]+)/.*$|\1|' | sort | uniq | xargs cp -rt ../firsh
To copy just the threads I started (first poster):
mkdir -p ../by-firsh && grep -Pzilsr '(?s)<span class=.creator. itemprop=.author. .+firsh.+<span itemprop=.position.>#1</span>' . | sed -r 's|\./([^/]+)/.*$|\1|' | sort | uniq | xargs cp -rt ../by-firsh
Ongoing archival?
Please note that I have yet to encounter a scenario where I archive a site on an ongoing basis, so I'm not entirely sure how you'd proceed if you wanted to re-archive and only download the changes. For me, these projects are one-time efforts when I know a site is about to go down or I'm about to lose access. If I were to refresh the archive, I'd probably download the entire site again in a different directory and delete the old copy.