HTTRUTA Crawler Project
httruta downloads all the links from a feed using tools such as httrack and wkhtmltopdf. It is the engine behind the Cache feature used by the Semantic Scuttle instance known as Fluxo de Links.
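In rough terms, it fetches the feed, extracts each item's link, and saves a local copy of every page. The sketch below is only an illustration of that idea, not the real httracker script; the feed URL, the archive directory, and the way links are extracted are all assumptions.

    #!/bin/sh
    # Rough sketch only, not the real httracker script.
    # FEED_URL and ARCHIVE_DIR are placeholder values; adjust them.
    FEED_URL="https://example.org/feed.rss"
    ARCHIVE_DIR="$HOME/httruta-archive"

    mkdir -p "$ARCHIVE_DIR"

    # Pull the <link> entries out of the feed (assumes GNU grep with -P support).
    curl -s "$FEED_URL" | grep -oP '<link>\K[^<]+' | while read -r url; do
        # Build a filesystem-friendly name from the URL.
        name=$(printf '%s' "$url" | tr -c 'A-Za-z0-9' '_')
        # Mirror the page and its assets with httrack ...
        httrack "$url" -O "$ARCHIVE_DIR/$name"
        # ... and keep a static PDF snapshot with wkhtmltopdf.
        wkhtmltopdf "$url" "$ARCHIVE_DIR/$name.pdf"
    done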
Installation
git clone https://git.fluxo.info/httruta
Dependencies
Recommended:
sudo apt install httrack wkhtmltopdf
Configuration
The default config is tuned to fetch everything newly added to Fluxo de Links.
You can also use httruta to archive any other website that provides an RSS feed. To customize httruta, copy the file config.default to config and edit it to suit your needs.
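For example, from a checkout of the repository:

    cp config.default config
    $EDITOR config   # edit the settings to suit your needs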
Usage
Place the httracker script somewhere (for example under /var/sites/cache/httruta) and set up a cron job like this:
*/5 * * * * /var/sites/cache/httruta/httracker > /dev/null 2>&1
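One way to install that entry, assuming the repository lives at /var/sites/cache/httruta, is to append it to the current user's crontab:

    ( crontab -l 2>/dev/null; echo '*/5 * * * * /var/sites/cache/httruta/httracker > /dev/null 2>&1' ) | crontab -

The redirection is written as > /dev/null 2>&1 rather than bash's &> because cron runs jobs with /bin/sh by default.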
Alternatives
- https://github.com/webrecorder/pywb/
- https://github.com/chfoo/wpull