HTTRUTA Crawler Project
httruta downloads all the links from a feed using tools such as httrack and wkhtmltopdf. It is the engine behind the Cache feature used by the Semantic Scuttle instance known as Fluxo de Links.
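In rough terms, it fetches the feed, extracts each item's link, and saves a local copy of every page. The sketch below is only an illustration of that idea, not the real httracker script; the feed URL, the archive directory, and the way links are extracted are all assumptions.

    #!/bin/sh
    # Rough sketch only, not the real httracker script.
    # FEED_URL and ARCHIVE_DIR are placeholder values; adjust them.
    FEED_URL="https://example.org/feed.rss"
    ARCHIVE_DIR="$HOME/httruta-archive"

    mkdir -p "$ARCHIVE_DIR"

    # Pull the <link> entries out of the feed (assumes GNU grep with -P support).
    curl -s "$FEED_URL" | grep -oP '<link>\K[^<]+' | while read -r url; do
        # Build a filesystem-friendly name from the URL.
        name=$(printf '%s' "$url" | tr -c 'A-Za-z0-9' '_')
        # Mirror the page and its assets with httrack ...
        httrack "$url" -O "$ARCHIVE_DIR/$name"
        # ... and keep a static PDF snapshot with wkhtmltopdf.
        wkhtmltopdf "$url" "$ARCHIVE_DIR/$name.pdf"
    done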
Installation
git clone https://git.fluxo.info/httruta
Dependencies
Recommended:
sudo apt install httrack wkhtmltopdf
Configuration
The default config is tuned to fetch everything newly added to Fluxo de Links.
You can also use httruta to archive any other website that provides an RSS feed. To customize httruta, copy the file config.default to config and edit it to suit your needs.
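For example, from a checkout of the repository:

    cp config.default config
    $EDITOR config   # edit the settings to suit your needs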
Usage
Place the httracker script somewhere (for example under /var/sites/cache/httruta) and set up a cron job like this:
*/5 * * * * /var/sites/cache/httruta/httracker > /dev/null 2>&1
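One way to install that entry, assuming the repository lives at /var/sites/cache/httruta, is to append it to the current user's crontab:

    ( crontab -l 2>/dev/null; echo '*/5 * * * * /var/sites/cache/httruta/httracker > /dev/null 2>&1' ) | crontab -

The redirection is written as > /dev/null 2>&1 rather than bash's &> because cron runs jobs with /bin/sh by default.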
Alternatives
- https://github.com/webrecorder/pywb/
- https://github.com/chfoo/wpull