HTTRUTA Crawler Project
=======================

Downloads all links from an RSS feed using tools like
[httrack](http://www.httrack.com) and [wkhtmltopdf](https://wkhtmltopdf.org).
This is the engine behind the [Cache](https://cache.fluxo.info) feature of the
[Semantic Scuttle](http://semanticscuttle.sourceforge.net/) instance known as
[Fluxo de Links](https://links.fluxo.info).

Installation
------------

    git clone https://git.fluxo.info/httruta

Dependencies
------------

Recommended:

    sudo apt install httrack wkhtmltopdf

Configuration
-------------

The default config is optimized for fetching everything newly added to [Fluxo
de Links](https://links.fluxo.info).

You can use httruta to archive any other website with RSS support. To customize
it, copy `config.default` to `config` and edit it to suit your needs.
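
For illustration, a customized `config` might point the crawler at a different
feed and output location. The variable names below are hypothetical, so check
`config.default` for the actual options:

    # Hypothetical settings; the real option names live in config.default.
    FEED_URL="https://example.org/feed.rss"    # RSS feed whose links get archived
    ARCHIVE_DIR="/var/sites/cache/archive"     # where mirrors and PDFs are stored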

Usage
-----

Place this script somewhere suitable and set up a cron job like this:

    */5 * * * * /var/sites/cache/httruta/httracker &> /dev/null
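
As a rough sketch of what each run does, following the description above (this
is not the actual `httracker` script; the feed URL, output paths, and the
`<link>`-based feed parsing are illustrative assumptions):

    #!/bin/sh
    # Illustrative sketch, not the real httracker: fetch an RSS feed and
    # archive every linked page with httrack (mirror) and wkhtmltopdf (PDF).
    FEED_URL="https://links.fluxo.info/rss"    # hypothetical feed URL
    DEST="/var/sites/cache/archive"            # hypothetical output directory

    # Pull item links out of the feed (assumes <link> elements and GNU grep).
    curl -s "$FEED_URL" | grep -oP '(?<=<link>)[^<]+' | while read -r url; do
        id=$(printf '%s' "$url" | md5sum | cut -d' ' -f1)
        httrack "$url" -O "$DEST/$id" -r1    # shallow mirror of the page
        wkhtmltopdf "$url" "$DEST/$id.pdf"   # static PDF snapshot
    done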

Alternatives
------------

- [pywb](https://github.com/webrecorder/pywb/)
- [wpull](https://github.com/chfoo/wpull)