HTTRUTA Crawler Project
=======================

Download all links from an RSS feed and archive them locally using tools like
[httrack](http://www.httrack.com) and [wkhtmltopdf](https://wkhtmltopdf.org).
This is the engine behind the [Cache](https://cache.fluxo.info) feature of
[Fluxo de Links](https://links.fluxo.info), an instance of
[Semantic Scuttle](http://semanticscuttle.sourceforge.net/).
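To illustrate the kind of work the engine does, here is a minimal, hypothetical
sketch of archiving a single URL with the two tools named above. The URL and
output directory are placeholder values, not httruta's actual defaults, and
this is not httruta's own code:

```shell
#!/bin/sh
# Hypothetical sketch: archive one URL by mirroring it with httrack
# and rendering a printable PDF with wkhtmltopdf.
url="https://example.com/article"   # placeholder URL
outdir="./archive"                  # placeholder output directory
mkdir -p "$outdir"

# Mirror the page and its assets one level deep, if httrack is available.
if command -v httrack >/dev/null 2>&1; then
    httrack "$url" -O "$outdir/mirror" --depth=1 --quiet
fi

# Render a PDF copy of the same page, if wkhtmltopdf is available.
if command -v wkhtmltopdf >/dev/null 2>&1; then
    wkhtmltopdf --quiet "$url" "$outdir/page.pdf"
fi
```

httruta runs these steps for every link found in the feed rather than for a
single hand-picked URL.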
Installation
------------

    git clone https://git.fluxo.info/httruta
Dependencies
------------

Recommended:

    sudo apt install httrack wkhtmltopdf
Configuration
-------------

The default configuration is optimized for fetching everything newly added to
[Fluxo de Links](https://links.fluxo.info), but httruta can archive any other
website that offers an RSS feed. To customize it, copy `config.default` to
`config` and edit it to suit your needs.
Usage
-----

Place this script somewhere and set up a cronjob like this:

    */5 * * * * /var/sites/cache/httruta/httracker &> /dev/null
Alternatives
------------

- [pywb](https://github.com/webrecorder/pywb/)
- [wpull](https://github.com/chfoo/wpull)