HTTRUTA Crawler Project
=======================
Download all links from a feed using tools like
[httrack](http://www.httrack.com) and [wkhtmltopdf](https://wkhtmltopdf.org).
This is the engine behind the [Cache](https://cache.fluxo.info) feature used by
the [Semantic Scuttle](http://semanticscuttle.sourceforge.net/) instance known
as [Fluxo de Links](https://links.fluxo.info).
Installation
------------
    git clone https://git.fluxo.info/httruta
Dependencies
------------
Recommended:
    sudo apt install httrack wkhtmltopdf
Configuration
-------------
The default config is optimized for fetching everything newly added to [Fluxo
de Links](https://links.fluxo.info).
You can use httruta to archive any other website that offers an RSS feed. To
customize httruta, copy the file `config.default` to `config` and edit it to
suit your needs.
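A minimal sketch of that customization step. The `FEED_URL` option name and the
feed path are hypothetical placeholders; the real option names are whatever
ships in `config.default`:

```shell
set -e
mkdir -p httruta-demo && cd httruta-demo
# Stand-in for the shipped defaults; in a real checkout this file already exists.
printf 'FEED_URL=https://links.fluxo.info/rss\n' > config.default
# Copy the defaults, pointing the feed at your own site along the way:
sed 's|links\.fluxo\.info/rss|example.org/feed.xml|' config.default > config
cat config   # → FEED_URL=https://example.org/feed.xml
```

httruta keeps `config.default` under version control and ignores your local
`config`, so upgrades will not clobber your settings.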
Usage
-----
Install the script somewhere and set up a cron job like this:

    */5 * * * * /var/sites/cache/httruta/httracker &> /dev/null