
java - How to extract content from crawled URLs in the Heritrix crawler tool

Reposted. Author: 行者123. Updated: 2023-12-04 04:43:54

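I am new to the Heritrix tool. I can now crawl web pages from the web, and I would like to extract the content of the crawled URLs.

Can anyone please help? Thanks in advance.

Best answer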

1. As the root user, download the Python source: wget http://python.org/ftp/python/3.3.0/Python-3.3.0.tgz (or a later version).
2. Change to the Python directory, for example /opt/python3.3/.
3. Configure the build: ./configure --prefix=/opt/python3.3
4. make
5. sudo make install
6. Run the new interpreter: /opt/python3.3/bin/python3
7. Create a virtual environment: /opt/python3.3/bin/pyvenv ~/py33
8. Activate it: source ~/py33/bin/activate
9. wget http://python-distribute.org/distribute_setup.py
10. python distribute_setup.py
11. easy_install pip
12. pip install bottle
13. pip install warcat
14. Once warcat has installed, check that it works: python3 -m warcat --help. After pressing Enter you should see help for commands such as list, concat, and extract.
15. python3 -m warcat list example/at.warc.gz

This worked for me. Enjoy.
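
To make it clearer what listing or extracting a crawl archive involves, here is a minimal sketch in Python that reads WARC records using only the standard library. It is not warcat's implementation or API. The function names iter_warc_records and http_body are hypothetical, and the sketch assumes well-formed records with a Content-Length header, as Heritrix normally writes them. It does not undo chunked transfer encoding or compressed HTTP bodies.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


def http_body(payload):
    """Return the bytes that follow the HTTP header block of a 'response' record."""
    _, _, body = payload.partition(b"\r\n\r\n")
    return body


if __name__ == "__main__":
    # In-memory demo: one record per gzip member, the usual layout of .warc.gz files.
    block = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
    record = (
        b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"WARC-Target-URI: http://example.com/\r\n"
        b"Content-Length: " + str(len(block)).encode("ascii") + b"\r\n"
        b"\r\n" + block + b"\r\n\r\n"
    )
    data = gzip.compress(record) + gzip.compress(record)
    with gzip.GzipFile(fileobj=io.BytesIO(data), mode="rb") as stream:
        for headers, payload in iter_warc_records(stream):
            print(headers["WARC-Type"], headers["WARC-Target-URI"], http_body(payload))
```

For a file on disk, gzip.open("example/at.warc.gz", "rb") can stand in for the in-memory stream. For real crawls, warcat's extract command or a dedicated WARC library is likely the more robust choice.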

Regarding "java - How to extract content from crawled URLs in the Heritrix crawler tool", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18486121/
