
python - Can Scrapy get plain text from raw HTML data?

Reposted. Author: 技术小花猫. Updated: 2023-10-29 12:17:19

For example:

scrapy shell http://scrapy.org/
content = hxs.select('//*[@id="content"]').extract()[0]
print content

Then I get the following raw HTML:

<div id="content">


<h2>Welcome to Scrapy</h2>

<h3>What is Scrapy?</h3>

<p>Scrapy is a fast high-level screen scraping and web crawling
framework, used to crawl websites and extract structured data from their
pages. It can be used for a wide range of purposes, from data mining to
monitoring and automated testing.</p>

<h3>Features</h3>

<dl>

<dt>Simple</dt>
<dt>
</dt>
<dd>Scrapy was designed with simplicity in mind, by providing the features
you need without getting in your way
</dd>

<dt>Productive</dt>
<dd>Just write the rules to extract the data from web pages and let Scrapy
crawl the entire web site for you
</dd>

<dt>Fast</dt>
<dd>Scrapy is used in production crawlers to completely scrape more than
500 retailer sites daily, all in one server
</dd>

<dt>Extensible</dt>
<dd>Scrapy was designed with extensibility in mind and so it provides
several mechanisms to plug new code without having to touch the framework
core

</dd>
<dt>Portable, open-source, 100% Python</dt>
<dd>Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD</dd>

<dt>Batteries included</dt>
<dd>Scrapy comes with lots of functionality built in. Check <a
href="http://doc.scrapy.org/en/latest/intro/overview.html#what-else">this
section</a> of the documentation for a list of them.
</dd>

<dt>Well-documented &amp; well-tested</dt>
<dd>Scrapy is <a href="/doc/">extensively documented</a> and has an comprehensive test suite
with <a href="http://static.scrapy.org/coverage-report/">very good code
coverage</a></dd>

<dt><a href="/community">Healthy community</a></dt>
<dd>
1,500 watchers, 350 forks on Github (<a href="https://github.com/scrapy/scrapy">link</a>)<br>
700 followers on Twitter (<a href="http://twitter.com/ScrapyProject">link</a>)<br>
850 questions on StackOverflow (<a href="http://stackoverflow.com/tags/scrapy/info">link</a>)<br>
200 messages per month on mailing list (<a
href="https://groups.google.com/forum/?fromgroups#!aboutgroup/scrapy-users">link</a>)<br>
40-50 users always connected to IRC channel (<a href="http://webchat.freenode.net/?channels=scrapy">link</a>)
</dd>

<dt><a href="/support">Commercial support</a></dt>
<dd>A few companies provide Scrapy consulting and support</dd>

<p>Still not sure if Scrapy is what you're looking for?. Check out <a
href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a
glance</a>.

</p>
<h3>Companies using Scrapy</h3>

<p>Scrapy is being used in large production environments, to crawl
thousands of sites daily. Here is a list of <a href="/companies/">Companies
using Scrapy</a>.</p>

<h3>Where to start?</h3>

<p>Start by reading <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a glance</a>,
then <a href="/download/">download Scrapy</a> and follow the <a
href="http://doc.scrapy.org/en/latest/intro/tutorial.html">Tutorial</a>.


</p></dl>
</div>

But I want to get plain text directly from Scrapy.

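I don't want to use XPath selectors to extract the p, h2, h3... tags one by one, because I am crawling a website whose main content is embedded in table and tbody elements, nested recursively. Finding the right XPath for each case can be tedious.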

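Can this be done with a built-in function in Scrapy, or do I need an external tool to do the conversion? I have read through all of Scrapy's documentation and found nothing.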

Here is an example site that converts raw HTML to plain text: http://beaker.mailchimp.com/html-to-text

Best answer

Scrapy has no such functionality built in. html2text is exactly what you are looking for.
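If adding a dependency is not an option, the same idea can be roughly approximated with Python's standard library. The sketch below is not part of the original answer and is not how html2text works internally. It is a minimal illustration that walks the markup with html.parser.HTMLParser, keeps every text node however deeply it is nested (including inside table/tbody), and skips script and style content. The names TextExtractor and html_to_text and the list of block-level tags are assumptions chosen for this example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes; hypothetical helper, not a Scrapy or html2text API."""

    # Tags that should start/end a line in the output (an illustrative choice).
    BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6",
                  "li", "dt", "dd", "tr", "td", "th", "table"}
    # Tags whose content is never shown to a reader.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &amp; arrives as "&"
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            # Source line breaks inside a paragraph become single spaces.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    """Return the visible text of an HTML fragment, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.parts).split("\n")
    cleaned = (re.sub(r" +", " ", line).strip() for line in lines)
    return "\n".join(line for line in cleaned if line)
```

In a spider callback, html_to_text could be called on the string returned by .extract()[0], so no per-tag XPath is needed. Unlike html2text, this sketch drops links and emphasis entirely and makes no attempt to lay out tables or lists.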

Here is a sample spider that scrapes Wikipedia's Python page, uses XPath to get the first paragraph, and uses html2text to convert the HTML to plain text:

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
import html2text


class WikiSpider(BaseSpider):
    name = "wiki_spider"
    allowed_domains = ["www.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sample = hxs.select("//div[@id='mw-content-text']/p[1]").extract()[0]

        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print(converter.handle(sample))  # Python 3 print syntax

This prints:

**Python** is a widely used general-purpose, high-level programming language.[11][12][13] Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C.[14][15] The language provides constructs intended to enable clear programs on both a small and large scale.[16]
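The output above is Markdown rather than bare text: html2text keeps the bold markers around "Python", and Wikipedia's footnote markers such as [11] survive the conversion. If strictly plain text is wanted, a small post-processing pass might help. The following is only a sketch under that assumption; the function name strip_markdown_residue and both regular expressions are illustrative and handle just the two patterns visible in the sample output.

```python
import re


def strip_markdown_residue(text):
    """Remove [n] footnote markers and **bold** markers (hypothetical helper)."""
    # Numeric footnote markers such as [11] or [12].
    text = re.sub(r"\[\d+\]", "", text)
    # Markdown bold: keep the inner text, drop the asterisks.
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    return text
```

It would be applied to the string returned by converter.handle(sample) before printing. Note that the footnote pattern also removes any genuine bracketed numbers in the text, which may or may not be acceptable for a given site.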

Regarding "python - Can Scrapy get plain text from raw HTML data?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/17721782/
