
python - Scrapy: Extracting from multiple elements with encoding and POSTing as a JSON array

Reposted. Author: 行者123. Updated: 2023-11-30 23:20:48

I'm scraping a weather site and need to extract the comments from a table cell and POST them to a remote API as a JSON array.

Here is the markup:

<td>
<p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
<p>Temperature is very warm (> 60 degrees C / 140 degrees F).</p>
<p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
</td>

Here is the code I'm using:

comments = []
cmnts = sel.xpath('td//p/text()').extract()

for cmnt in cmnts:
    comments.append(cmnt)

item['comments'] = comments

r = requests.post(api_url, data = json.dumps(dict(item)))

This sort of works, but the output contains a lot of "\r\n" strings, and anything after a "<" sign is dropped. Here is the output of the code above:

[
"Temperature is cold (\r\n \r\n ",
"Temperature is very warm (> 60 degrees C / 140 degrees F).",
"Temperature is cold (\r\n \r\n "
]

Any ideas on how to get a "clean" (i.e. no line returns) and "encoded" array of results?

Best answer

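As @alecxe suggested in the comments above, lxml's default parser does not seem to handle this HTML input well. The solution is to parse it with a more lenient parser, such as BeautifulSoup or html5lib.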
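The underlying issue is the bare "<" inside the text: in clean markup it would normally be written as the entity "&lt;", which any parser reads back as a literal "<". As a small illustration using only Python's standard html module (the sample string is taken from the question; the variable names are illustrative, not part of the original code):

```python
import html

raw = "Temperature is cold (< 4 degrees C / 40 degrees F)."

# Escape the characters that are special in HTML text content (&, <, >).
escaped = html.escape(raw, quote=False)
print(escaped)

# Unescaping restores the original text unchanged.
restored = html.unescape(escaped)
print(restored == raw)
```

Since the site's markup is not under your control, this only explains why the input trips up a strict parser; the fix still has to happen on the parsing side.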

lxml can actually use different parsers while still giving you the same XPath methods.

Using the BeautifulSoup parser:

In [1]: from lxml.html import soupparser, html5parser

In [2]: html = """<td>
<p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
<p>Temperature is very warm (> 60 degrees C / 140 degrees F).</p>
<p>Temperature is cold (< 4 degrees C / 40 degrees F).</p>
</td>
"""

In [3]: doc = soupparser.fromstring(html)

In [4]: for p in doc.xpath('//p'):
   ...:     print(p.xpath('normalize-space()'))
   ...:
Temperature is cold (< 4 degrees C / 40 degrees F).
Temperature is very warm (> 60 degrees C / 140 degrees F).
Temperature is cold (< 4 degrees C / 40 degrees F).

Using the html5lib parser (you have to add the XHTML namespace in your XPath calls):

In [5]: doc = html5parser.fromstring(html)

In [6]: for p in doc.xpath('//xhtml:p', namespaces={"xhtml": "http://www.w3.org/1999/xhtml"}):
   ...:     print(p.xpath('normalize-space()'))
   ...:
Temperature is cold (< 4 degrees C / 40 degrees F).
Temperature is very warm (> 60 degrees C / 140 degrees F).
Temperature is cold (< 4 degrees C / 40 degrees F).

In [7]:
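In both sessions, normalize-space() is what gets rid of the "\r\n" runs: it trims leading and trailing whitespace and collapses each internal run into a single space. A rough pure-Python equivalent, as a sketch only (the helper name normalize_space and the messy sample string are made up for illustration, and str.split() treats a somewhat wider set of Unicode whitespace as separators than XPath does):

```python
def normalize_space(text):
    """Collapse whitespace runs to single spaces and trim both ends."""
    # str.split() with no arguments splits on any run of whitespace
    # and discards empty leading/trailing pieces.
    return " ".join(text.split())


# Hypothetical input resembling text extracted from a messy table cell.
messy = "  Temperature is cold\r\n   (< 4 degrees C /\r\n 40 degrees F).  "
cleaned = normalize_space(messy)
print(cleaned)
```

This can be handy if you already have the extracted strings and only need to clean them up, though it does not recover text that the parser dropped.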

Your Scrapy callback code would become:

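# module-level imports needed by the snippet
import json
import requests
from lxml.html import soupparser

doc = soupparser.fromstring(response.body)

comments = []
# '//td//p' searches the whole document; a bare 'td//p' would only match
# <td> elements that are direct children of the root element
cmnts = doc.xpath('//td//p')

for cmnt in cmnts:
    # normalize-space() trims the text and collapses whitespace runs
    comments.append(cmnt.xpath('normalize-space(.)'))

item['comments'] = comments

r = requests.post(api_url, data=json.dumps(dict(item)))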
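To check what will be sent before involving the remote API, the JSON body can be built and inspected locally. A minimal sketch, assuming the item behaves like a plain dict with a comments list (the sample values are the cleaned strings from the sessions above; item and payload here are illustrative stand-ins, not the real Scrapy item):

```python
import json

# Stand-in for dict(item) after the comments have been cleaned.
item = {
    "comments": [
        "Temperature is cold (< 4 degrees C / 40 degrees F).",
        "Temperature is very warm (> 60 degrees C / 140 degrees F).",
    ]
}

# json.dumps leaves "<" and ">" as-is; only characters such as quotes,
# backslashes and control characters are escaped.
payload = json.dumps(item)
print(payload)

# Round trip: decoding the payload gives back the same structure.
decoded = json.loads(payload)
print(decoded == item)
```

Note that data=json.dumps(...) sends the body without a JSON Content-Type header. Reasonably recent versions of requests also accept a json= argument to requests.post, which serializes the object and sets that header for you.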

Regarding python - Scrapy: Extracting from multiple elements with encoding and POSTing as a JSON array, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25252718/
