
python - Scrapy is not returning the text body of web pages with my current syntax

Reposted. Author: 行者123. Updated: 2023-12-01 05:06:16

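I am using the Python.org build of Python 2.7 (64-bit) on Windows Vista 64-bit. I have successfully used a recursive web scraper built with Scrapy to parse all the text from Wikipedia articles. However, when I apply the same code to the website referenced in the code below, it does not return any of the text body: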

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]
    #rules = [Rule(SgmlLinkExtractor(allow=()),
    #follow=True),
    #Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    #]
    #rules = [
    #Rule(
    #SgmlLinkExtractor(allow=('Regions/252/Tournaments/2',)),
    #callback='parse_item',
    #follow=True,
    #)
    #]
    def parse_item(self,response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for scripts in scripts:
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')


execute(['scrapy','crawl','goal3'])

An example of a page I might want to look at is this one:

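http://www.whoscored.com/Articles/pn4gahfw90kjwje-yx7ztq/Show/Player-Focus-Potential-Change-in-System-may-Convince-Vidal-to-Leave-Juventus

As I understand it, the code above should extract any text strings found on the page and concatenate them. The HTML markup of the example page above wraps its text in <p> tags, so I'm not sure why this isn't working. Can anyone see an obvious reason why all I get back with this code is the footer?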

Best answer

Things are a bit confused inside parse_item(). Here is a fixed version that takes the text from every paragraph (p tag) and joins it together:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.markup import remove_tags


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

    def parse_item(self,response):
        paragraphs = response.selector.xpath("//p").extract()
        text = "".join(remove_tags(paragraph).encode('utf-8') for paragraph in paragraphs)
        print text

For this page it prints:

"There is no budget, there is money. We are in a very strong financial position. We can make big signings." Music to the ears of Manchester United fans as vice-chairman Ed Woodward confirmed the club can make big-money acquisitions in this very transfer window. In a bid to return to the summit of England’s top tier, Woodward has effectively given the green light to a spending spree that has supporters rubbing their hands with glee. Ander Herrara and Luke Shaw have arrived for a combined £59m already this summer and the carousel through the Old Trafford entrance door shows no sign of slowing down. Ángel Di María, Mats Hummels and Daley Blind, amongst others, have all been linked with a move to United, while reports suggesting midfield pitbull Arturo Vidal is set to join Louis van Gaal’s side refuse to die down.  "I’m still on holiday at the moment. Can I say I’m staying at Juve? I don’t know. On Monday I’ll talk to (Juventus manager, Massimili
...
Contact Us | About Us | Glossary | Privacy Policy | WhoScored Ratings
Copyright © 2014 WhoScored.com
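
The extract, strip and join steps performed in parse_item() can be tried on a small HTML string without running a crawl. The following is only a minimal sketch in Python 3 using the standard library's html.parser rather than Scrapy's selectors and remove_tags; the names ParagraphTextParser, paragraph_text and SAMPLE_HTML are hypothetical and do not come from the question or the answer.

```python
from html.parser import HTMLParser


class ParagraphTextParser(HTMLParser):
    """Collect the text of every <p> element, dropping nested tags.

    Hypothetical helper: a stdlib stand-in for xpath('//p') + remove_tags.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraphs, in document order
        self._in_p = False     # True while inside a <p> element
        self._chunks = []      # text pieces of the paragraph being read

    def _flush(self):
        # Store the paragraph read so far, if there is one.
        if self._in_p:
            self.paragraphs.append("".join(self._chunks))
            self._chunks = []
            self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # a new <p> implicitly closes an open one
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

    def close(self):
        super().close()
        self._flush()          # keep a trailing paragraph with no </p>


def paragraph_text(html, separator="\n"):
    """Return the text of all <p> elements joined by `separator`."""
    parser = ParagraphTextParser()
    parser.feed(html)
    parser.close()
    return separator.join(parser.paragraphs)


# Made-up sample markup, not taken from whoscored.com.
SAMPLE_HTML = (
    "<html><head><title>Demo</title></head><body>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>Not in a paragraph.</div>"
    "<p>Second paragraph with a <a href='#'>link</a>.</p>"
    "</body></html>"
)

if __name__ == "__main__":
    print(paragraph_text(SAMPLE_HTML))
```

Passing separator="" mirrors the "".join(...) used in the answer, which runs consecutive paragraphs together; a newline separator keeps them apart, which may be easier to read when a page has many paragraphs.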

This post on "python - Scrapy is not returning the text body of web pages with my current syntax" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/24966296/
