python - Scrapy does not return the text body of a web page with my current syntax

Reposted by: 行者123 Updated: 2023-12-01 05:06:16

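I am using the 64-bit Python.org build of Python 2.7 on Windows Vista 64-bit. I have successfully used a recursive web scraper built with Scrapy to parse all the text from Wikipedia articles. However, when I apply the same code to the website referenced in the code below, it does not return any of the text body: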

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]
    #rules = [Rule(SgmlLinkExtractor(allow=()), 
                  #follow=True),
             #Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    #]
    #rules = [
        #Rule(
            #SgmlLinkExtractor(allow=('Regions/252/Tournaments/2',)), 
            #callback='parse_item',
            #follow=True,
        #)
    #]
    def parse_item(self,response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for scripts in scripts:
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')  


execute(['scrapy','crawl','goal3'])

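An example of a page I might want to look at is:

http://www.whoscored.com/Articles/pn4gahfw90kjwje-yx7ztq/Show/Player-Focus-Potential-Change-in-System-may-Convince-Vidal-to-Leave-Juventus

As I understand it, the code above should extract every text string found on the page and join them together. The HTML markup of the example page wraps its text in <p> tags, so I am not sure why this is not working. Can anyone see an obvious reason why all I get back with this code is the footer?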

Best answer

Things are a bit confused inside parse_item(). Here is a fixed version that gets the text from all paragraphs (p tags) and joins it together:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.markup import remove_tags


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

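    def parse_item(self,response):
        # Extract every <p> element on the page as an HTML string.
        paragraphs = response.selector.xpath("//p").extract()
        # Strip the tags from each paragraph, encode it as UTF-8 and join the results.
        text = "".join(remove_tags(paragraph).encode('utf-8') for paragraph in paragraphs)
        print text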

For this page it prints:

"There is no budget, there is money. We are in a very strong financial position. We can make big signings." Music to the ears of Manchester United fans as vice-chairman Ed Woodward confirmed the club can make big-money acquisitions in this very transfer window. In a bid to return to the summit of England’s top tier, Woodward has effectively given the green light to a spending spree that has supporters rubbing their hands with glee. Ander Herrara and Luke Shaw have arrived for a combined £59m already this summer and the carousel through the Old Trafford entrance door shows no sign of slowing down. Ángel Di María, Mats Hummels and Daley Blind, amongst others, have all been linked with a move to United, while reports suggesting midfield pitbull Arturo Vidal is set to join Louis van Gaal’s side refuse to die down.  "I’m still on holiday at the moment. Can I say I’m staying at Juve? I don’t know. On Monday I’ll talk to (Juventus manager, Massimili
...
 Contact Us | About Us | Glossary | Privacy Policy | WhoScored Ratings
            Copyright © 2014 WhoScored.com
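
The idea behind the fixed parse_item() (take every <p> element, drop the markup inside it, and join what is left) can also be tried without Scrapy. The following is only a minimal sketch under stated assumptions: it swaps Scrapy's selector and remove_tags for the Python 3 standard library's html.parser, and the names ParagraphTextExtractor, paragraph_text and sample are hypothetical, introduced here for illustration. It assumes every <p> has a matching </p>.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collect the text found inside <p> elements, ignoring nested tags.

    Hypothetical helper for illustration; assumes each <p> is closed by </p>.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0        # number of <p> elements currently open
        self.paragraphs = []  # one text string per completed <p>
        self._chunks = []     # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost <p> just closed: store its text.
                self.paragraphs.append("".join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        # Keep text only while inside a <p>; tags such as <b> never reach here.
        if self.depth > 0:
            self._chunks.append(data)


def paragraph_text(html):
    """Return the text of all <p> elements in html, joined into one string."""
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.paragraphs)


sample = (
    "<html><head><title>Demo</title></head><body>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>Not in a paragraph.</div>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)
print(paragraph_text(sample))
```

As in the answer's code, the paragraphs are joined with an empty string, so text from adjacent paragraphs runs together; joining with "\n" instead would keep them on separate lines.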

Regarding python - Scrapy does not return the text body of a web page with my current syntax, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/24966296/
