python - Scrapy Spider scrapes part of the content and leaves out the rest

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:57:57

I defined a Scrapy spider that scrapes all of the names and some of the stories, but the XPath I defined fails to capture the remaining stories, from https://www.cancercarenorthwest.com/survivor-stories:

# -*- coding: utf-8 -*-

import scrapy
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.selector import XmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from cancerstories.items import CancerstoriesItem

class LungcancerSpider(CrawlSpider):
    name = "lungcancer"
    allowed_domains = ["coloncancercoalition.org"]
    start_urls = (
        'http://www.coloncancercoalition.org/community/stories/survivor-stories/',
    )
    rules = (
        Rule(SgmlLinkExtractor(allow=[r'http://www.coloncancercoalition.org/\d+/\d+/\d+/\w+']),callback='parse_page',follow=True),
    )

    def parse_page(self, response):
        Li = ItemLoader(item=CancerstoriesItem(),response=response)
        Li.add_xpath('name', '/html/body/div[4]/div[1]/div[1]/div/h1/text()')
        Li.add_xpath('story','//../div/div/p/text()')

        yield Li.load_item()

Best answer

I think you need to join the text of all the paragraphs under the post content:

Li.add_xpath('story', '//div[@class="post-content"]/div/p/text()', Join(" "))

where the Join() output processor is imported as:

from scrapy.loader.processors import Join
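
To make the effect of joining concrete, here is a minimal standard-library sketch that does not use Scrapy at all. It assumes a hypothetical page fragment (`SAMPLE_PAGE`) with a `post-content` div, collects the text of each `<p>` under it, and concatenates the pieces with a separator, which is essentially what `Join(" ")` does with the list of values the loader collects. The helper name `extract_story` and the sample markup are illustrative assumptions, not part of the original spider or the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for one story page; the real page may differ.
SAMPLE_PAGE = """
<html><body>
  <div class="post-content">
    <div>
      <p>First paragraph of the story.</p>
      <p>Second paragraph of the story.</p>
    </div>
  </div>
</body></html>
"""


def extract_story(page: str, separator: str = " ") -> str:
    """Collect the text of every <p> under the post-content div and join it."""
    root = ET.fromstring(page.strip())
    # ElementTree supports a limited XPath subset, enough for this path.
    paragraphs = root.findall(".//div[@class='post-content']/div/p")
    # Simplification: p.text is only the text before any child element,
    # whereas XPath text() in Scrapy would return every direct text node.
    texts = [p.text.strip() for p in paragraphs if p.text]
    # One string per story instead of a list with one entry per paragraph.
    return separator.join(texts)


if __name__ == "__main__":
    print(extract_story(SAMPLE_PAGE))
```

Without the join step, the loader would hold one separate string per paragraph, so the `story` field would end up as a list of fragments. In recent Scrapy releases the same processor should also be importable from `itemloaders.processors`, though that depends on the installed version.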

Regarding "python - Scrapy Spider scrapes part of the content and leaves out the rest", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35261470/
