
python - In scrapy, I use XPath to select HTML and get many unnecessary "" and commas?

Reposted · Author: 行者123 · Updated: 2023-11-28 22:33:58

I'm having trouble parsing http://so.gushiwen.org/view_20788.aspx

[Inspector screenshot]

This is what I want:

"detail_text": ["
寥落古行宫,宫花寂寞红。白头宫女在,闲坐说玄宗。
"],

But this is what I get:

"detail_text": ["
", "
", "
", "
", "
寥落古行宫,宫花寂寞红。", "白头宫女在,闲坐说玄宗。
"],

Here is my code:

# spider
import scrapy
from scrapy.selector import Selector

from tangshi.items import Tangshi3Item  # adjust to your project's items module


class Tangshi3Spide(scrapy.Spider):
    name = "tangshi3"
    allowed_domains = ["gushiwen.org"]
    start_urls = [
        "http://so.gushiwen.org/view_20788.aspx"
    ]

    def __init__(self):
        self.items = []

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="main3"]/div[@class="shileft"]')
        domain = 'http://so.gushiwen.org'
        for site in sites:
            item = Tangshi3Item()
            item['detail_title'] = site.xpath('div[@class="son1"]/h1/text()').extract()
            item['detail_dynasty'] = site.xpath(
                u'div[@class="son2"]/p/span[contains(text(),"朝代:")]/parent::p/text()').extract()
            item['detail_translate_note_url'] = site.xpath('div[@id="fanyiShort676"]/p/a/u/parent::a/@href').extract()
            item['detail_appreciation_url'] = site.xpath('div[@id="shangxiShort787"]/p/a/u/parent::a/@href').extract()
            item['detail_background_url'] = site.xpath('div[@id="shangxiShort24492"]/p/a/u/parent::a/@href').extract()
            # question line
            item['detail_text'] = site.xpath('div[@class="son2"]/text()').extract()
            self.items.append(item)
        return self.items



# pipeline
import codecs
import json


class Tangshi3Pipeline(object):
    def __init__(self):
        self.file = codecs.open('tangshi_detail.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese text readable in the output file;
        # the original line.decode("unicode_escape") only works on Python 2,
        # since Python 3 str has no decode() method.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + '\n')
        return item
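As an aside on the pipeline: `json.dumps` escapes non-ASCII characters to `\uXXXX` sequences by default, which is why the file needs extra decoding to stay readable. Passing `ensure_ascii=False` avoids that step entirely. A minimal standalone sketch (the sample dict is illustrative, not the spider's real item):

```python
import json

# Hypothetical item data containing CJK text.
poem = {"detail_text": "寥落古行宫,宫花寂寞红。"}

# Default behaviour: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(poem)

# ensure_ascii=False writes the characters through unchanged,
# so the UTF-8 output file is directly human-readable.
readable = json.dumps(poem, ensure_ascii=False)

print(escaped)   # contains \uXXXX escapes
print(readable)  # {"detail_text": "寥落古行宫,宫花寂寞红。"}
```

Both strings parse back to the same dict with `json.loads`; only the on-disk representation differs.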

How can I get the correct text?

Best Answer

You can add the predicate [normalize-space()] to avoid picking up empty text nodes, i.e. nodes containing only whitespace:

item['detail_text'] = site.xpath('div[@class="son2"]/text()[normalize-space()]').extract()
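If you prefer to clean up in Python instead of in XPath, the same result can be had by filtering the extracted list. A sketch, where the sample list mimics the whitespace-only text nodes `extract()` returns for this page:

```python
# Simulated output of site.xpath('div[@class="son2"]/text()').extract():
# whitespace-only text nodes interleaved with the actual verse fragments.
extracted = [
    "\n                ",
    "\n            ",
    "\n寥落古行宫,宫花寂寞红。",
    "白头宫女在,闲坐说玄宗。\n",
]

# Drop nodes that are pure whitespace, strip the rest, and join the pieces.
detail_text = "".join(t.strip() for t in extracted if t.strip())
print(detail_text)  # 寥落古行宫,宫花寂寞红。白头宫女在,闲坐说玄宗。
```

The XPath predicate is preferable because the unwanted nodes are never selected in the first place, but the post-processing form is handy when you also need to strip or join the fragments.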

Regarding "python - In scrapy, I use XPath to select HTML and get many unnecessary "" and commas?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39360013/
