
python - Scrapy: Repeating Response.URL in every record


The following Scrapy CrawlSpider works fine, except for the output of the URL (response.url):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Spider2(CrawlSpider):
    # name of the spider
    name = 'newstl'

    # list of allowed domains
    allowed_domains = ['graphics.stltoday.com']

    # starting url for scraping
    start_urls = ['http://graphics.stltoday.com/apps/payrolls/salaries/agencies/']

    rules = [
        Rule(LinkExtractor(
            allow=['/apps/payrolls/salaries/.*/$']),
            callback='parse_item',
            follow=True),
    ]

    # setting the location of the output csv file
    custom_settings = {
        'FEED_FORMAT': "csv",
        'FEED_URI': 'tmp/stltoday1.csv'
    }

    def parse_item(self, response):
        # Remove XML namespaces
        response.selector.remove_namespaces()

        # Extract article information
        name = response.xpath('//th[@scope="row"]/text()').extract()
        position = response.xpath('//th[@scope="row"]/following-sibling::*[1]/text()').extract()
        salary = response.xpath('//th[@scope="row"]/following-sibling::*[2]/text()').extract()
        hiredate = response.xpath('//th[@scope="row"]/following-sibling::*[3]/text()').extract()
        url = response.url

        for item in zip(name, position, salary, hiredate, url):
            scraped_info = {
                'url': item[4],
                'name': item[0],
                'position': item[1],
                'salary': item[2],
                'hiredate': item[3]
            }
            yield scraped_info

The output shows only 1 character of the URL in each row of the CSV. Is there any way to make it repeat the entire URL for each record?
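The cause is that zip treats a string as a sequence of characters: since url is a plain string rather than a list, each record gets paired with a single character of the URL. A minimal sketch with made-up sample data illustrates the behavior:

    names = ['Alice', 'Bob']           # hypothetical scraped values
    url = 'http://example.com/page'    # a plain string, not a list

    # zip iterates the string character by character
    # and stops at the shortest iterable
    for item in zip(names, url):
        print(item)
    # ('Alice', 'h')
    # ('Bob', 't')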

Best Answer

You should not zip the url; just set it directly:

url = response.url
for item in zip(name, position, salary, hiredate):
    yield {
        'url': url,
        'name': item[0],
        'position': item[1],
        'salary': item[2],
        'hiredate': item[3]
    }

And, instead of traversing the whole tree multiple times, iterate over the result rows and get the desired information from the context of each row:

for row in response.xpath('//th[@scope="row"]'):
    yield {
        "url": url,
        "name": row.xpath('./text()').extract_first(),
        "position": row.xpath('./following-sibling::*[1]/text()').extract_first(),
        "salary": row.xpath('./following-sibling::*[2]/text()').extract_first(),
        "hiredate": row.xpath('./following-sibling::*[3]/text()').extract_first(),
    }
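Put together, parse_item would then look like the sketch below. It simply combines the two snippets above, assuming the same page structure as the original spider; each th with scope="row" starts one record, and the full URL is repeated on every row:

    def parse_item(self, response):
        # Remove XML namespaces so the XPath expressions match
        response.selector.remove_namespaces()

        url = response.url
        # single pass over the table instead of four separate tree traversals
        for row in response.xpath('//th[@scope="row"]'):
            yield {
                'url': url,
                'name': row.xpath('./text()').extract_first(),
                'position': row.xpath('./following-sibling::*[1]/text()').extract_first(),
                'salary': row.xpath('./following-sibling::*[2]/text()').extract_first(),
                'hiredate': row.xpath('./following-sibling::*[3]/text()').extract_first(),
            }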

Regarding python - Scrapy: Repeating Response.URL in every record, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45426511/
