
python - How do I extract text from a variable using Scrapy?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:57:33

I am using Scrapy to scrape a business directory and have run into a problem when trying to extract data using a variable. Here is the code:

    def parse_page(self, response):
        url = response.meta.get('URL')

        # Parse the locations area of the page
        locations = response.css('address::text').extract()
        # Takes the City and Province and removes unicode and removes whitespace,
        # they are still together though.
        city_province = locations[1].replace(u'\xa0', u' ').strip()
        # List of all social links that the business has
        social = response.css('.entry-content > div:nth-child(2) a::attr(href)').extract()

        add_info = response.css('ul.list-border li').extract()
        year = ""

        for info in add_info:
            if 'Year' in info:
                year = info
            else:
                pass

        yield {
            'title': response.css('h1.entry-title::text').extract_first().strip(),
            'description': response.css('p.mb-double::text').extract_first(),
            'phone_number': response.css('div.mb-double ul li::text').extract_first(default="").strip(),
            'email': response.css('div.mb-double ul li a::text').extract_first(default=""),
            'address': locations[0].strip(),
            'city': city_province.split(' ', 1)[0].replace(',', ''),
            'province': city_province.split(' ', 1)[1].replace(',', '').strip(),
            'zip_code': locations[2].strip(),
            'website': response.css('.entry-content > div:nth-child(2) > ul:nth-child(2) > li:nth-child(1) > a:nth-child(1)::attr(href)').extract_first(default=''),
            'facebook': response.css('.entry-content > div:nth-child(2) > ul:nth-child(2) > li:nth-child(2) > a:nth-child(1)::attr(href)').extract_first(default=''),
            'twitter': response.css('.entry-content > div:nth-child(2) > ul:nth-child(2) > li:nth-child(3) > a:nth-child(1)::attr(href)').extract_first(default=''),
            'linkedin': response.css('.entry-content > div:nth-child(2) > ul:nth-child(2) > li:nth-child(4) > a:nth-child(1)::attr(href)').extract_first(default=''),
            'year': year,
            'employees': response.css('.list-border > li:nth-child(2)::text').extract_first(default="").strip(),
            'key_contact': response.css('.list-border > li:nth-child(3)::text').extract_first(default="").strip(),
            'naics': response.css('.list-border > li:nth-child(4)::text').extract_first(default="").strip(),
            'tags': response.css('ul.biz-tags li a::text').extract(),
        }

The problem I am having comes from this part:

    add_info = response.css('ul.list-border li').extract()
    year = ""

    for info in add_info:
        if 'Year' in info:
            year = info
        else:
            pass

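The code checks whether info contains "Year Established". However, it returns HTML, and I am trying to get it to print just the year. add_info = response.css('ul.list-border li::text').extract() would print the year, but how do I do that inside the for loop?

Whenever "Year" is in info, it outputs this: <li><span>Year Established:</span> 1998</li>. I want only the year, not the HTML.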

Best answer

Add the following function.

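    def getYear(yearnum):
        # Drop the first 35 characters, i.e. '<li><span>Year Established:</span> '
        yearnum1 = str(yearnum[35:])
        # Keep the next 4 characters, which are the year
        yearnum2 = str(yearnum1[:4])
        return yearnum2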
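
The offset 35 is not arbitrary: it appears to be the length of the markup that precedes the year in the sample string from the question. A quick check, assuming the list item is exactly that string (the variable names here are illustrative only):

```python
# Sample list item copied from the question.
sample = '<li><span>Year Established:</span> 1998</li>'

# Everything before the year: tags, label, and the single space.
prefix = '<li><span>Year Established:</span> '

offset = len(prefix)               # number of characters to skip
year = sample[offset:offset + 4]   # the four characters after the prefix

print(offset, year)
```

If the markup differs at all (an extra attribute, a different label), the offset shifts and the slice returns the wrong characters.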

Then replace the for statement with the following.

    for info in add_info:
        if 'Year' in info:
            yearanswer = getYear(info)
        else:
            pass

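This takes the 4-digit year out of the long string and stores it in the string yearanswer. If you print yearanswer, it should print 1998. It worked for me! Since the yielded item reads the variable year, either assign the result to year or yield yearanswer instead.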
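
Because a fixed offset depends on the exact markup, a pattern match may hold up better if the label or tags vary. The following is a minimal sketch, assuming the year is always a four-digit number right after the closing span tag. extract_year is a hypothetical helper, not part of Scrapy, and the add_info list is made-up stand-in data shaped like the question's sample.

```python
import re

# Hypothetical helper pattern: a four-digit number after the closing </span>.
# Assumes the markup looks like the sample in the question.
YEAR_RE = re.compile(r'</span>\s*(\d{4})\b')


def extract_year(html_item):
    """Return the four-digit year as a string, or '' if none is found."""
    match = YEAR_RE.search(html_item)
    return match.group(1) if match else ''


# Stand-in for response.css('ul.list-border li').extract()
add_info = [
    '<li><span>Employees:</span> 12</li>',
    '<li><span>Year Established:</span> 1998</li>',
]

year = ''
for info in add_info:
    if 'Year' in info:
        year = extract_year(info)

print(year)
```

Inside a Scrapy callback, another option would be to loop over the li selectors themselves instead of the extracted strings, and read the matching item's own text node with li.xpath('./text()').extract_first(), stripping the surrounding whitespace. That avoids parsing the HTML by hand.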

Regarding "python - How do I extract text from a variable using Scrapy?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/45404474/
