
python - In Scrapy, how can two groups in a regular expression be extracted into two different fields?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:33:35

I'm writing a spider, trulia, to scrape the pages of properties for sale on Trulia.com, for example https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123 ; the current version can be found at https://github.com/khpeek/trulia-scraper .

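I'm using Item Loaders and calling the add_xpath method with the re keyword argument to specify a regular expression to extract. In the example in the documentation, the regular expression contains only one group, and there is only one field to extract it into.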

However, I actually want to define two groups and extract them into two separate Scrapy fields. Here is an excerpt from the parse_property_page method:

    def parse_property_page(self, response):
        l = TruliaItemLoader(item=TruliaItem(), response=response)

        details = l.nested_css('.homeDetailsHeading')
        overview = details.nested_xpath('.//span[contains(text(), "Overview")]/parent::div/following-sibling::div[1]')
        overview.add_xpath('overview', xpath='.//li/text()')
        overview.add_xpath('area', xpath='.//li/text()', re=r'([\d,]+) sqft$')
        overview.add_xpath('lot_size', xpath='.//li/text()', re=r'([\d,]+) (acres|sqft) lot size$')

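Note how the lot_size field extracts two groups: one for the number and one for the units, which can be either "acres" or "sqft". If I run this parse method with the command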
scrapy parse https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123 --spider=trulia --callback=parse_property_page

I get the following scraped item:

# Scraped Items  ------------------------------------------------------------
[{'address': '1860 Lombard St',
  'area': 2524.0,
  'city_state': 'San Francisco, CA 94123',
  'dates': ['10/22/2002', '04/25/2002', '03/20/2000'],
  'description': ['Outstanding investment opportunity to own this light-fixer '
                  'mixed use Marina 2-unit property w/established income and '
                  'not on liquefaction. The first floor of this building '
                  'houses a commercial business currently leased to Jigalin '
                  'Fitness until 2018. The second floor presents a 2bed/1bath '
                  'apartment fully outfitted in a contemporary design w/full '
                  'kitchen, 10ft high ceilings & laundry area. The apartment '
                  'will be delivered vacant. The structure has undergone '
                  'renovation & features concrete perimeter foundation, '
                  'reinforced walls, ADA compliant commercial restroom, '
                  'electrical updates & rolling door. This property makes an '
                  "ideal investment with instant cash flow. Don't let this "
                  'pass you by. As-Is sale.'],
  'events': ['Sold', 'Sold', 'Sold'],
  'listing_information': ['2 Bedrooms', 'Multi-Family'],
  'listing_information_date_updated': '11/03/2017',
  'lot_size': ['1620', 'sqft'],
  'neighborhood': 'Marina',
  'overview': ['Multi-Family',
               '2 Beds',
               'Built in 1908',
               '1 days on Trulia',
               '1620 sqft lot size',
               '2,524 sqft',
               '$711/sqft'],
  'prices': ['$850,000', '$1,350,000', '$1,200,000'],
  'public_records': ['1 Bathroom',
                     'Multi-Family',
                     '1,296 Square Feet',
                     'Lot Size: 1,620 sqft'],
  'public_records_date_updated': '07/01/2017',
  'url': 'https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123'}]

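Here the lot_size field is a list containing the number and the units. Ideally, however, I would like to extract the units (acres or sqft) into a separate field, lot_size_units. I could do this by loading the item first and doing my own processing, but I was wondering whether there is a more Scrapy-native way to "unpack" the matched groups into different fields?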

(I have perused the get_value method at https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/loader/__init__.py , but this hasn't "shown me the way" yet, if there is one.)
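
For reference, the manual post-processing mentioned in the question might look roughly like the sketch below. It uses only the standard library and a plain dict in place of a loaded item; the helper name `split_lot_size` is hypothetical and is not part of Scrapy or of the trulia-scraper repository.

```python
def split_lot_size(item):
    """Move the unit out of a two-element 'lot_size' list into 'lot_size_units'.

    `item` is assumed to be dict-like, with 'lot_size' shaped like
    ['1620', 'sqft'] as in the scraped output above.
    """
    result = dict(item)  # work on a copy so the loaded item is not mutated
    value = result.get('lot_size')
    if isinstance(value, (list, tuple)) and len(value) == 2:
        number, units = value
        # Strip thousands separators before converting, e.g. '1,620' -> 1620.0
        result['lot_size'] = float(number.replace(',', ''))
        result['lot_size_units'] = units
    return result


loaded = {'lot_size': ['1620', 'sqft'], 'area': 2524.0}
print(split_lot_size(loaded))
```

This works, but it runs outside the loader, which is why the question asks for something more Scrapy-native.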

Best answer

You can try this (ignoring one group at a time):

overview.add_xpath('lot_size', xpath='.//li/text()', re=r'([\d,]+) (?:acres|sqft) lot size$')
overview.add_xpath('lot_size_units', xpath='.//li/text()', re=r'(?:[\d,]+) (acres|sqft) lot size$')
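
To see why ignoring one group at a time helps, the two patterns can be checked against the `overview` strings from the scraped item using only the standard `re` module. This is a minimal sketch that does not call Scrapy; the helper `first_group` is a hypothetical stand-in that only illustrates how a `(?:...)` non-capturing group leaves a single captured value per field.

```python
import re

# The 'overview' strings from the scraped item in the question.
overview = ['Multi-Family', '2 Beds', 'Built in 1908', '1 days on Trulia',
            '1620 sqft lot size', '2,524 sqft', '$711/sqft']

# Same patterns as in the answer: each has exactly one capturing group.
SIZE_RE = r'([\d,]+) (?:acres|sqft) lot size$'
UNITS_RE = r'(?:[\d,]+) (acres|sqft) lot size$'


def first_group(pattern, texts):
    """Return group 1 for every text that the pattern matches."""
    matches = (re.search(pattern, text) for text in texts)
    return [m.group(1) for m in matches if m]


lot_size = first_group(SIZE_RE, overview)
lot_size_units = first_group(UNITS_RE, overview)
print(lot_size, lot_size_units)
```

Each pattern matches only the "lot size" line and captures one value, so each field receives a single string rather than a two-element list. The cost is that the same XPath is evaluated twice.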

Regarding "python - In Scrapy, how can two groups in a regular expression be extracted into two different fields?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/47115511/
