
python - Scraping tables with different element positions on each page

Reposted. Author: 行者123. Updated: 2023-11-28 22:26:44

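Setup

I am using Scrapy to scrape housing ads, and then pandas to analyze the data.

For each housing ad I collect the house's features, such as 'size', 'number of rooms', and so on. These features are then stored in a dictionary.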


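Problem

The housing ads I am scraping show the house features in a table, which is perfectly scrapable.

However, not all ads contain the same features: some ads show information for every possible feature, while others do not.

Because most ads are missing some features, the table differs from ad to ad. For example, 'size' can be in row 1, column 2, or in row 2, column 1, or somewhere else.

Compare the tables in ad1 and ad2.

I would like to scrape all the tables and get as much information as possible from each ad. In addition, the information should be assigned to the correct variable, i.e. '205m2' should be assigned to 'size' and not to 'rooms'.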


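Approach

My current approach is to first scrape the variable name from the column header and then assign the accompanying value to that variable, i.e. first scrape the column header, check whether it is the variable 'size', then scrape its value and assign the value to the variable 'size'.

Non-working code: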

for i in range(1,5):
    x = response.xpath('//*[@id="details"]/table/tr[{i}]/td[1]/text()').extract_first().strip()
    if 'size' in x:
        size = response.xpath('//*[@id="details"]/table/tr[{i}]/td[2]/text()').extract_first().strip()
    elif 'rooms' in x:
        rooms = response.xpath('//*[@id="details"]/table/tr[{i}]/td[2]/text()').extract_first().strip()

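Intuitively, this code loops over the column headers, checks which variable each one is, and then assigns the value to the corresponding variable.

However, all I get when I run this code is errors. What am I doing wrong? Is there a better way?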
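
The question does not show the traceback, so the following is only a sketch of two likely causes. First, the XPath strings contain {i} but are never formatted, so the selector receives the literal text tr[{i}], which is not a valid XPath expression. Second, extract_first() returns None when nothing matches, and None has no .strip() method. The names template, missing and label are illustrative and do not come from the question.

```python
# The XPath string from the question, as a plain (non-f) string.
template = '//*[@id="details"]/table/tr[{i}]/td[1]/text()'

i = 2
# Without .format() or an f-string prefix, the placeholder stays literal.
print("{i}" in template)  # True

# Interpolating the row index produces the XPath that was intended.
xpath = template.format(i=i)
print(xpath)  # //*[@id="details"]/table/tr[2]/td[1]/text()

# Stand-in for an extract_first() call that matched nothing.
missing = None
try:
    missing.strip()
except AttributeError as exc:
    print("AttributeError:", exc)

# A defensive variant: fall back to an empty string before stripping.
label = (missing or "").strip()
print(repr(label))  # ''
```

Even with both issues fixed, the fixed row and column indices would still depend on where each feature happens to sit in the table.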

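Best answer

If you look at the HTML structure, every table row always has 4 cells, and each row consists of a field name, a field value, another field name and another field value: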

<div class="details" id="details" >
<strong class="sec_name">Property details</strong>

<table cellpadding="0" cellspacing="0" style="margin-top:0px;">
<tr>
<td class='title'>Ref.: </td>
<td>Via Scarpellini (1019194)</td>

<td class='title'>Ad date: </td>
<td>
23/05/2017 </td>
</tr>

<tr> <td class='title'>Rooms: </td>
<td> 5</td>
<td class='title'>Bathrooms:</td>
<td> 3</td>
</tr>
<tr>
<td class='title'>Floor area: </td>
<td> 292m&sup2;</td>
<td class='title'>Heating: </td>
<td> Communal</td>
</tr>
...

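A common pattern is to loop over each table row and process the cells in pairs, using the following-sibling axis in XPath.

Let's first look at each table row, using one of the links in scrapy shell (with the CSS selector div#details table tr):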
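
The pairing idea itself can be illustrated without Scrapy. The sketch below assumes each row has already been reduced to a flat list of cell texts (the rows value is hand-copied from the HTML excerpt above, not scraped) and pairs the cells by position, which is a simpler stand-in for the following-sibling axis. The helper name pair_cells is hypothetical.

```python
# Cell texts per table row, hand-copied from the HTML excerpt above.
rows = [
    ["Ref.: ", "Via Scarpellini (1019194)", "Ad date: ", "23/05/2017"],
    ["Rooms: ", " 5", "Bathrooms:", " 3"],
    ["Floor area: ", " 292m²", "Heating: ", " Communal"],
]


def pair_cells(cells):
    """Return (name, value) pairs from a name, value, name, value list."""
    # Even positions hold field names, odd positions hold field values.
    return zip(cells[0::2], cells[1::2])


features = {}
for cells in rows:
    for name, value in pair_cells(cells):
        # Drop surrounding whitespace and the trailing colon of the label.
        features[name.strip().rstrip(":")] = value.strip()

print(features["Floor area"])  # 292m²
```

Because the lookup is by field name, it does not matter which row or column a feature appears in.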

$ scrapy shell https://property-italy.immobiliare.it/62225510-penthouses-to-rent-Rome.html
>>> from pprint import pprint
>>> pprint(response.css('div#details table tr'))
[<Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\n\t \t \t \t '>,
<Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\n\t \t <td class="title"'>,
<Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\t <td class="title">Rooms: </td'>,
<Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\n\t \t <td class="title">Flo'>,
<Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\t <td class="title">Terrace: </'>,
<Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\t <td class="title">Total floor'>,
<Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\t <td class="title">Garden: </t'>,
<Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\t <td class="title">Furniture: '>]

For each row, we can check that it contains 4 <td> cells (except the first row, which is empty):

>>> for row in response.css('div#details table tr'):
...     pprint(row.xpath('.//td'))
...
[]
[<Selector xpath='.//td' data='<td class="title">Ref.: </td>'>,
<Selector xpath='.//td' data='<td>Trieste</td>'>,
<Selector xpath='.//td' data='<td class="title">Ad date: </td>'>,
<Selector xpath='.//td' data='<td>\n\t\t 13/06/2017\t\t</td>'>]
[<Selector xpath='.//td' data='<td class="title">Rooms: </td>'>,
<Selector xpath='.//td' data='<td> 4</td>'>,
<Selector xpath='.//td' data='<td class="title">Bathrooms:</td>'>,
<Selector xpath='.//td' data='<td> 3</td>'>]
[<Selector xpath='.//td' data='<td class="title">Floor area: </td>'>,
<Selector xpath='.//td' data='<td> 132m²</td>'>,
<Selector xpath='.//td' data='<td class="title">Heating: </td>'>,
<Selector xpath='.//td' data='<td> Autonomous</td>'>]
[<Selector xpath='.//td' data='<td class="title">Terrace: </td>'>,
<Selector xpath='.//td' data='<td> Yes</td>'>,
<Selector xpath='.//td' data='<td class="title">Floor: </td>'>,
<Selector xpath='.//td' data='<td>2</td>'>]
[<Selector xpath='.//td' data='<td class="title">Total floors: </td>'>,
<Selector xpath='.//td' data='<td>3</td>'>,
<Selector xpath='.//td' data='<td class="title">Garage:</td>'>,
<Selector xpath='.//td' data='<td> no</td>'>]
[<Selector xpath='.//td' data='<td class="title">Garden: </td>'>,
<Selector xpath='.//td' data='<td>Nothing</td>'>,
<Selector xpath='.//td' data='<td class="title">Condition: </td>'>,
<Selector xpath='.//td' data='<td>excellent/refurbished</td>'>]
[<Selector xpath='.//td' data='<td class="title">Furniture: </td>'>,
<Selector xpath='.//td' data='<td>Partly Furnished</td>'>,
<Selector xpath='.//td' data='<td class="title">Property type: </td>'>,
<Selector xpath='.//td' data='<td>whole estate</td>'>]

In the data= preview of each selector, you can see that every other <td> has class "title", so let's try to get that information, again with a CSS selector (td.title):

>>> for row in response.css('div#details table tr'):
...     print(row.css('td.title').get())
...
None
<td class="title">Ref.: </td>
<td class="title">Rooms: </td>
<td class="title">Floor area: </td>
<td class="title">Terrace: </td>
<td class="title">Total floors: </td>
<td class="title">Garden: </td>
<td class="title">Furniture: </td>

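The field values are in the <td> right after each <td class="title">. XPath's following-sibling::td[1] can be used here. It roughly means "give me the <td> that is a child of the same parent as my current position (a sibling), but only the first one that comes after me". The nice thing about Scrapy selectors is that you can chain CSS and XPath: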

>>> for row in response.css('div#details table tr'):
...     print('---some row---')
...     for cell in row.css('td.title'):
...         print(' ---some cell---')
...         print(cell.xpath('following-sibling::td[1]').get())
...
---some row---
---some row---
---some cell---
<td>Trieste</td>
---some cell---
<td>
13/06/2017 </td>
---some row---
---some cell---
<td> 4</td>
---some cell---
<td> 3</td>
---some row---
---some cell---
<td> 132m²</td>
---some cell---
<td> Autonomous</td>
---some row---
---some cell---
<td> Yes</td>
---some cell---
<td>2</td>
---some row---
---some cell---
<td>3</td>
---some cell---
<td> no</td>
---some row---
---some cell---
<td>Nothing</td>
---some cell---
<td>excellent/refurbished</td>
---some row---
---some cell---
<td>Partly Furnished</td>
---some cell---
<td>whole estate</td>
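
To make the meaning of following-sibling::td[1] concrete without a live page, here is a small sketch that emulates it using only the standard library. xml.etree.ElementTree does not support the following-sibling axis, so the helper first_following_td (a hypothetical name) walks the parent's children by hand. The row is a well-formed fragment adapted from the table above; real pages are often not well-formed XML, so this is an illustration only, not a replacement for the Scrapy selectors.

```python
import xml.etree.ElementTree as ET

# One well-formed row, adapted from the table above.
row = ET.fromstring(
    "<tr>"
    "<td class='title'>Rooms: </td><td> 4</td>"
    "<td class='title'>Bathrooms:</td><td> 3</td>"
    "</tr>"
)


def first_following_td(parent, cell):
    """Emulate following-sibling::td[1]: the first <td> after `cell`."""
    children = list(parent)
    position = children.index(cell)
    # Only siblings that come after `cell` are candidates.
    for sibling in children[position + 1:]:
        if sibling.tag == "td":
            return sibling
    return None


for title in row.findall("td[@class='title']"):
    value = first_following_td(row, title)
    print(repr(title.text), "->", repr(value.text))
```

The last cell in a row has no following sibling, in which case the helper returns None.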

So we have the field names and the field values. Let's combine the two into key/value pairs:

>>> for row in response.css('div#details table tr'):
...     for cell in row.css('td.title'):
...         print((cell.xpath('string(.)').get(), cell.xpath('string(following-sibling::td[1])').get()))
...
('Ref.: ', 'Trieste')
('Ad date: ', '\n\t\t 13/06/2017\t\t')
('Rooms: ', ' 4')
('Bathrooms:', ' 3')
('Floor area: ', ' 132m²')
('Heating: ', ' Autonomous')
('Terrace: ', ' Yes')
('Floor: ', '2')
('Total floors: ', '3')
('Garage:', ' no')
('Garden: ', 'Nothing')
('Condition: ', 'excellent/refurbished')
('Furniture: ', 'Partly Furnished')
('Property type: ', 'whole estate')

You can turn this into a nice Python dictionary with a dict comprehension:

>>> {cell.xpath('string(.)').get():
... cell.xpath('string(following-sibling::td[1])').get()
... for row in response.css('div#details table tr')
... for cell in row.css('td.title')}
{'Ref.: ': 'Trieste', 'Ad date: ': '\n\t\t 13/06/2017\t\t', 'Rooms: ': ' 4', 'Bathrooms:': ' 3', 'Floor area: ': ' 132m²', 'Heating: ': ' Autonomous', 'Terrace: ': ' Yes', 'Floor: ': '2', 'Total floors: ': '3', 'Garage:': ' no', 'Garden: ': 'Nothing', 'Condition: ': 'excellent/refurbished', 'Furniture: ': 'Partly Furnished', 'Property type: ': 'whole estate'}

>>> pprint(_)
{'Ad date: ': '\n\t\t 13/06/2017\t\t',
'Bathrooms:': ' 3',
'Condition: ': 'excellent/refurbished',
'Floor area: ': ' 132m²',
'Floor: ': '2',
'Furniture: ': 'Partly Furnished',
'Garage:': ' no',
'Garden: ': 'Nothing',
'Heating: ': ' Autonomous',
'Property type: ': 'whole estate',
'Ref.: ': 'Trieste',
'Rooms: ': ' 4',
'Terrace: ': ' Yes',
'Total floors: ': '3'}

Here I used XPath's string() to get the text content of each <td> cell, but I could also use normalize-space() to get rid of the extra whitespace:

>>> {cell.xpath('normalize-space(.)').get():
... cell.xpath('normalize-space(following-sibling::td[1])').get()
... for row in response.css('div#details table tr')
... for cell in row.css('td.title')}
{'Ref.:': 'Trieste', 'Ad date:': '13/06/2017', 'Rooms:': '4', 'Bathrooms:': '3', 'Floor area:': '132m²', 'Heating:': 'Autonomous', 'Terrace:': 'Yes', 'Floor:': '2', 'Total floors:': '3', 'Garage:': 'no', 'Garden:': 'Nothing', 'Condition:': 'excellent/refurbished', 'Furniture:': 'Partly Furnished', 'Property type:': 'whole estate'}
>>> pprint(_)
{'Ad date:': '13/06/2017',
'Bathrooms:': '3',
'Condition:': 'excellent/refurbished',
'Floor area:': '132m²',
'Floor:': '2',
'Furniture:': 'Partly Furnished',
'Garage:': 'no',
'Garden:': 'Nothing',
'Heating:': 'Autonomous',
'Property type:': 'whole estate',
'Ref.:': 'Trieste',
'Rooms:': '4',
'Terrace:': 'Yes',
'Total floors:': '3'}
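
With a dictionary like this, the question's original goal (assigning a value such as '205m2' to size rather than rooms, and coping with ads that lack some features) could be handled roughly as follows. FIELD_MAP, its variable names and the to_record helper are hypothetical and chosen for illustration; ad1 is a subset of the output above, and ad2 is a made-up ad with missing features.

```python
# Scraped label -> variable name used in later analysis (hypothetical names).
FIELD_MAP = {
    "Floor area:": "size",
    "Rooms:": "rooms",
    "Bathrooms:": "bathrooms",
    "Garage:": "garage",
}


def to_record(details):
    """Map scraped labels to fixed variable names; absent fields become None."""
    return {name: details.get(label) for label, name in FIELD_MAP.items()}


# First ad: values copied from the output above (subset).
ad1 = {"Rooms:": "4", "Bathrooms:": "3", "Floor area:": "132m²", "Garage:": "no"}
# Second ad: a made-up example that lacks some features.
ad2 = {"Floor area:": "205m2", "Rooms:": "5"}

records = [to_record(ad1), to_record(ad2)]
for record in records:
    print(record)
```

Every record has the same keys regardless of which features an ad shows, so a list like records can be passed to pandas.DataFrame(records) for analysis.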

Regarding python - Scraping tables with different element positions on each page, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/44585747/
