python - Scraping tables with different element positioning on each page

Reposted by: 行者123 Updated: 2023-11-28 22:26:44

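Setup

I am using Scrapy to scrape housing ads and then analysing the data with pandas.

For each housing ad, I collect features of the house such as 'size', 'number of rooms', etc. These features are then yielded in a dictionary.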

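Problem

The housing ads I am scraping show the house features in a table, which is perfectly scrapable.

However, not all ads contain the same features: some ads show information for every possible feature, others do not.

Since most ads are missing some features, the table differs from ad to ad. For example, 'size' can be in row 1, column 2, or in row 2, column 1, or somewhere else.

Compare the tables in ad1 and ad2 to see the difference.

I would like to scrape all the tables and get as much information as possible from each ad. In addition, each value should be assigned to the correct variable, i.e. '205m²' should be assigned to 'size' and not to 'rooms'.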

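Approach

My current approach is to first scrape the variable name from the column header and then assign the accompanying value to that variable. That is, first scrape the column header, check whether it is the variable 'size', then scrape its value and assign it to the variable 'size'.

Non-working code: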

for i in range(1,5):
        x = response.xpath('//*[@id="details"]/table/tr[{i}]/td[1]/text()').extract_first().strip()
        if 'size' in x:
            size = response.xpath('//*[@id="details"]/table/tr[{i}]/td[2]/text()').extract_first().strip()
        elif 'rooms' in x:
            rooms = response.xpath('//*[@id="details"]/table/tr[{i}]/td[2]/text()').extract_first().strip()

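Intuitively, this code iterates over the column headers, checks which variable each one is, and then assigns the value to the corresponding variable.

However, I only get errors when running this code. What am I doing wrong? Is there a better way?

Best answer

If you look at the HTML structure, each table row always has 4 cells, and every row is a field name, a field value, another field name, another field value: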

<div class="details" id="details" >
  <strong class="sec_name">Property details</strong>

  <table cellpadding="0" cellspacing="0" style="margin-top:0px;">
          <tr>
                    <td class='title'>Ref.: </td>
        <td>Via Scarpellini (1019194)</td>

                        <td class='title'>Ad date: </td>
    <td>
      23/05/2017        </td>
                            </tr>

    <tr>          <td class='title'>Rooms: </td>
      <td> 5</td>
                              <td class='title'>Bathrooms:</td>
      <td> 3</td>
                            </tr>
    <tr>
              <td class='title'>Floor area: </td>
      <td> 292m&sup2;</td>
                                        <td class='title'>Heating: </td>
        <td> Communal</td>
                            </tr>
    ...
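
Against markup like this, the loop in the question probably fails for two reasons. The traceback is not shown, so this is a best guess rather than a confirmed diagnosis. First, the XPath strings contain a literal {i} that is never substituted, because they are plain strings with no .format() call or f prefix, so the XPath engine receives tr[{i}] as written. Second, extract_first() returns None when nothing matches, and calling .strip() on None raises AttributeError. The sketch below uses only the standard library; row_label_xpath and safe_strip are hypothetical helper names:

```python
# The XPath from the question, kept as a template with an explicit placeholder.
TEMPLATE = '//*[@id="details"]/table/tr[{i}]/td[1]/text()'


def row_label_xpath(i):
    """Build the XPath for the first cell of row i (XPath positions are 1-based)."""
    return TEMPLATE.format(i=i)


def safe_strip(value):
    """Strip whitespace, tolerating the None returned when nothing matched."""
    return value.strip() if value is not None else ''


print(TEMPLATE)            # what the question's code actually sends: tr[{i}]
print(row_label_xpath(2))  # what was presumably intended: tr[2]
print(repr(safe_strip(None)))
print(repr(safe_strip(' 5')))
```

Even with both fixes, checks such as 'size' in x would not match the labels shown above ('Floor area: ', 'Rooms: '), because the comparison is case-sensitive and the label text differs.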

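A common pattern is to loop over each table row and process the cells in pairs, using XPath's following-sibling axis.

Let's first look at each table row, using scrapy shell on one of the links (with the CSS selector div#details table tr):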

$ scrapy shell https://property-italy.immobiliare.it/62225510-penthouses-to-rent-Rome.html
>>> from pprint import pprint
>>> pprint(response.css('div#details table tr'))
[<Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\n\t      \t        \t          \t       '>,
 <Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\n\t        \t        <td class="title"'>,
 <Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\t      <td class="title">Rooms: </td'>,
 <Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\n\t      \t      <td class="title">Flo'>,
 <Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\t      <td class="title">Terrace: </'>,
 <Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\t      <td class="title">Total floor'>,
 <Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\t      <td class="title">Garden: </t'>,
 <Selector xpath="descendant-or-self::div[@id = 'details']/descendant-or-self::*/table/descendant-or-self::*/tr" data='<tr>\t      <td class="title">Furniture: '>]

For each row, we can check that it contains 4 <td> cells (except the first row, which is empty):

>>> for row in response.css('div#details table tr'):
...     pprint(row.xpath('.//td'))
... 
[]
[<Selector xpath='.//td' data='<td class="title">Ref.: </td>'>,
 <Selector xpath='.//td' data='<td>Trieste</td>'>,
 <Selector xpath='.//td' data='<td class="title">Ad date: </td>'>,
 <Selector xpath='.//td' data='<td>\n\t\t  13/06/2017\t\t</td>'>]
[<Selector xpath='.//td' data='<td class="title">Rooms: </td>'>,
 <Selector xpath='.//td' data='<td> 4</td>'>,
 <Selector xpath='.//td' data='<td class="title">Bathrooms:</td>'>,
 <Selector xpath='.//td' data='<td> 3</td>'>]
[<Selector xpath='.//td' data='<td class="title">Floor area: </td>'>,
 <Selector xpath='.//td' data='<td> 132m²</td>'>,
 <Selector xpath='.//td' data='<td class="title">Heating: </td>'>,
 <Selector xpath='.//td' data='<td> Autonomous</td>'>]
[<Selector xpath='.//td' data='<td class="title">Terrace: </td>'>,
 <Selector xpath='.//td' data='<td> Yes</td>'>,
 <Selector xpath='.//td' data='<td class="title">Floor: </td>'>,
 <Selector xpath='.//td' data='<td>2</td>'>]
[<Selector xpath='.//td' data='<td class="title">Total floors: </td>'>,
 <Selector xpath='.//td' data='<td>3</td>'>,
 <Selector xpath='.//td' data='<td class="title">Garage:</td>'>,
 <Selector xpath='.//td' data='<td> no</td>'>]
[<Selector xpath='.//td' data='<td class="title">Garden: </td>'>,
 <Selector xpath='.//td' data='<td>Nothing</td>'>,
 <Selector xpath='.//td' data='<td class="title">Condition: </td>'>,
 <Selector xpath='.//td' data='<td>excellent/refurbished</td>'>]
[<Selector xpath='.//td' data='<td class="title">Furniture: </td>'>,
 <Selector xpath='.//td' data='<td>Partly Furnished</td>'>,
 <Selector xpath='.//td' data='<td class="title">Property type: </td>'>,
 <Selector xpath='.//td' data='<td>whole estate</td>'>]

In the data= preview of the selectors, you can see that every other <td> has the class "title", so let's use a CSS selector again (td.title) to get at that information:

>>> for row in response.css('div#details table tr'):
...     print(row.css('td.title').get())
... 
None
<td class="title">Ref.: </td>
<td class="title">Rooms: </td>
<td class="title">Floor area: </td>
<td class="title">Terrace: </td>
<td class="title">Total floors: </td>
<td class="title">Garden: </td>
<td class="title">Furniture: </td>

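The field values are in the <td> right after each <td class="title">. XPath's following-sibling::td[1] can be used here. It roughly means "give me the <td> that is a child of the same parent as my current position (a sibling), but only the first one after me". A nice thing about Scrapy selectors is that you can chain CSS and XPath: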

>>> for row in response.css('div#details table tr'):
...     print('---some row---')
...     for cell in row.css('td.title'):
...         print('  ---some cell---')
...         print(cell.xpath('following-sibling::td[1]').get())
... 
---some row---
---some row---
  ---some cell---
<td>Trieste</td>
  ---some cell---
<td>
          13/06/2017        </td>
---some row---
  ---some cell---
<td> 4</td>
  ---some cell---
<td> 3</td>
---some row---
  ---some cell---
<td> 132m²</td>
  ---some cell---
<td> Autonomous</td>
---some row---
  ---some cell---
<td> Yes</td>
  ---some cell---
<td>2</td>
---some row---
  ---some cell---
<td>3</td>
  ---some cell---
<td> no</td>
---some row---
  ---some cell---
<td>Nothing</td>
  ---some cell---
<td>excellent/refurbished</td>
---some row---
  ---some cell---
<td>Partly Furnished</td>
  ---some cell---
<td>whole estate</td>

So we have the field names and the field values. Let's combine the two into key/value pairs:

>>> for row in response.css('div#details table tr'):
...     for cell in row.css('td.title'):
...         print((cell.xpath('string(.)').get(), cell.xpath('string(following-sibling::td[1])').get()))
... 
('Ref.: ', 'Trieste')
('Ad date: ', '\n\t\t  13/06/2017\t\t')
('Rooms: ', ' 4')
('Bathrooms:', ' 3')
('Floor area: ', ' 132m²')
('Heating: ', ' Autonomous')
('Terrace: ', ' Yes')
('Floor: ', '2')
('Total floors: ', '3')
('Garage:', ' no')
('Garden: ', 'Nothing')
('Condition: ', 'excellent/refurbished')
('Furniture: ', 'Partly Furnished')
('Property type: ', 'whole estate')

You can turn that into a nice Python dict with a dict comprehension:

>>> {cell.xpath('string(.)').get():
...      cell.xpath('string(following-sibling::td[1])').get()
...  for row in response.css('div#details table tr')
...   for cell in row.css('td.title')}
{'Ref.: ': 'Trieste', 'Ad date: ': '\n\t\t  13/06/2017\t\t', 'Rooms: ': ' 4', 'Bathrooms:': ' 3', 'Floor area: ': ' 132m²', 'Heating: ': ' Autonomous', 'Terrace: ': ' Yes', 'Floor: ': '2', 'Total floors: ': '3', 'Garage:': ' no', 'Garden: ': 'Nothing', 'Condition: ': 'excellent/refurbished', 'Furniture: ': 'Partly Furnished', 'Property type: ': 'whole estate'}

>>> pprint(_)
{'Ad date: ': '\n\t\t  13/06/2017\t\t',
 'Bathrooms:': ' 3',
 'Condition: ': 'excellent/refurbished',
 'Floor area: ': ' 132m²',
 'Floor: ': '2',
 'Furniture: ': 'Partly Furnished',
 'Garage:': ' no',
 'Garden: ': 'Nothing',
 'Heating: ': ' Autonomous',
 'Property type: ': 'whole estate',
 'Ref.: ': 'Trieste',
 'Rooms: ': ' 4',
 'Terrace: ': ' Yes',
 'Total floors: ': '3'}

I am using XPath's string() here to get the text content of each <td> cell, but I could also use normalize-space() to get rid of the extra whitespace:

>>> {cell.xpath('normalize-space(.)').get():
...      cell.xpath('normalize-space(following-sibling::td[1])').get()
...  for row in response.css('div#details table tr')
...   for cell in row.css('td.title')}
{'Ref.:': 'Trieste', 'Ad date:': '13/06/2017', 'Rooms:': '4', 'Bathrooms:': '3', 'Floor area:': '132m²', 'Heating:': 'Autonomous', 'Terrace:': 'Yes', 'Floor:': '2', 'Total floors:': '3', 'Garage:': 'no', 'Garden:': 'Nothing', 'Condition:': 'excellent/refurbished', 'Furniture:': 'Partly Furnished', 'Property type:': 'whole estate'}
>>> pprint(_)
{'Ad date:': '13/06/2017',
 'Bathrooms:': '3',
 'Condition:': 'excellent/refurbished',
 'Floor area:': '132m²',
 'Floor:': '2',
 'Furniture:': 'Partly Furnished',
 'Garage:': 'no',
 'Garden:': 'Nothing',
 'Heating:': 'Autonomous',
 'Property type:': 'whole estate',
 'Ref.:': 'Trieste',
 'Rooms:': '4',
 'Terrace:': 'Yes',
 'Total floors:': '3'}
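
The question also asks for each value to end up in the right variable (for example 'size' rather than 'rooms') even when an ad lacks some features. One possible way to do that, sketched with the standard library only, is to map the scraped labels onto fixed field names and leave absent ones as None. The mapping LABEL_TO_FIELD and the helpers to_record and leading_number are hypothetical names, and treating 'Floor area' as the question's 'size' is an assumption:

```python
import re

# Hypothetical mapping from page labels to the variable names in the question.
LABEL_TO_FIELD = {
    'floor area': 'size',
    'rooms': 'rooms',
    'bathrooms': 'bathrooms',
}


def to_record(details):
    """Map label/value pairs onto fixed field names; absent fields stay None."""
    record = dict.fromkeys(LABEL_TO_FIELD.values())
    for label, value in details.items():
        # 'Floor area: ' and 'Floor area:' both normalise to 'floor area'.
        key = label.strip().rstrip(':').strip().lower()
        field = LABEL_TO_FIELD.get(key)
        if field is not None:
            record[field] = value.strip()
    return record


def leading_number(text):
    """Return the leading integer of a value such as '132m²', or None."""
    match = re.match(r'\d+', text or '')
    return int(match.group()) if match else None


# An ad that lacks the 'Bathrooms' feature.
scraped = {'Rooms:': '4', 'Floor area:': '132m²', 'Heating:': 'Autonomous'}
record = to_record(scraped)
print(record)
print(leading_number(record['size']))
```

A list of such records can then be passed to pandas.DataFrame, where the None entries appear as missing values.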

This post on python - Scraping tables with different element positioning on each page is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/44585747/
