
parsing - Scrapy: parsing list items into separate rows


I have been trying to adapt the answer to this question to my problem, without success.

Here is some sample HTML:

<div id="provider-region-addresses">
<h3>Contact details</h3>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>More information</dt>
<dd>North Shore Hospital</dd><dt>Physical address</dt>
<dd>124 Shakespeare Rd, Takapuna, Auckland 0620</dd><dt>Postal address</dt>
<dd>Private Bag 93503, Takapuna, Auckland 0740</dd><dt>Postcode</dt>
<dd>0740</dd><dt>District/town</dt>

<dd>
North Shore, Takapuna</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 486 8996</dd><dt>Fax</dt>
<dd>(09) 486 8342</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>Physical address</dt>
<dd>Helensville</dd><dt>Postal address</dt>
<dd>PO Box 13, Helensville 0840</dd><dt>Postcode</dt>
<dd>0840</dd><dt>District/town</dt>

<dd>
Rodney, Helensville</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 420 9450</dd><dt>Fax</dt>
<dd>(09) 420 7050</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>Physical address</dt>
<dd>Warkworth</dd><dt>Postal address</dt>
<dd>PO Box 505, Warkworth 0941</dd><dt>Postcode</dt>
<dd>0941</dd><dt>District/town</dt>

<dd>
Rodney, Warkworth</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 422 2700</dd><dt>Fax</dt>
<dd>(09) 422 2709</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>More information</dt>
<dd>Waitakere Hospital</dd><dt>Physical address</dt>
<dd>55-75 Lincoln Rd, Henderson, Auckland 0610</dd><dt>Postal address</dt>
<dd>Private Bag 93115, Henderson, Auckland 0650</dd><dt>Postcode</dt>
<dd>0650</dd><dt>District/town</dt>

<dd>
Waitakere, Henderson</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 839 0000</dd><dt>Fax</dt>
<dd>(09) 837 6634</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
<h2 class="toggler nohide">Auckland</h2>
<dl class="clear">
<dt>More information</dt>
<dd>Hibiscus Coast Community Health Centre</dd><dt>Physical address</dt>
<dd>136 Whangaparaoa Rd, Red Beach 0932</dd><dt>Postcode</dt>
<dd>0932</dd><dt>District/town</dt>

<dd>
Rodney, Red Beach</dd><dt>Region</dt>
<dd>Auckland</dd><dt>Phone</dt>
<dd>(09) 427 0300</dd><dt>Fax</dt>
<dd>(09) 427 0391</dd><dt>Website</dt>
<dd><a target="_blank" href="http://www.healthpoint.co.nz/default,61031.sm">http://www.healthpoint.co.nz/default,61031...</a></dd>
</dl>
</div>





Here is my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from webhealth.items1 import WebhealthItem1

class WebhealthSpider(BaseSpider):

    name = "webhealth_content1"

    download_delay = 5

    allowed_domains = ["webhealth.co.nz"]
    start_urls = [
        "http://auckland.webhealth.co.nz/provider/service/view/914136/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        results = hxs.select('//*[@id="content"]/div[1]')
        items1 = []
        for result in results:
            item = WebhealthItem1()
            item['url'] = result.select('//dl/a/@href').extract()
            item['practice'] = result.select('//h1/text()').extract()
            item['hours'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Contact hours")]/following-sibling::dd[1]/text()').extract())
            item['more_hours'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"More information")]/following-sibling::dd[1]/text()').extract())
            item['physical_address'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Physical address")]/following-sibling::dd[1]/text()').extract())
            item['postal_address'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Postal address")]/following-sibling::dd[1]/text()').extract())
            item['postcode'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Postcode")]/following-sibling::dd[1]/text()').extract())
            item['district_town'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"District/town")]/following-sibling::dd[1]/text()').extract())
            item['region'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Region")]/following-sibling::dd[1]/text()').extract())
            item['phone'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Phone")]/following-sibling::dd[1]/text()').extract())
            item['website'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Website")]/following-sibling::dd[1]/a/@href').extract())
            item['email'] = map(unicode.strip, result.select('//div/dl/dt[contains(text(),"Email")]/following-sibling::dd[1]/a/text()').extract())
            items1.append(item)
        return items1


From here, how do I parse the list items into separate rows, with the corresponding //h1/text() value in a name field? At the moment I get a list of every XPath match in a single cell. Does this have to do with how I am declaring my XPaths?

Thanks

Best answer

First, you are using results = hxs.select('//*[@id="content"]/div[1]'), and

    results = hxs.select('//*[@id="content"]/div[1]')
    for result in results:
        ...

will loop over only one div: the first child div of <div id="content" class="clear">.

Do you instead want to loop over each <dl class="clear">...</dl> inside //*[@id="content"]/div[1]? (Using //*[@id="content"]/div[@class="content"] may be easier to maintain.)

        results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')


Second, in each loop iteration you are using absolute XPath expressions (//div...):

    result.select('//div/dl/dt[contains(text(), "...")]/following-sibling::dd[1]/text()')

Starting from the document root node, this selects every dt whose text content matches, and the dd following it, anywhere in the document - not just within the current result.

For details, see this section in Scrapy docs.

You need to use relative XPath expressions instead: expressions relative to each result, which represents one dl, such as dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text() or ./dt[contains(text(), "Contact hours")]/following-sibling::dd[1]/text()
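
To see the difference in isolation, here is a minimal sketch (the two-entry HTML document is made up for illustration, and it assumes the same old-style HtmlXPathSelector API used in the question, which also accepts a text argument):

    from scrapy.selector import HtmlXPathSelector

    # Toy document with two <dl> blocks, mirroring the page structure above.
    html = """
    <div>
    <dl><dt>Phone</dt><dd>(09) 111 1111</dd></dl>
    <dl><dt>Phone</dt><dd>(09) 222 2222</dd></dl>
    </div>
    """

    hxs = HtmlXPathSelector(text=html)
    for dl in hxs.select('//dl'):
        # Absolute expression: starts over from the document root,
        # so BOTH phone numbers come back on every iteration.
        print dl.select('//dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract()
        # Relative expression: scoped to the current <dl>,
        # so only this block's phone number comes back.
        print dl.select('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract()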

The "practice" field, however, can keep using the absolute XPath expression //h1/text(). Better yet, set a practice variable once and reuse it for every WebhealthItem1() instance:

        ...
        practice = hxs.select('//h1/text()').extract()
        for result in results:
            item = WebhealthItem1()
            ...
            item['practice'] = practice


With these changes, your spider would look like this:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from webhealth.items1 import WebhealthItem1

class WebhealthSpider(BaseSpider):

    name = "webhealth_content1"

    download_delay = 5

    allowed_domains = ["webhealth.co.nz"]
    start_urls = [
        "http://auckland.webhealth.co.nz/provider/service/view/914136/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        practice = hxs.select('//h1/text()').extract()
        items1 = []

        results = hxs.select('//*[@id="content"]/div[@class="content"]/div/dl')
        for result in results:
            item = WebhealthItem1()
            #item['url'] = result.select('//dl/a/@href').extract()
            item['practice'] = practice
            item['hours'] = map(unicode.strip,
                result.select('dt[contains(., "Contact hours")]/following-sibling::dd[1]/text()').extract())
            item['more_hours'] = map(unicode.strip,
                result.select('dt[contains(., "More information")]/following-sibling::dd[1]/text()').extract())
            item['physical_address'] = map(unicode.strip,
                result.select('dt[contains(., "Physical address")]/following-sibling::dd[1]/text()').extract())
            item['postal_address'] = map(unicode.strip,
                result.select('dt[contains(., "Postal address")]/following-sibling::dd[1]/text()').extract())
            item['postcode'] = map(unicode.strip,
                result.select('dt[contains(., "Postcode")]/following-sibling::dd[1]/text()').extract())
            item['district_town'] = map(unicode.strip,
                result.select('dt[contains(., "District/town")]/following-sibling::dd[1]/text()').extract())
            item['region'] = map(unicode.strip,
                result.select('dt[contains(., "Region")]/following-sibling::dd[1]/text()').extract())
            item['phone'] = map(unicode.strip,
                result.select('dt[contains(., "Phone")]/following-sibling::dd[1]/text()').extract())
            item['website'] = map(unicode.strip,
                result.select('dt[contains(., "Website")]/following-sibling::dd[1]/a/@href').extract())
            item['email'] = map(unicode.strip,
                result.select('dt[contains(., "Email")]/following-sibling::dd[1]/a/text()').extract())
            items1.append(item)
        return items1
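
A side note on the "everything ends up in one cell" symptom: .extract() always returns a list, so each field above still holds a (now usually one-element) list. If you want each exported cell to contain a single string, one option (a sketch, not part of the original answer; the helper name clean is made up) is to join each field before assigning it:

    def clean(result, xpath):
        # Extract, strip whitespace, and collapse the list into one
        # comma-separated string so an exported cell holds a single value.
        parts = map(unicode.strip, result.select(xpath).extract())
        return u", ".join(parts)

    # e.g. inside the loop:
    # item['phone'] = clean(result, 'dt[contains(., "Phone")]/following-sibling::dd[1]/text()')

With the spider returning one item per <dl>, writing each item out as a separate row is then just a matter of a feed export, e.g. scrapy crawl webhealth_content1 -o results.csv -t csv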


I also created a Cloud9 IDE project with this code. You can play with it at https://c9.io/redapple/so_19309960

About parsing - Scrapy: parsing list items into separate rows, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/19309960/
