I am scraping a website and I want to generate an XML file in which every region is nested inside the country it belongs to.
def parse(self, response):
    # here I parse the country names
    country_names = response.xpath('//some countries/text()').extract()
    for name_of_country in country_names:
        yield {"Country": name_of_country}
        yield Request(country_url, callback=self.parse_regions)

def parse_regions(self, response):
    # here I parse the regions of each country
    regions = response.xpath('//some regions/text()').extract()
    for region in regions:
        yield {"Region": region}
Right now the XML comes out like this:
<Country1></Country1>
<Country2></Country2>
<Region>Region1</Region>
<Region>Region2</Region>
<Region>Region3</Region>
<Region>Region1</Region>
<Region>Region2</Region>
<Region>Region3</Region>
I want the XML to look like this:
<Country1>
<Region>Region1</Region>
<Region>Region2</Region>
<Region>Region3</Region>
</Country1>
<Country2>
<Region>Region1</Region>
<Region>Region2</Region>
<Region>Region3</Region>
</Country2>
I have never used XML, but you can send the Country to the second request (using meta=) and then, in parse_region, create one item that contains all the data.
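Applied to the spider from the question, the idea looks roughly like this (a sketch only: the XPath expressions and country_url are the question's own placeholders, not working selectors):

def parse(self, response):
    country_names = response.xpath('//some countries/text()').extract()
    for name_of_country in country_names:
        # pass the country name along with the request for its regions
        yield Request(country_url, meta={'country': name_of_country},
                      callback=self.parse_regions)

def parse_regions(self, response):
    # read the country name back and yield one item per country
    country = response.meta['country']
    regions = response.xpath('//some regions/text()').extract()
    yield {country: regions}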
I use http://quotes.toscrape.com to get some tags and use them as the Country. I then send them to parse_region, which gets all the regions and yields only one item. The solution is not perfect, because it gives:
<?xml version="1.0" encoding="utf-8"?>
<items>
<item>
<Country books>
<value>“The person, ...”</value>
<value>“Good friends, ...”</value>
</Country books>
</item>
<item>
<Country humor>
<value>“The person, ...”</value>
<value>“A day without ...”</value>
</Country humor>
</item>
</items>
Perhaps you could change <value> into <region> and remove <item> with your own exporter - see Formatting Scrapy's output to XML.
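A minimal sketch of such an exporter could look like this (the class name and the <region> tag are my own choice; note that a country name such as "Country books" contains a space, which is not a valid XML tag name, so it is replaced with an underscore, and the surrounding <items> root element stays unless start_exporting/finish_exporting are overridden as well):

from scrapy.exporters import XmlItemExporter

class CountryRegionExporter(XmlItemExporter):
    # hypothetical exporter: one element per country with one <region>
    # per value, instead of the default <item>/<value> wrappers
    def export_item(self, item):
        for country, regions in dict(item).items():
            tag = country.replace(' ', '_')   # 'Country books' -> 'Country_books'
            self.xg.startElement(tag, {})
            for region in regions:
                self.xg.startElement('region', {})
                self.xg.characters(str(region))
                self.xg.endElement('region')
            self.xg.endElement(tag)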
Full working example:
#!/usr/bin/env python3

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        print('url:', response.url)
        for quote in response.css('.tag-item a'):
            country = 'Country ' + quote.css('::text').extract_first()
            url = quote.css('::attr(href)').extract_first()
            url = response.urljoin(url)
            #print('country/url:', country, url)

            # send `country` to `parse_region`
            yield scrapy.Request(url, meta={'country': country}, callback=self.parse_region)

    def parse_region(self, response):
        print('url:', response.url)
        country = response.meta['country']
        all_regions = response.css('.quote .text ::text').extract()
        #for region in all_regions:
        #    print('--- region ---')
        #    print(region)

        # create one `<country>` element with all `<region>`s
        yield {country: all_regions}

# --- it runs without a project and saves the feed in `output.xml` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in XML, CSV or JSON
    'FEED_FORMAT': 'xml',      # or 'json', 'csv'
    'FEED_URI': 'output.xml',  # or 'output.json', 'output.csv'
})
c.crawl(MySpider)
c.start()
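If you want to try the custom exporter sketched above, I assume it is enough to register it for the xml format in the same settings dict; FEED_EXPORTERS maps a feed format name to an exporter class, and since the script runs without a project the class can be referenced through __main__:

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'xml',
    'FEED_URI': 'output.xml',
    # use the CountryRegionExporter defined above instead of the default XmlItemExporter
    'FEED_EXPORTERS': {'xml': '__main__.CountryRegionExporter'},
})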