python - Scrapy not collecting emails correctly

I'm using Scrapy to collect some data, and everything works fine except the email extraction part. For some reason the email column in the .csv file is empty, or only a few emails are extracted. I've tried limiting download_delay and CLOSESPIDER_ITEMCOUNT, but it doesn't help. Any help is greatly appreciated.
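
For reference, this is roughly how I applied those settings on the spider shown below; the numbers are placeholders rather than the exact values I used:

    # placeholder values; set as a class attribute on the spider
    custom_settings = {
        "DOWNLOAD_DELAY": 2,           # seconds to wait between requests
        "CLOSESPIDER_ITEMCOUNT": 50,   # stop the crawl after this many items
    }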

import re
import scrapy


class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    attr = scrapy.Field()
    title = scrapy.Field()
    tag = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["hanford.craigslist.org"]
    start_urls = [
        "http://hanford.craigslist.org/search/cto?min_auto_year=1980&min_price=3000"
    ]

    BASE_URL = 'http://hanford.craigslist.org/'

    def parse(self, response):
        links = response.xpath('//a[@class="hdrlnk"]/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback=self.parse_attr)

    def parse_attr(self, response):
        match = re.search(r"(\w+)\.html", response.url)
        if match:
            item_id = match.group(1)
            url = self.BASE_URL + "reply/sdo/cto/" + item_id

            item = DmozItem()
            item["link"] = response.url
            item["title"] = "".join(response.xpath("//span[@class='postingtitletext']//text()").extract())
            item["tag"] = "".join(response.xpath("//p[@class='attrgroup']/span/b/text()").extract()[0])
            return scrapy.Request(url, meta={'item': item}, callback=self.parse_contact)

    def parse_contact(self, response):
        item = response.meta['item']
        item["attr"] = "".join(response.xpath("//div[@class='anonemail']//text()").extract())
        return item

Best Answer

First of all, a word of warning, quoting the Terms of Use:

USE. You agree not to use or provide software (except for general purpose web browsers and email clients, or software expressly licensed by us) or services that interact or interoperate with CL, e.g. for downloading, uploading, posting, flagging, emailing, search, or mobile use. Robots, spiders, scripts, scrapers, crawlers, etc. are prohibited, as are misleading, unsolicited, unlawful, and/or spam postings/email. You agree not to collect users' personal and/or contact information ("PI").

A few issues that need to be fixed here:

  • The contact information is located under reply/hnf/cto/, not reply/sdo/cto/
  • Specify the User-Agent and X-Requested-With headers

The complete code that works for me:

import re
from urlparse import urljoin

import scrapy


class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    attr = scrapy.Field()
    title = scrapy.Field()
    tag = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["hanford.craigslist.org"]
    start_urls = [
        "http://hanford.craigslist.org/search/cto?min_auto_year=1980&min_price=3000"
    ]

    BASE_URL = 'http://hanford.craigslist.org/'

    def parse(self, response):
        # collect the listing links from the search results page
        links = response.xpath('//a[@class="hdrlnk"]/@href').extract()
        for link in links:
            absolute_url = urljoin(self.BASE_URL, link)
            yield scrapy.Request(absolute_url,
                                 callback=self.parse_attr)

    def parse_attr(self, response):
        match = re.search(r"(\w+)\.html", response.url)
        if match:
            item_id = match.group(1)
            # the contact info is served from "reply/hnf/cto/", not "reply/sdo/cto/"
            url = urljoin(self.BASE_URL, "reply/hnf/cto/" + item_id)

            item = DmozItem()
            item["link"] = response.url
            item["title"] = "".join(response.xpath("//span[@class='postingtitletext']//text()").extract())
            item["tag"] = "".join(response.xpath("//p[@class='attrgroup']/span/b/text()").extract()[0])
            # request the reply endpoint with the headers it expects
            return scrapy.Request(url,
                                  meta={'item': item},
                                  callback=self.parse_contact,
                                  headers={"X-Requested-With": "XMLHttpRequest",
                                           "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.132 Safari/537.36"})

    def parse_contact(self, response):
        item = response.meta['item']
        item["attr"] = "".join(response.xpath("//div[@class='anonemail']//text()").extract())
        return item
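
Not part of the original answer, but for completeness: assuming the spider is saved in a standalone file (the filename below is arbitrary), it can be run without a full Scrapy project and its items, including the attr email field, exported to CSV via Scrapy's built-in feed export:

    # run the standalone spider and write the scraped items to a CSV file
    scrapy runspider dmoz_spider.py -o results.csv

Note that the code above targets Python 2; on Python 3 the import becomes from urllib.parse import urljoin.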

Regarding python - Scrapy not collecting emails correctly, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31318189/
