
python - I want to read from a file and scrape each URL

Reposted · Author: 行者123 · Updated: 2023-12-01 08:22:07

What I want to do is read each URL from a file and scrape it. After that, I move the scraped data into the class WebRealTor, then serialize the data to JSON, and finally save all the data in a single JSON file. This is the content of the file:

https://www.seloger.com/annonces/achat/appartement/paris-14eme-75/montsouris-dareau/143580615.htm?ci=750114&idtt=2,5&idtypebien=2,1&LISTING-LISTpg=8&naturebien=1,2,4&tri=initial&bd=ListToDetail
https://www.seloger.com/annonces/achat/appartement/montpellier-34/gambetta/137987697.htm?ci=340172&idtt=2,5&idtypebien=1,2&naturebien=1,2,4&tri=initial&bd=ListToDetail
https://www.seloger.com/annonces/achat/appartement/montpellier-34/celleneuve/142626025.htm?ci=340172&idtt=2,5&idtypebien=1,2&naturebien=1,2,4&tri=initial&bd=ListToDetail
https://www.seloger.com/annonces/achat/appartement/versailles-78/domaine-national-du-chateau/138291887.htm

My script is:

import scrapy
import json


class selogerSpider(scrapy.Spider):
    name = "realtor"

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
        }
    }

    def start_requests(self):
        with open("annonces.txt", "r") as file:
            for line in file.readlines():
                yield scrapy.Request(line)

    def parse(self, response):
        name = response.css(".agence-link::text").extract_first()
        address = response.css(".agence-adresse::text").extract_first()

        XPATH_siren = ".//div[@class='legalNoticeAgency']//p/text()"
        siren = response.xpath(XPATH_siren).extract_first()

        XPATH_website = ".//div[@class='agence-links']//a/@href"
        site = response.xpath(XPATH_website).extract()

        XPATH_phone = ".//div[@class='contact g-row-50']//div[@class='g-col g-50 u-pad-0']//button[@class='btn-phone b-btn b-second fi fi-phone tagClick']/@data-phone"
        phone = response.xpath(XPATH_phone).extract_first()

        yield {
            'Agency_Name =': name,
            'Agency_Address =': address,
            'Agency_profile_website =': site,
            'Agency_number =': phone,
            'Agency_siren =': siren
        }

        file.close()


class WebRealTor:

    def __name__(self):
        self.nom = selogerSpider.name

    def __address__(self):
        self.adress = selogerSpider.address

    def __sirenn__(self):
        self.sire = selogerSpider.siren

    def __numero__(self):
        self.numero = selogerSpider.phone


with open('data.txt', 'w') as outfile:
    json.dump(data, outfile)

Best Answer

Try moving everything into start_requests inside the class, like this:

def start_requests(self):
    with open("annonces.txt", "r") as file:
        for line in file.readlines():
            # strip the trailing newline; self.parse is the default callback
            yield scrapy.Request(line.strip())

def parse(self, response):
    # parse each link as you already did
    ...
Regarding "python - I want to read from a file and scrape each URL", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54569525/
