
python - Is there a way to scrape the ad URLs from SeLoger?

Reposted. Author: 行者123. Updated: 2023-12-01 07:28:28

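I'm trying to scrape the French website SeLoger. I can find and scrape all the ads and put them into JSON. The problem is that I can't find the final URL of each ad this way. The URL is in a div named "cartouche", on an element with the class c-pa-link link_AB.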


import requests
from bs4 import BeautifulSoup
import json


url = 'https://www.seloger.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&idtt=2,5&naturebien=1,2,4&ci=440109'
headers = {
    'User-Agent': '*',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}


s = requests.Session()
s.headers.update(headers)

r = s.get(url)

soup = BeautifulSoup(r.text, 'html.parser')

for script_item in soup.find_all('script'):
    if 'var ava_data' in script_item.text:
        raw_json = script_item.text.split('var ava_data = ')[1].split('};')[0] + "}"


data = json.loads(raw_json)

print(data)

I would like to add a field like this to the JSON.


{
    "url":"https://www.seloger.com/annonces/achat/appartement/nantes-44/centre-ville/144279775.htm?enterprise=0&natures=1,4&places=%5b%7bci%3a440109%7d%5d&projects=2,5&qsversion=1.0&types=1,2&bd=ListToDetail",
    "idannonce": "149546457",
    "idagence": "294918",
    "idtiers": "323172",
    "typedebien": "Appartement",
    "typedetransaction": [
        "viager"
    ],
    "idtypepublicationsourcecouplage": "SL",
    "position": "2",
    "codepostal": "44100",
    "ville": "Nantes",
    "departement": "Loire-Atlantique",
    "codeinsee": "440109",
    "produitsvisibilite": "AD:AC:BX:AW",
    "affichagetype": [
        {
            "name": "liste",
            "value": "True"
        }
    ],
    "cp": "44100",
    "etage": "0",
    "idtypechauffage": "0",
    "idtypecommerce": "0",
    "idtypecuisine": "séparée équipée",
    "naturebien": "1",
    "si_balcon": "1",
    "nb_chambres": "1",
    "nb_pieces": "2",
    "si_sdbain": "0",
    "si_sdEau": "0",
    "nb_photos": "15",
    "prix": "32180",
    "surface": "41"
}

Thanks for your help.

Best answer

You can use the zip() function to "bind" the products from the JSON data to the URLs in the web page:

import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.seloger.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&idtt=2,5&naturebien=1,2,4&ci=440109'
headers = {
    'User-Agent': '*',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

s = requests.Session()
s.headers.update(headers)

r = s.get(url)

soup = BeautifulSoup(r.text, 'html.parser')

# Find the inline script that defines ava_data and cut out the JSON object.
for script_item in soup.find_all('script'):
    if 'var ava_data' in script_item.text:
        raw_json = script_item.text.split('var ava_data = ')[1].split('};')[0] + "}"

data = json.loads(raw_json)

# Pair each ad link with the product at the same position in the JSON data.
for a, p in zip(soup.select('.c-pa-info > a'), data['products']):
    p['url'] = a['href']

print(json.dumps(data, indent=4))

Prints:

...

{
    "idannonce": "139994713",
    "idagence": "48074",
    "idtiers": "24082",
    "typedebien": "Appartement",
    "typedetransaction": [
        "vente"
    ],
    "idtypepublicationsourcecouplage": "SL9",
    "position": "16",
    "codepostal": "44000",
    "ville": "Nantes",
    "departement": "Loire-Atlantique",
    "codeinsee": "440109",
    "produitsvisibilite": "AM:AC:BB:BX:AW",
    "affichagetype": [
        {
            "name": "liste",
            "value": true
        }
    ],
    "cp": "44000",
    "etage": "0",
    "idtypechauffage": "0",
    "idtypecommerce": "0",
    "idtypecuisine": "0",
    "naturebien": "2",
    "si_balcon": "0",
    "nb_chambres": "0",
    "nb_pieces": "3",
    "si_sdbain": "0",
    "si_sdEau": "0",
    "nb_photos": "4",
    "prix": "147900",
    "surface": "63",
    "url": "https://www.selogerneuf.com/annonces/achat/appartement/nantes-44/139994713/#?cmp=INTSL_ListToDetail"
},
{
    "idannonce": "146486955",
    "idagence": "334754",

...
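
A side note on the extraction step: split('};') cuts the script text at the first "};", which would truncate the JSON if that sequence ever appeared inside a string value. Below is a minimal sketch of an alternative, assuming the script still contains "var ava_data = " followed by valid JSON. It uses json.JSONDecoder.raw_decode, which parses one JSON value from a given index and ignores whatever follows it. The helper name extract_ava_data and the sample script text are made up for illustration and do not come from the site.

```python
import json


def extract_ava_data(script_text, marker="var ava_data = "):
    """Return the object assigned to ava_data, or None if the marker is absent."""
    start = script_text.find(marker)
    if start == -1:
        return None
    pos = start + len(marker)
    # raw_decode does not skip leading whitespace, so do it by hand.
    while pos < len(script_text) and script_text[pos].isspace():
        pos += 1
    # Parses exactly one JSON value starting at pos and returns
    # (value, end_index); the trailing JavaScript is never looked at.
    value, _end = json.JSONDecoder().raw_decode(script_text, pos)
    return value


# Hypothetical script content, shaped like an inline "var ava_data = {...};".
sample = 'var ava_data = {"products": [{"idannonce": "1", "note": "a};b"}]}; var x = 1;'
data = extract_ava_data(sample)
print(data["products"][0]["note"])
```

In the script above, this would be called as extract_ava_data(script_item.text) inside the loop over the script tags.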

Note: some URLs have a different structure from

https://www.seloger.com/annonces/achat/appartement/nantes-44/centre-ville/{idannonce}.htm?ci=440109&enterprise=0&idtt=2,5&idtypebien=2,1&naturebien=1,2,4&tri=initial&bd=ListToDetail

for example

https://www.selogerneuf.com/annonces/investissement/appartement/nantes-44/146486955/#?cmp=INTSL_ListToDetail
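
Both URL shapes shown here carry the listing ID in the path, so the links could also be matched to products by idannonce instead of by position. zip() pairs items purely by order and silently stops at the shorter sequence, so a missing or extra link would shift every later pairing. The following is only a sketch under assumptions: the regular expression assumes the ID is a run of six or more digits followed by ".htm" or "/", which holds for the example URLs in this post but is not a documented rule of the site, and the helper name attach_urls is hypothetical.

```python
import re

# Assumption: the listing ID is a run of 6+ digits followed by ".htm" or "/".
ID_IN_PATH = re.compile(r"/(\d{6,})(?:\.htm|/)")


def attach_urls(products, hrefs):
    """Set p['url'] on each product by looking up its idannonce in the hrefs."""
    by_id = {}
    for href in hrefs:
        # Drop the query string and fragment so only the path is searched.
        path = href.split("?")[0].split("#")[0]
        match = ID_IN_PATH.search(path)
        if match:
            by_id[match.group(1)] = href
    for product in products:
        # Products with no matching link get None instead of a wrong URL.
        product["url"] = by_id.get(product["idannonce"])
    return products


# Links deliberately listed in a different order than the products.
hrefs = [
    "https://www.seloger.com/annonces/achat/appartement/nantes-44/centre-ville/144279775.htm?bd=ListToDetail",
    "https://www.selogerneuf.com/annonces/investissement/appartement/nantes-44/146486955/#?cmp=INTSL_ListToDetail",
]
products = [
    {"idannonce": "146486955"},
    {"idannonce": "144279775"},
    {"idannonce": "999999999"},  # hypothetical ID with no link on the page
]
attach_urls(products, hrefs)
for product in products:
    print(product["idannonce"], product["url"])
```

With the script above, hrefs would be [a['href'] for a in soup.select('.c-pa-info > a')] and products would be data['products'].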

Regarding "python - Is there a way to scrape the ad URLs from SeLoger?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57329121/
