
Python 3: how to scrape search results from a website using CSRF?


I'm trying to scrape the search results from a website that lists French crowdlending fintech companies: https://www.orias.fr/web/guest/search

Doing this manually, I select (IFP) among the radio buttons, and it gives me 13 pages of results with 10 results per page. Each result has a hyperlink, and I would also like to get the information from the final table.

My main problem seems to come from CSRF: the results URL contains p_auth=8mxk0SsK, so I can't simply loop through the result pages by changing "p=2" up to "p=13" in the link: https://www.orias.fr/search?p_auth=8mxk0SsK&p_p_id=intermediaryDetailedSearch_WAR_oriasportlet&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_intermediaryDetailedSearch_WAR_oriasportlet_myaction=fullSearch
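
For reference, p_auth appears to be a per-session Liferay portal token, so one possible workaround is to recover it from an initial GET before looping. Below is a minimal sketch, assuming (unverified) that the token shows up somewhere in the landing page's HTML; the regex and variable names are illustrative only:

import re
import requests

with requests.Session() as s:
    landing = s.get('https://www.orias.fr/web/guest/search')
    # Look for a p_auth token anywhere in the returned HTML
    # (assumption: the search form's action URL carries it).
    match = re.search(r'p_auth=([0-9A-Za-z]+)', landing.text)
    p_auth = match.group(1) if match else None
    print(p_auth)  # e.g. '8mxk0SsK'; the value changes per session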

If I try manually with a VPN, the site's address becomes "stable": https://www.orias.fr/search?p_p_id=intermediaryDetailedSearch_WAR_oriasportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_intermediaryDetailedSearch_WAR_oriasportlet_d-16544-p=1&_intermediaryDetailedSearch_WAR_oriasportlet_implicitModel=true&_intermediaryDetailedSearch_WAR_oriasportlet_spring_render=searchResult

So I tried to use it in my Python code:

import requests
from bs4 import BeautifulSoup

k = 1  # test k from 1 to 13

proxies = {}  # proxy settings used in the original setup, e.g. {"https": "..."}

url = "http://www.orias.fr/search?p_p_id=intermediaryDetailedSearch_WAR_oriasportlet&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_intermediaryDetailedSearch_WAR_oriasportlet_d-16544-p=" + str(k) + "&_intermediaryDetailedSearch_WAR_oriasportlet_implicitModel=true&_intermediaryDetailedSearch_WAR_oriasportlet_spring_render=searchResult"
response = requests.get(url, proxies=proxies)  # 200 meant it went through
soup = BeautifulSoup(response.text, "html.parser")

table = soup.find('table', attrs={'class': 'table table-condensed table-striped table-bordered'})
table_rows = table.find_all('tr')

l = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.text for cell in td]
    l.append(row)

This doesn't work the way it does in a web browser; it just returns a page as if no search had been submitted. Do you know how to make it work?

Best answer

I would change the page parameter in the POST request during the loop, doing an initial request first to find out the number of pages.

from bs4 import BeautifulSoup as bs
import requests, re, math
import pandas as pd

headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Referer': 'https://www.orias.fr/web/guest/search'
}

# Query-string parameters; params[6] is the pagination parameter the loop rewrites.
params = [['p_p_id', 'intermediaryDetailedSearch_WAR_oriasportlet'],
          ['p_p_lifecycle', '0'],
          ['p_p_state', 'normal'],
          ['p_p_mode', 'view'],
          ['p_p_col_id', 'column-1'],
          ['p_p_col_count', '1'],
          ['_intermediaryDetailedSearch_WAR_oriasportlet_d-16544-p', '1'],
          ['_intermediaryDetailedSearch_WAR_oriasportlet_implicitModel', 'true'],
          ['_intermediaryDetailedSearch_WAR_oriasportlet_spring_render', 'searchResult']]

# Form body of the search POST; 'ifp': 'true' selects the IFP radio button.
data = {
    'searchString': '',
    'address': '',
    'zipCodeOrCity': '',
    '_coa': 'on',
    '_aga': 'on',
    '_ma': 'on',
    '_mia': 'on',
    '_euIAS': 'on',
    'mandatorDenomination': '',
    'wantsMandator': 'no',
    '_cobsp': 'on',
    '_mobspl': 'on',
    '_mobsp': 'on',
    '_miobsp': 'on',
    '_bankActivities': '1',
    '_euIOBSP': 'on',
    '_cif': 'on',
    '_alpsi': 'on',
    '_cip': 'on',
    'ifp': 'true',
    '_ifp': 'on',
    'submit': 'Search'
}

# Extracts the total hit count, e.g. "245 intermediaries found", from the results page.
p = re.compile(r'(\d+)\s+intermediaries found')

with requests.Session() as s:
    # Initial request: fetch page 1 and work out how many pages there are.
    r = s.post('https://www.orias.fr/search', headers=headers, params=params, data=data)
    soup = bs(r.content, 'lxml')
    num_results = int(p.findall(r.text)[0])
    results_per_page = 20
    num_pages = math.ceil(num_results / results_per_page)
    df = pd.read_html(str(soup.select_one('.table')))[0]

    # Remaining pages: only the pagination parameter changes between requests.
    for i in range(2, num_pages + 1):
        params[6][1] = str(i)
        r = s.post('https://www.orias.fr/search', headers=headers, params=params, data=data)
        soup = bs(r.content, 'lxml')
        df_next = pd.read_html(str(soup.select_one('.table')))[0]
        df = pd.concat([df, df_next])

df.drop('Unnamed: 6', axis=1, inplace=True)  # drop the empty trailing column
df = df.reset_index(drop=True)

Check:

print(len(df['Siren Number'].unique()))
#245
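
If you want to keep the combined results, you can write the final DataFrame out; a small sketch (the file name is just an example):

df.to_csv('orias_ifp_results.csv', index=False)  # example output path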

Regarding "Python 3: how to scrape search results from a website using CSRF?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57536576/
