python - Unable to fetch tabular content from a webpage using requests

Reposted. Author: 行者123. Updated: 2023-12-04 01:12:52

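I've created a script using the requests library to fetch the tabular content available on a webpage. When I visit the site manually using this link, I land on a page where I have to click the AGREE button before I can see the tabular content.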

Here is the website link again.

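I watched the requests closely in the Network section of Chrome dev tools and tried to mimic them with the script below. However, all I get is the following, whereas according to dev tools I should get the tabular content in some JSON-like format.

Output I'm getting: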

b'\n\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\n\n\n{}'

Expected output (truncated):

{T:{"Columns":[{"tradeQuantity":"1125000","quantityAsString":"1125000",

I've tried with:

import json
import requests

start_url = 'https://finra-markets.morningstar.com/BondCenter/BondTradeActivitySearchResult.jsp?'
link = 'https://finra-markets.morningstar.com/bondSearch.jsp'

qsp = {
    'ticker': 'C679131',
    'startdate': '10/03/2019',
    'enddate': '10/03/2020'
}

payload = {
    'postData': {'Keywords':[]},
    'ticker': 'C679131',
    'startDate': '',
    'endDate': '',
    'showResultsAs': 'B',
    'debtOrAssetClass': '',
    'spdsType': ''
}

params = {
    'count': '20',
    'sortfield': 'tradeDate',
    'sorttype': '2',
    'start': '0',
    'searchtype': 'T',
    'query': {"Keywords":[{"Name":"securityId","Value":"C679131"},{"Name":"tradeDate","minValue":"10/03/2019","maxValue":"10/03/2020"}]}
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'
    s.headers['Referer'] = 'https://finra-markets.morningstar.com/BondCenter/UserAgreement.jsp'
    r = s.post(start_url,params=qsp,data=payload)
    s.headers['Referer'] = 'https://finra-markets.morningstar.com/BondCenter/BondTradeActivitySearchResult.jsp?ticker=C679131&startdate=10%2F03%2F2019&enddate=10%2F03%2F2020'
    s.headers['X-Requested-With'] = 'XMLHttpRequest'
    r = s.post(link,json=params)
    print(r.status_code)
    print(r.content)

How can I get the tabular content from that webpage using requests?

Best answer

You need to make the following call:

POST https://finra-markets.morningstar.com/finralogin.jsp

while using requests.Session() to store the cookies. In addition, the call:

POST https://finra-markets.morningstar.com/bondSearch.jsp

requires the Referer header.
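
To make the Referer value concrete, here is a minimal standard-library sketch of how the results-page URL can be assembled from the query parameters. The helper name result_page is hypothetical and exists only for this illustration; the host, path, and parameter values are taken from the question. The answer's code sends the relative form as redirectPage to finralogin.jsp and the absolute form as the Referer header.

```python
from urllib.parse import urlencode

HOST = "https://finra-markets.morningstar.com"
PATH = "/BondCenter/BondTradeActivitySearchResult.jsp"


def result_page(ticker, startdate, enddate):
    """Return the (relative, absolute) URL of the results page.

    Hypothetical helper: it only builds strings and sends no request.
    """
    # urlencode percent-encodes the slashes in the dates (10/03/2019 -> 10%2F03%2F2019).
    query = urlencode({"ticker": ticker, "startdate": startdate, "enddate": enddate})
    relative = f"{PATH}?{query}"
    return relative, f"{HOST}{relative}"


relative, absolute = result_page("C679131", "10/03/2019", "10/03/2020")
print(relative)
print(absolute)
```

The absolute URL printed here is the same string that the question's script hard-codes as its second Referer header.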

After that, the result is not exactly JSON, as baduker pointed out; you can use a regex to reformat it:

import requests
from urllib import parse
import json
import re
import pandas as pd

host = "https://finra-markets.morningstar.com"
path = "/BondCenter/BondTradeActivitySearchResult.jsp"

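qsp = {
    'ticker': 'C679131',
    'startdate': '10/03/2019',
    'enddate': '10/03/2020'
}
# The session stores the cookies set by the first call and reuses them in the second.
s = requests.Session()

# First call: finralogin.jsp, with the relative results-page URL as redirectPage.
s.post("https://finra-markets.morningstar.com/finralogin.jsp",
    data = {
        "redirectPage": f"{path}?{parse.urlencode(qsp)}"
    }
)
# Second call: the search itself, which requires the Referer header.
r = s.post("https://finra-markets.morningstar.com/bondSearch.jsp",
    headers= {
        "Referer": f"{host}{path}?{parse.urlencode(qsp)}",
    },
    data = {
        "count": 20,
        "sortfield": "tradeDate",
        "sorttype": 2,
        "start": 0,
        "searchtype": "T",
        # The query field is sent as a JSON string inside the form data.
        "query": json.dumps({
            "Keywords":[
                {"Name":"securityId","Value": qsp["ticker"]},
                {"Name":"tradeDate","minValue": qsp["startdate"],"maxValue":qsp["enddate"]}
            ]
        })
    })

# The body looks like {T:{...}}, which is not valid JSON; keep only the inner object.
dataReg = re.search('{T:(.*)}', r.text, re.MULTILINE)
data = json.loads(dataReg.group(1))

df = pd.DataFrame(data["Columns"])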

print(df)


Output:

   tradeQuantity quantityAsString timeOfExecution settlementDate tradeModifier secondModifier specialPriceIndicator  ...  tradeDate symbol cusip callable commissionIndicator ATSIndicator remuneration
0 1125000 1125000 11:46:02 10/2/2020 _ _ - ... 10/2/2020 None None None N N
1 60000 60000 10:23:55 10/5/2020 _ _ - ... 10/1/2020 None None None N N
2 60000 60000 10:23:54 10/5/2020 _ _ - ... 10/1/2020 None None None M M
3 200000 200000 16:27:43 10/2/2020 _ _ - ... 9/30/2020 None None None
4 200000 200000 16:27:43 10/2/2020 _ _ - ... 9/30/2020 None None None N N
5 2900000 2900000 15:39:16 10/2/2020 _ _ - ... 9/30/2020 None None None M M
6 20000 20000 12:24:48 10/2/2020 _ _ - ... 9/30/2020 None None None M M
.........
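
The regex step can be tried without any network access. The sketch below is an illustration under assumptions: the sample string is made up to resemble the truncated expected output shown in the question, and the function name extract_payload is hypothetical. It uses an escaped pattern with re.DOTALL so the match also works if the payload spans several lines, and it returns None for a body that carries no {T:...} wrapper, such as the empty {} the question's script received.

```python
import json
import re

# Made-up sample imitating the response shape from the question
# (leading whitespace, then a {T:...} wrapper around a JSON object).
sample = '\n\r\n{T:{"Columns":[{"tradeQuantity":"1125000","quantityAsString":"1125000"}]}}\n'


def extract_payload(text):
    """Return the JSON object wrapped in {T:...}, or None if the wrapper is absent."""
    # Greedy match up to the last closing brace; DOTALL lets '.' cross newlines.
    match = re.search(r"\{T:(.*)\}", text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


payload = extract_payload(sample)
print(payload["Columns"][0]["tradeQuantity"])
print(extract_payload("\n\n\r\n{}"))
```

Checking for None before calling json.loads gives a clearer failure when the earlier calls did not succeed and the server answers with an empty object.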

In the Chrome developer console, in the Network tab, you can right-click and choose "headers options/Set Cookies" to quickly spot the calls that set cookies.

Regarding "python - Unable to fetch tabular content from a webpage using requests", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64268498/
