
python-3.x - How do I extract data from a chart on a web page?

Reposted. Author: 行者123. Updated: 2023-12-04 08:58:15

I am trying to scrape chart data from this web page: "https://cawp.rutgers.edu/women-percentage-2020-candidates"

I tried to extract the data from the chart with the following code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

Res = requests.get('https://cawp.rutgers.edu/women-percentage-2020-candidates').text
soup = BeautifulSoup(Res, "html.parser")

Values = [i.text for i in soup.find_all('g', {'class': 'igc-graph'}) if i]
Dates = [i.text for i in soup.find_all('g', {'class': 'igc-legend-entry'}) if i]

print(Values, Dates)  # both lists are empty
Data = pd.DataFrame({'Value': Values, 'Date': Dates})  # returns an empty DataFrame

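I want to extract the dates and values from all 4 bar charts. Can anyone suggest what I need to do here to extract the chart data, or whether there is another approach I could try? Thanks.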

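Best answer

The chart is hosted at this URL: https://e.infogram.com/5bb50948-04b2-4113-82e6-5e5f06236538

If you look for the div with the infogram-embed class, you can find the Infogram id (the path of the target URL) directly in the original page, as the value of its data-id attribute: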

<div class="infogram-embed" data-id="5bb50948-04b2-4113-82e6-5e5f06236538" data-title="Candidate Tracker 2020_US House_Proportions" data-type="interactive"> </div>
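
As a small offline illustration of that lookup, the sketch below reads the data-id from the div shown above and builds the Infogram URL from it. It uses only the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; the names EmbedIdParser and infogram_urls are made up for this example and are not part of any library.

```python
from html.parser import HTMLParser

# The embed div quoted in the answer, used here as offline sample input.
SAMPLE_HTML = (
    '<div class="infogram-embed" '
    'data-id="5bb50948-04b2-4113-82e6-5e5f06236538" '
    'data-title="Candidate Tracker 2020_US House_Proportions" '
    'data-type="interactive"> </div>'
)


class EmbedIdParser(HTMLParser):
    """Hypothetical helper: collects data-id from every infogram-embed div."""

    def __init__(self):
        super().__init__()
        self.embed_ids = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "div" and "infogram-embed" in classes and attr_map.get("data-id"):
            self.embed_ids.append(attr_map["data-id"])


def infogram_urls(html):
    """Return one https://e.infogram.com/<id> URL per embed div found in html."""
    parser = EmbedIdParser()
    parser.feed(html)
    return [f"https://e.infogram.com/{embed_id}" for embed_id in parser.embed_ids]


urls = infogram_urls(SAMPLE_HTML)
print(urls)
```

Returning a list rather than a single URL is a choice made for this sketch, in case a page embeds more than one chart.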

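The page at that URL loads a static JSON object in its JavaScript. You can extract it with a regular expression and parse the JSON structure to get the rows/columns of the different tables: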

import json
import re

import requests
from bs4 import BeautifulSoup

original_url = "https://cawp.rutgers.edu/women-percentage-2020-candidates"
r = requests.get(original_url)
soup = BeautifulSoup(r.text, "html.parser")

# The embed div's data-id is the path of the Infogram URL
embed_id = soup.find("div", {"class": "infogram-embed"})["data-id"]
infogram_url = f"https://e.infogram.com/{embed_id}"
r = requests.get(infogram_url)
soup = BeautifulSoup(r.text, "html.parser")

# Pick the script that assigns the static JSON to window.infographicData
script = [
    t
    for t in soup.find_all("script")
    if "window.infographicData" in t.text
][0].text

extract = re.search(r".*window\.infographicData=(.*);$", script)

data = json.loads(extract.group(1))

entities = data["elements"]["content"]["content"]["entities"]

# Keep (sheet names, sheet data) for every entity that carries chart data
tables = [
    (entities[key]["props"]["chartData"]["sheetnames"], entities[key]["props"]["chartData"]["data"])
    for key in entities.keys()
    if ("props" in entities[key]) and ("chartData" in entities[key]["props"])
]

# For each sheet, pair the header row (row 0) with the first data row (row 1)
data = []
for t in tables:
    for i, sheet in enumerate(t[0]):
        data.append({
            "sheetName": sheet,
            "table": dict([(t[1][i][0][j], t[1][i][1][j]) for j in range(len(t[1][i][0]))])
        })
print(data)
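
The regular expression in the script assumes the assignment is written without spaces and that the script text ends with a semicolon. If that ever stops holding, one possible alternative (a sketch, not what the answer uses) is to find the assignment with a regex and let json.JSONDecoder.raw_decode parse exactly one JSON value from that position. The function name extract_infographic_data and the shortened SAMPLE_SCRIPT are hypothetical stand-ins; the real script body is far larger.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for the real <script> body.
SAMPLE_SCRIPT = (
    'window.infographicData='
    '{"elements":{"content":{"content":{"entities":{}}}}};'
)


def extract_infographic_data(script_text):
    """Parse the JSON value assigned to window.infographicData."""
    match = re.search(r"window\.infographicData\s*=\s*", script_text)
    if match is None:
        raise ValueError("window.infographicData assignment not found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (a trailing ';' or more JavaScript).
    payload, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return payload


payload = extract_infographic_data(SAMPLE_SCRIPT)
print(payload["elements"]["content"]["content"]["entities"])
```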

Output of the full script:

[{'sheetName': 'Sheet 1',
'table': {'': '2020', 'Districts Already Filed': '435'}},
{'sheetName': 'All',
'table': {'': 'Filed', '2016': '17.8%', '2018': '24.2%', '2020': '29.1%'}},
{'sheetName': 'Democrats Only',
'table': {'': 'Filed', '2016': '25.1%', '2018': '32.5%', '2020': '37.9%'}},
{'sheetName': 'Republicans Only',
'table': {'': 'Filed', '2016': '11.5%', '2018': '13.7%', '2020': '21.3%'}}]
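
Since the question asks for dates and values, the printed structure may still need flattening. The sketch below takes the output shown above as literal input and turns the percentage sheets into one record per sheet and year, with the percentage as a float. The helper names parse_percent and to_records are made up for this example, and treating every key of a percentage sheet as a year is an assumption based only on this output.

```python
# The output printed by the answer's script, copied here as sample input.
tables = [
    {"sheetName": "Sheet 1",
     "table": {"": "2020", "Districts Already Filed": "435"}},
    {"sheetName": "All",
     "table": {"": "Filed", "2016": "17.8%", "2018": "24.2%", "2020": "29.1%"}},
    {"sheetName": "Democrats Only",
     "table": {"": "Filed", "2016": "25.1%", "2018": "32.5%", "2020": "37.9%"}},
    {"sheetName": "Republicans Only",
     "table": {"": "Filed", "2016": "11.5%", "2018": "13.7%", "2020": "21.3%"}},
]


def parse_percent(text):
    """Turn a string such as '17.8%' into the float 17.8."""
    return float(text.strip().rstrip("%"))


def to_records(sheets):
    """Flatten the percentage sheets into {sheet, year, percent} records."""
    records = []
    for entry in sheets:
        for label, value in entry["table"].items():
            # Skip the row-label column ('') and non-percentage cells.
            if label == "" or not value.endswith("%"):
                continue
            records.append({
                "sheet": entry["sheetName"],
                "year": int(label),  # assumption: header cells are years
                "percent": parse_percent(value),
            })
    return records


records = to_records(tables)
for record in records:
    print(record)
```

If pandas is available, passing the list to pd.DataFrame(records) should give the kind of table the question was trying to build.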

Regarding "python-3.x - How do I extract data from a chart on a web page?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/63700789/
