python - Scraping a table from a page with BeautifulSoup, table not found

Reposted. Author: 行者123. Updated: 2023-11-28 21:43:22

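I have been trying to scrape the table from the page at the URL in the code below, but it seems to me that BeautifulSoup does not find any tables.

I wrote: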

import requests
import pandas as pd
from bs4 import BeautifulSoup
import csv

url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r=requests.get(url)
data=r.text

soup=BeautifulSoup(data,'xml')
table=soup.find_all('table')
print table #prints nothing..

Based on other similar questions, I assume the HTML is broken in some way, but I am not an expert. I could not find an answer in these questions: (Beautiful soup missing some html table tags), (Extracting a table from a website), (Scraping a table using BeautifulSoup), or even (Python+BeautifulSoup: scraping a particular table from a webpage).

Thanks a lot!

Best answer

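You are parsing HTML, but you are using the XML parser.
You should use soup=BeautifulSoup(data,"html.parser")
The data you need is inside a script tag; the page does not actually contain a table tag. So you need to look for the text inside the script tags.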
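Note: the parser name passed to BeautifulSoup is "html.parser" on both Python 2.x and Python 3.x. "HTMLParser" is only the name of the underlying standard-library module in Python 2.x, not a value that BeautifulSoup accepts.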
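
To check for yourself that a page has no table tag while the data sits in a script tag, you can count the tags in the downloaded source. The following is a minimal sketch that swaps BeautifulSoup for the standard library's html.parser.HTMLParser so that it runs without extra packages. The TagCounter class and the sample_html string are hypothetical and only stand in for the real page source (r.text); the real page may be structured differently.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags and collect the text inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.counts = {}          # tag name -> number of opening tags seen
        self.script_texts = []    # one string per <script> element
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1
        if tag == "script":
            self._in_script = True
            self.script_texts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.script_texts[-1] += data


# Made-up sample standing in for the real page source (r.text in the answer).
sample_html = """
<html><body>
<div id="content"></div>
<script>var rows = [{"School Name":"Example College","Rank":"1"}];</script>
</body></html>
"""

counter = TagCounter()
counter.feed(sample_html)
counter.close()

print(counter.counts.get("table", 0))  # number of <table> tags found
print(len(counter.script_texts))       # number of <script> tags found
```

If the table count is zero and one of the script texts contains the field names you see in the browser, the page builds its table with JavaScript, and the data has to be read from the script text.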

Here is the code.

import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r=requests.get(url)
data=r.text

soup=BeautifulSoup(data,"html.parser")
scripts = soup.find_all("script")

file_name = open("table.csv","w",newline="")
writer = csv.writer(file_name)
list_to_write = []

# Header row of the CSV file
list_to_write.append(["Rank","School Name","School Type","Early Career Median Pay","Mid-Career Median Pay","% High Job Meaning","% STEM"])

for script in scripts:
    text = script.text
    start = 0
    end = 0
    # Only the large script block holds the table data
    if(len(text) > 10000):
        while(start > -1):
            # Each record starts at the next "School Name" key
            start = text.find('"School Name":"',start)
            if(start == -1):
                break
            start += len('"School Name":"')
            end = text.find('"',start)
            school_name = text[start:end]

            start = text.find('"Early Career Median Pay":"',start)
            start += len('"Early Career Median Pay":"')
            end = text.find('"',start)
            early_pay = text[start:end]

            start = text.find('"Mid-Career Median Pay":"',start)
            start += len('"Mid-Career Median Pay":"')
            end = text.find('"',start)
            mid_pay = text[start:end]

            start = text.find('"Rank":"',start)
            start += len('"Rank":"')
            end = text.find('"',start)
            rank = text[start:end]

            start = text.find('"% High Job Meaning":"',start)
            start += len('"% High Job Meaning":"')
            end = text.find('"',start)
            high_job = text[start:end]

            start = text.find('"School Type":"',start)
            start += len('"School Type":"')
            end = text.find('"',start)
            school_type = text[start:end]

            start = text.find('"% STEM":"',start)
            start += len('"% STEM":"')
            end = text.find('"',start)
            stem = text[start:end]

            list_to_write.append([rank,school_name,school_type,early_pay,mid_pay,high_job,stem])

# Write all collected rows, then close the file
writer.writerows(list_to_write)
file_name.close()
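
The seven near-identical find blocks above all follow one pattern: locate '"key":"', then read up to the next double quote. As a possible simplification, the sketch below factors that pattern into a helper. The names extract_field and parse_rows are hypothetical, and the sample text is made up; like the code above, it assumes that every record lists its keys in the same order and that values contain no escaped double quotes.

```python
def extract_field(text, key, start=0):
    """Return (value, index of the closing quote) for the next '"key":"value"' pair.

    Returns (None, -1) when the key does not occur at or after `start`.
    """
    marker = '"' + key + '":"'
    begin = text.find(marker, start)
    if begin == -1:
        return None, -1
    begin += len(marker)
    end = text.find('"', begin)
    if end == -1:
        return None, -1
    return text[begin:end], end


def parse_rows(text, keys):
    """Collect one list of values per record, reading the keys in the given order."""
    rows = []
    pos = 0
    while True:
        row = []
        for key in keys:
            value, pos = extract_field(text, key, pos)
            if value is None:
                return rows  # no further complete record in the text
            row.append(value)
        rows.append(row)


# Made-up sample text; the real script content may be laid out differently.
sample = (
    '[{"School Name":"Example College","Rank":"1"},'
    '{"School Name":"Sample University","Rank":"2"}]'
)
print(parse_rows(sample, ["School Name", "Rank"]))
```

Unlike the loop above, this version stops as soon as any expected key is missing, instead of continuing with a position of -1.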

This will generate the table you need as a CSV file. Do not forget to close the file when you are done.
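
One way to avoid forgetting the close call is to open the file in a with block, which closes it automatically, even if an error occurs while writing. The sketch below illustrates this with made-up rows; the function name write_table and the temporary output path are assumptions for the example, not part of the original answer.

```python
import csv
import os
import tempfile


def write_table(path, header, rows):
    """Write a header row followed by data rows; the file closes on leaving the block."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


header = ["Rank", "School Name"]
rows = [["1", "Example College"], ["2", "Sample University"]]  # made-up rows

# Hypothetical output location in the system's temporary directory.
out_path = os.path.join(tempfile.gettempdir(), "table_example.csv")
write_table(out_path, header, rows)

# Read the file back to confirm what was written.
with open(out_path, newline="", encoding="utf-8") as f:
    written = list(csv.reader(f))
print(written)
```

Passing newline="" when opening the file, as the answer's code also does, lets the csv module control line endings and avoids blank lines between rows on Windows.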

Regarding "python - Scraping a table from a page with BeautifulSoup, table not found", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/42310252/
