python - Scraping all text between <table>TABLE I NEED</table> in Python

Reposted · Author: 行者123 · Updated: 2023-12-04 07:19:12

I am trying to scrape CoVid data from the URL below, at WorldOMeter. On this page there is a table with the id main_table_countries_today, containing the 15x225 (3,375) data cells I want to collect.
I have tried several approaches, but let me share the attempt I think came closest:

import requests
from os import system

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'
table_end = '</table>'

# Declare an empty list to fill with lines of text
all_lines = list()


# Refreshes the Terminal Emulator window
def clear_screen():

    def bash_input(user_in):
        _ = system(user_in)

    bash_input('clear')


# This bot searches for <table> and </table> to start/stop recording data
class Bot:

    def __init__(self,
                 line_added=False,
                 looking_for_start=True,
                 looking_for_end=False):

        self.line_adding = line_added
        self.looking_for_start = looking_for_start
        self.looking_for_end = looking_for_end

    def set_line_adding(self, flag):
        self.line_adding = flag

    def set_start_look(self, flag):
        self.looking_for_start = flag

    def set_end_look(self, flag):
        self.looking_for_end = flag


if __name__ == '__main__':

    # Start with a fresh Terminal emulator
    clear_screen()

    my_bot = Bot()

    r = requests.get(url).text
    all_r = r.split('\n')

    for rs in all_r:

        if my_bot.looking_for_start and table_id in rs:
            my_bot.set_line_adding(True)
            my_bot.set_end_look(True)
            my_bot.set_start_look(False)

        if my_bot.looking_for_end and table_end in rs:
            my_bot.set_line_adding(False)
            my_bot.set_end_look(False)

        if my_bot.line_adding:
            all_lines.append(rs)

    for lines in all_lines:
        print(lines)

    print('\n\n\n\n')
    print(len(all_lines))
This prints 6,551 lines, more than double what I need. Normally that would be fine, since the next step would be cleaning out the lines irrelevant to my data; however, it does not produce the whole table. An earlier attempt using BeautifulSoup (a very similar process) also failed to start and stop at the table above. It looked like this:
from bs4 import BeautifulSoup
import requests
from os import system

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'
table_end = '</table>'

# Declare an empty list to fill with lines of text
all_lines = list()


if __name__ == '__main__':

    # Here we go, again...
    _ = system('clear')

    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')
    my_table = soup.find_all('table', {'id': table_id})

    for current_line in my_table:

        page_lines = str(current_line).split('\n')

        for line in page_lines:
            all_lines.append(line)

    for line in all_lines:
        print(line)

    print('\n\n')
    print(len(all_lines))

That produced 5,547 lines.
I have also tried Pandas and Selenium, but I have since scrapped that code. By showing my two "best" attempts, I'm hoping someone can spot something obvious I'm missing.
I'd be happy just to get the data on screen. Ultimately I'm trying to turn this data into a dictionary (to be exported as a .json file), like this:
data = {
    "Country": [country for country in countries],
    "Total Cases": [case for case in total_cases],
    "New Cases": [case for case in new_cases],
    "Total Deaths": [death for death in total_deaths],
    "New Deaths": [death for death in new_deaths],
    "Total Recovered": [death for death in total_recovered],
    "New Recovered": [death for death in new_recovered],
    "Active Cases": [case for case in active_cases],
    "Serious/Critical": [case for case in serious_critical],
    "Total Cases/1M pop": [case for case in total_case_per_million],
    "Deaths/1M pop": [death for death in deaths_per_million],
    "Total Tests": [test for test in total_tests],
    "Tests/1M pop": [test for test in tests_per_million],
    "Population": [population for population in populations]
}
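For reference (this is not part of the original question), a dictionary in that shape can be written out as a .json file with the standard library's json module; the sample lists below are hypothetical stand-ins for the scraped columns:

```python
import json

# Hypothetical sample data standing in for the scraped columns.
countries = ["USA", "India"]
total_cases = ["35,745,024", "31,693,625"]

data = {
    "Country": list(countries),
    "Total Cases": list(total_cases),
}

# Write the dictionary out as a .json file.
with open("covid_data.json", "w") as f:
    json.dump(data, f, indent=4)
```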
Any suggestions?

Best Answer

The table contains a lot of other information. You can take the first 15 <td> cells from each row and strip the first 8 / last 8 rows:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://www.worldometers.info/coronavirus/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for tr in soup.select("#main_table_countries_today tr:has(td)")[8:-8]:
    tds = [td.get_text(strip=True) for td in tr.select("td")][:15]
    all_data.append(tds)

df = pd.DataFrame(
    all_data,
    columns=[
        "#",
        "Country",
        "Total Cases",
        "New Cases",
        "Total Deaths",
        "New Deaths",
        "Total Recovered",
        "New Recovered",
        "Active Cases",
        "Serious, Critical",
        "Tot Cases/1M pop",
        "Deaths/1M pop",
        "Total Tests",
        "Tests/1M pop",
        "Population",
    ],
)
print(df)
Prints:
       #                 Country Total Cases New Cases Total Deaths New Deaths Total Recovered New Recovered Active Cases Serious, Critical Tot Cases/1M pop Deaths/1M pop  Total Tests Tests/1M pop     Population
0 1 USA 35,745,024 629,315 29,666,117 5,449,592 11,516 107,311 1,889 529,679,820 1,590,160 333,098,437
1 2 India 31,693,625 +39,041 424,777 +393 30,846,509 +33,636 422,339 8,944 22,725 305 468,216,510 335,725 1,394,642,466
2 3 Brazil 19,917,855 556,437 18,619,542 741,876 8,318 92,991 2,598 55,034,721 256,943 214,190,490
3 4 Russia 6,288,677 +22,804 159,352 +789 5,625,890 +17,271 503,435 2,300 43,073 1,091 165,800,000 1,135,600 146,002,094

...

218 219 Samoa 3 3 0 15 199,837
219 220 Saint Helena 2 2 0 328 6,097
220 221 Micronesia 1 1 0 9 116,324
221 222 China 93,005 +75 4,636 87,347 +24 1,022 25 65 3 160,000,000 111,163 1,439,323,776
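Once the DataFrame exists, the question's target shape (one list per column) falls out directly from pandas' to_dict(orient="list"); json.dump then writes it to disk. A minimal sketch, using a two-row hypothetical stand-in for the scraped DataFrame:

```python
import json

import pandas as pd

# A two-row stand-in for the DataFrame built above.
df = pd.DataFrame(
    {"Country": ["USA", "India"], "Total Cases": ["35,745,024", "31,693,625"]}
)

# Column-oriented dict matching the questioner's target shape:
# {"Country": [...], "Total Cases": [...], ...}
data = df.to_dict(orient="list")

# Export to a .json file with the standard library.
with open("covid_data.json", "w") as f:
    json.dump(data, f, indent=4)
```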

Regarding "python - Scraping all text between <table>TABLE I NEED</table> in Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68612714/
