
python - How to retrieve URLs from a list of weblinks, and the data within those URLs

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:30:58

"Hello, I'm fairly new to web scraping. I recently retrieved a list of weblinks, and the URLs in those links contain data in tables. I plan to scrape that data, but I can't seem to get the URLs. Any help is greatly appreciated."

"The list of weblinks is:

https://aviation-safety.net/database/dblist.php?Year=1919

https://aviation-safety.net/database/dblist.php?Year=1920

https://aviation-safety.net/database/dblist.php?Year=1921

https://aviation-safety.net/database/dblist.php?Year=1922

https://aviation-safety.net/database/dblist.php?Year=2019

"From the list of links, I plan to:

a. Get the URLs within these links:

https://aviation-safety.net/database/record.php?id=19190802-0

https://aviation-safety.net/database/record.php?id=19190811-0

https://aviation-safety.net/database/record.php?id=19200223-0

"b. Get the data from the table inside each URL (e.g. incident date, incident time, type, operator, registration, MSN, first flight, classification)"

    #Get the list of weblinks

    import numpy as np
    import pandas as pd
    from bs4 import BeautifulSoup
    import requests

    headers = {'User-Agent': 'insert user agent'}

    #start of code

    mainurl = "https://aviation-safety.net/database/"

    def getAndParseURL(mainurl):
        result = requests.get(mainurl)
        soup = BeautifulSoup(result.content, 'html.parser')
        datatable = soup.find_all('a', href=True)
        return datatable

    datatable = getAndParseURL(mainurl)

    #go through the content and grab the URLs
    links = []
    for link in datatable:
        if 'Year' in link['href']:
            url = link['href']
            links.append(mainurl + url)

    #check if links are in dataframe
    df = pd.DataFrame(links, columns=['url'])
    df.head(10)

    #save the links to a csv
    df.to_csv('aviationsafetyyearlinks.csv')


    #from the csv read each web-link and get URLs within each link

    import csv
    from urllib.request import urlopen

    contents = []
    df = pd.read_csv('aviationsafetyyearlinks.csv')

    urls = df['url']
    for url in urls:
        contents.append(url)

    for url in contents:
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        addtable = soup.find_all('a', href=True)

"I can only get the list of weblinks; I can't get the URLs or the data within those weblinks. The code keeps printing an array, and I'm not sure where my code went wrong. Thanks for your help, and many thanks in advance."

Best Answer

When requesting the page, add a User-Agent header:

    headers = {'User-Agent':
               'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
    mainurl = "https://aviation-safety.net/database/dblist.php?Year=1919"

    def getAndParseURL(mainurl):
        result = requests.get(mainurl, headers=headers)
        soup = BeautifulSoup(result.content, 'html.parser')
        datatable = soup.select('a[href*="database/record"]')
        return datatable

    print(getAndParseURL(mainurl))
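The snippet above covers step (a) only. Both steps can be sketched as functions that take raw HTML, so the link extraction and table parsing can be tested without network access; the names `extract_record_urls` and `extract_record_fields` are mine, and the two-column caption/value layout of the record table is an assumption about the site's markup, which may change:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "https://aviation-safety.net/"

def extract_record_urls(listing_html, base=BASE):
    """Return absolute record-page URLs found in a year-listing page."""
    soup = BeautifulSoup(listing_html, 'html.parser')
    # record links look like "database/record.php?id=..."; urljoin resolves
    # both relative and root-relative hrefs against the site root
    return [urljoin(base, a['href'])
            for a in soup.select('a[href*="database/record"]')]

def extract_record_fields(record_html):
    """Parse a two-column detail table into a dict of field name -> value.

    Assumes each row holds a caption cell followed by a value cell
    (e.g. "Date:" / "02-AUG-1919"); adjust the selectors if the
    actual record-page layout differs.
    """
    soup = BeautifulSoup(record_html, 'html.parser')
    fields = {}
    for row in soup.select('table tr'):
        cells = row.find_all('td')
        if len(cells) >= 2:
            key = cells[0].get_text(strip=True).rstrip(':')
            fields[key] = cells[1].get_text(strip=True)
    return fields
```

In practice you would fetch each year page and each record page with `requests.get(url, headers=headers)` (with the User-Agent header from the answer) and pass the response's `.text` to these functions.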

Regarding "python - How to retrieve URLs from a list of weblinks, and the data within those URLs", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57511320/
