
python - Looping through different links on a website and scraping certain information

Reposted · Author: 太空宇宙 · Updated: 2023-11-03 19:49:44

Good afternoon everyone. I'm hoping someone can help me with a problem related to looping through multiple links on a website; any help is much appreciated. The code below pulls the information I need from the first link and builds the DataFrame I need to present it. But there are more than 600 links on the site, and I don't know how to handle the rest.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# %matplotlib inline  (Jupyter magic, commented out here)
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://auctions.royaltyexchange.com/auctions_overview/"
html = urlopen("https://auctions.royaltyexchange.com/auctions/jay-zs-multi-platinum-empire-state-of-mind/?origin=overview&filter_value=overview")

soup = BeautifulSoup(html, 'lxml')
# Get the auction title
title = soup.find('h1', class_='title -auction-page -dark').text.strip()
title
data = {'Name':['Title',title]}

df_title = pd.DataFrame(data)

irr = soup.find('span',attrs={'id':'current-irr'}).text.strip()
irr
data = {'value' : ['theoretical IRR',irr]}
df_irr = pd.DataFrame(data)

table = soup.find('table', class_='es-overview-table')
table_rows = table.find_all('tr')

res = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.text.strip() for cell in td if cell.text.strip()]
    if row:
        res.append(row)

df_table = pd.DataFrame(res).transpose()

df_final = pd.concat([df_title,df_irr ,df_table], axis=1, ignore_index = True)
df_final.head()
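Since every auction page exposes the same fields, the single-page logic above can be wrapped in one function and applied to any URL. Below is a minimal sketch: the selectors are copied from the code above, the function takes raw HTML (so it can be fed `urlopen(url)` output or test markup), and the stdlib `html.parser` stands in for `lxml` to avoid an extra dependency. The sample fragment at the bottom is made up for illustration.

```python
from bs4 import BeautifulSoup

def scrape_auction(html):
    """Parse one auction page's HTML into a dict of field -> value.

    Selectors are taken from the single-page code above; html.parser
    is used instead of lxml so the sketch has no extra dependency.
    """
    soup = BeautifulSoup(html, 'html.parser')
    row = {
        'Title': soup.find('h1', class_='title -auction-page -dark').text.strip(),
        'theoretical IRR': soup.find('span', attrs={'id': 'current-irr'}).text.strip(),
    }
    # The overview table holds label/value pairs, one per row.
    table = soup.find('table', class_='es-overview-table')
    for tr in table.find_all('tr'):
        cells = [td.text.strip() for td in tr.find_all('td') if td.text.strip()]
        if len(cells) == 2:
            row[cells[0]] = cells[1]
    return row

# Demo on a tiny HTML fragment mimicking the page structure (invented for illustration):
sample = """
<h1 class="title -auction-page -dark"> Demo Auction </h1>
<span id="current-irr">12.5%</span>
<table class="es-overview-table">
  <tr><td>Genre</td><td>Hip-Hop</td></tr>
</table>
"""
print(scrape_auction(sample))
```

Each call returns one dict, so the 600+ links reduce to a loop that collects dicts and hands them to `pd.DataFrame`.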

Best Answer

You can use the following to collect all the auction links across every page.

from urllib.request import urlopen
import re
from bs4 import BeautifulSoup

raw_url = "https://auctions.royaltyexchange.com/"
def get_link(page_num):
    link_ls = []
    for page in range(1, page_num + 1):
        url = "https://auctions.royaltyexchange.com/auctions_overview/?origin=overview&page=" + str(page)
        html = urlopen(url)
        bs = BeautifulSoup(html, 'html.parser')

        # Auction links all have hrefs starting with /auctions/
        for link in bs.find('div', {'class': '-list'}).findAll('a', href=re.compile("^(/auctions/)")):
            print(link.attrs['href'])
            link_ls.append(raw_url + link.attrs['href'])
    return link_ls

link_list = get_link(55) # the last page number

link_list

['https://auctions.royaltyexchange.com//auctions/hip-hop-royalties-danileighs-lil-bebe/?origin=overview&filter_value=overview',
'https://auctions.royaltyexchange.com//auctions/k-pop-publishing-featuring-exo-and-tvxq/?origin=overview&filter_value=overview',
'https://auctions.royaltyexchange.com//auctions/jay-zs-multi-platinum-empire-state-of-mind/?origin=overview&filter_value=overview',
'https://auctions.royaltyexchange.com//auctions/film-royalties-classic-comedy-trading-places/?origin=overview&filter_value=overview',
'https://auctions.royaltyexchange.com//auctions/ben-jerrys-cherry-garcia-trademark-royalties/?origin=overview&filter_value=overview',
'https://auctions.royaltyexchange.com//auctions/the-doobie-brothers-black-water-more/?origin=overview&filter_value=overview',
'https://auctions.royaltyexchange.com//auctions/dirty-dancings-ive-had-the-time-of-my-life/?origin=overview&filter_value=overview',
'https://auctions.royaltyexchange.com//auctions/multi-platinum-hip-hop-collection/?origin=overview&filter_value=overview',
...
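One caveat with `get_link(55)` is the hard-coded last page number. A common alternative is to keep fetching pages until one comes back empty. A small generic sketch: the `fetch_page` callable is a stand-in for the link-extraction body of `get_link`, and the demo fetcher below is made up so the idea can be shown without network access.

```python
def collect_until_empty(fetch_page, max_pages=200):
    """Call fetch_page(1), fetch_page(2), ... and accumulate results
    until a page returns nothing (max_pages is a safety cap)."""
    results = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        results.extend(batch)
    return results

# Demo with a fake fetcher: pages 1-3 each yield two links, page 4 is empty.
def fake_fetch(page):
    if page > 3:
        return []
    return [f"/auctions/item-{page}-{i}/" for i in range(2)]

links = collect_until_empty(fake_fetch)
print(len(links))  # 6
```

In practice you would pass a function that fetches one overview page and returns its list of `/auctions/` hrefs; when the site adds or removes pages, the loop adapts on its own.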

Then, on each page, pick out the fields you want (e.g. title, IRR, the overview table) and assemble them into a DataFrame, just as in your single-page code.
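Once each page yields a dict of fields, the rows can be stacked into one DataFrame, one row per auction. A sketch with made-up values standing in for real scraped rows:

```python
import pandas as pd

# Illustrative rows; in practice each dict would come from scraping one auction link.
rows = [
    {'Title': 'Empire State of Mind', 'theoretical IRR': '12.5%'},
    {'Title': 'Black Water & More',   'theoretical IRR': '9.8%'},
]
df_final = pd.DataFrame(rows)

# Convert the IRR strings to numeric fractions so they can be sorted or plotted.
df_final['theoretical IRR'] = (
    df_final['theoretical IRR'].str.rstrip('%').astype(float) / 100
)
print(df_final)
```

Building the frame once from a list of dicts is also much faster than concatenating a small DataFrame per link, and missing fields simply become NaN.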

Regarding "python - Looping through different links on a website and scraping certain information", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59918464/
