
Python HTML Parser Pagination

Reposted · Author: 太空宇宙 · Updated: 2023-11-03 19:56:44

I'm new to Python. I've managed to scrape the reviews with an HTML parser, but I'm stuck on how to paginate through the reviews at the bottom of the site.

The URL is in the PasteBin code; I've omitted it from this thread for privacy reasons.

Any help is greatly appreciated.

# Reviews Scrape

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'EXAMPLE.COM'

# open the connection and grab the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# HTML Parsing
page_soup = soup(page_html, "html.parser")

# Grabs each review
reviews = page_soup.findAll("div",{"class":"jdgm-rev jdgm-divider-top"})

filename = "compreviews.csv"
f = open(filename, "w")

headers = "Score, Title, Content\n"

f.write(headers)
# HTML Lookup Location per website and strips spacing
for container in reviews:
    # score = container.div.div.span["data-score"]
    score = container.findAll("span", {"data-score": True})
    user_score = score[0].text.strip()

    title_review = container.findAll("b", {"class": "jdgm-rev__title"})
    user_title = title_review[0].text.strip()

    content_review = container.findAll("div", {"class": "jdgm-rev__body"})
    user_content = content_review[0].text.strip()

    print("user_score:" + score[0]['data-score'])
    print("user_title:" + user_title)
    print("user_content:" + user_content)

    f.write(score[0]['data-score'] + "," + user_title + "," + user_content + "\n")

f.close()
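One pitfall worth noting in the snippet above, separate from the pagination question: joining fields with a literal "," in f.write corrupts the CSV whenever a review title or body itself contains a comma. The standard-library csv module quotes such fields automatically. A minimal illustration (sample strings are made up):

```python
import csv
import io

# Write one row where the middle field contains a comma.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["5", "Great product, would buy again", "Fast shipping"])

# csv.writer quotes the field containing the comma, so the row
# still parses back into exactly three columns.
print(buf.getvalue())
```

The same csv.writer can be pointed at the open file object (opened with newline='') instead of a StringIO buffer.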

Best Answer

The page fetches results via an XHR GET request whose query string carries parameters for reviews per page and page number. You can make an initial request with the maximum of 31 reviews per page, extract the HTML from the returned JSON to get the page count, then loop over the remaining pages and collect the results. Example construct below:

import requests
from bs4 import BeautifulSoup as bs

start_url = 'https://urlpart&page=1&per_page=31&product_id=someid'

with requests.Session() as s:
    r = s.get(start_url).json()
    soup = bs(r['html'], 'lxml')
    print([i.text for i in soup.select('.jdgm-rev__author')])
    print([i.text for i in soup.select('.jdgm-rev__title')])
    total_pages = int(soup.select_one('.jdgm-paginate__last-page')['data-page'])

    for page in range(2, total_pages + 1):
        r = s.get(f'https://urlpart&page={page}&per_page=31&product_id=someid').json()
        soup = bs(r['html'], 'lxml')
        print([i.text for i in soup.select('.jdgm-rev__author')])
        print([i.text for i in soup.select('.jdgm-rev__title')])  # etc.

Example writing a CSV via a pandas DataFrame:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

start_url = 'https://urlpart&page=1&per_page=31&product_id=someid'

authors = []
titles = []

with requests.Session() as s:
    r = s.get(start_url).json()
    soup = bs(r['html'], 'lxml')
    authors.extend([i.text for i in soup.select('.jdgm-rev__author')])
    titles.extend([i.text for i in soup.select('.jdgm-rev__title')])
    total_pages = int(soup.select_one('.jdgm-paginate__last-page')['data-page'])

    for page in range(2, total_pages + 1):
        r = s.get(f'https://urlpart&page={page}&per_page=31&product_id=someid').json()
        soup = bs(r['html'], 'lxml')
        authors.extend([i.text for i in soup.select('.jdgm-rev__author')])
        titles.extend([i.text for i in soup.select('.jdgm-rev__title')])  # etc.

headers = ['Author', 'Title']
df = pd.DataFrame(zip(authors, titles), columns=headers)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8', index=False)
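The examples above collect only authors and titles, while the question also needs the numeric score and the review body. Those can be extracted from the same soup per page: the score via an attribute selector on data-score, the body via its class. A minimal sketch against a hypothetical inline fragment mimicking the Judge.me markup (class names taken from the question; the real page may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical review markup for illustration only.
html = '''
<div class="jdgm-rev jdgm-divider-top">
  <span class="jdgm-rev__rating" data-score="5"></span>
  <b class="jdgm-rev__title">Great kit</b>
  <div class="jdgm-rev__body">Works as advertised.</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
scores = [s['data-score'] for s in soup.select('span[data-score]')]
titles = [t.text for t in soup.select('.jdgm-rev__title')]
bodies = [b.text for b in soup.select('.jdgm-rev__body')]
print(scores, titles, bodies)  # ['5'] ['Great kit'] ['Works as advertised.']
```

Inside the page loop you would extend three lists instead of two, then zip all three into the DataFrame.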

For this question on Python HTML parser pagination, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59494824/
