python - BeautifulSoup web scraping dentistry forum target titles

Reposted. Author: 行者123. Updated: 2023-12-05 04:19:14

I am trying to scrape a forum to generate a dataset for analysis.

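When selecting a thread, each class name has a trailing unique number. How can I scrape and return a list of strings containing all the class names, for example: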

["structItem structItem--thread js-inlineModContainer js-threadListItem-00001", 
"structItem structItem--thread js-inlineModContainer js-threadListItem-00394",
"structItem structItem--thread js-inlineModContainer js-threadListItem-00045"...]

from bs4 import BeautifulSoup
import requests

url = "https://www.dentistry-forums.com/forums/periodontics.11/"
result = requests.get(url).text
doc = BeautifulSoup(result, "html.parser")

# Walk down the page layout one wrapper div at a time to reach the thread list.
# Note: find() returns None when a level is missing, so any step can fail.
bbody = doc.body
bbody = bbody.find('div', attrs={'class': "p-pageWrapper"})
bbody = bbody.find('div', attrs={'class': "p-body"})
bbody = bbody.find('div', attrs={'class': "p-body-inner"})
bbody = bbody.find('div', attrs={'class': "p-body-main"})
bbody = bbody.find('div', attrs={'class': "p-body-content"})
bbody = bbody.find('div', attrs={'class': "p-body-pageContent"})
bbody = bbody.find('div', attrs={'class': "block"})
bbody = bbody.find('div', attrs={'class': "block-container"})
bbody = bbody.find('div', attrs={'class': "block-body"})
bbody = bbody.find('div', attrs={'class': "structItemContainer"})
bbody = bbody.find('div', attrs={'class': "structItemContainer-group js-threadList"})
print(bbody.prettify())
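
For comparison, the nested find() calls above can usually be collapsed into a single CSS selector, because select_one() searches all descendants rather than only direct children. The following is a minimal sketch, assuming the page nests the thread list the way the question's code implies. The inline SAMPLE_HTML is a hypothetical, trimmed-down stand-in for the live page so that the snippet runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the forum page (not the real markup).
SAMPLE_HTML = """
<body>
  <div class="p-pageWrapper">
    <div class="p-body">
      <div class="block-body">
        <div class="structItemContainer">
          <div class="structItemContainer-group js-threadList">
            <div class="structItem structItem--thread js-threadListItem-00001"></div>
          </div>
        </div>
      </div>
    </div>
  </div>
</body>
"""

doc = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One selector replaces the chain of find() calls. Each ".name" matches a
# single class token, so the order of classes in the attribute does not matter.
thread_list = doc.select_one("div.structItemContainer-group.js-threadList")

# select_one() returns None when nothing matches, so guard before using it.
if thread_list is None:
    raise SystemExit("thread list container not found")

rows = thread_list.select("div.structItem--thread")
print(len(rows))
```

Selecting by individual class tokens also avoids depending on the exact text of a multi-class attribute such as "structItemContainer-group js-threadList".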

Best answer

To get the thread titles, their URLs, and their class names, you can use the following example:

import requests
from bs4 import BeautifulSoup

url = "https://www.dentistry-forums.com/forums/periodontics.11/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for a in soup.select(".structItem-title a"):
    # The nearest preceding element with a data-author attribute is the
    # thread's container; its "class" value is a list of class tokens.
    class_names = " ".join(
        a.find_previous(attrs={"data-author": True})["class"]
    )

    print("Title:", a.text)
    print("URL:", "https://www.dentistry-forums.com" + a["href"])
    print("Classnames:", class_names)
    print()

Prints:


...

Title: Deep Cleaning
URL: https://www.dentistry-forums.com/threads/deep-cleaning.26561/
Classnames: structItem structItem--thread js-inlineModContainer js-threadListItem-26561

Title: Deep Cleaning Charges
URL: https://www.dentistry-forums.com/threads/deep-cleaning-charges.26548/
Classnames: structItem structItem--thread js-inlineModContainer js-threadListItem-26548

Title: Advanced GD and loose teeth
URL: https://www.dentistry-forums.com/threads/advanced-gd-and-loose-teeth.26519/
Classnames: structItem structItem--thread js-inlineModContainer js-threadListItem-26519

Title: Recurrent thrush centered around one tooth
URL: https://www.dentistry-forums.com/threads/recurrent-thrush-centered-around-one-tooth.26513/
Classnames: structItem structItem--thread js-inlineModContainer js-threadListItem-26513
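
The question asked for a list of strings rather than printed output. Below is a minimal sketch of one way to build that list, assuming each thread row is a div carrying the structItem--thread class, as the output above suggests. The inline SAMPLE_HTML and the helpers thread_class_names and thread_ids are hypothetical names introduced for illustration; the sample markup stands in for the live page so that the snippet runs offline.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample mimicking the forum's thread-list markup.
SAMPLE_HTML = """
<div class="structItemContainer-group js-threadList">
  <div class="structItem structItem--thread js-inlineModContainer js-threadListItem-26561" data-author="a">
    <div class="structItem-title"><a href="/threads/deep-cleaning.26561/">Deep Cleaning</a></div>
  </div>
  <div class="structItem structItem--thread js-inlineModContainer js-threadListItem-26548" data-author="b">
    <div class="structItem-title"><a href="/threads/deep-cleaning-charges.26548/">Deep Cleaning Charges</a></div>
  </div>
</div>
"""


def thread_class_names(html):
    """Return the full class string of every thread row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # BeautifulSoup stores "class" as a list of tokens, so join them back.
    return [" ".join(div["class"]) for div in soup.select("div.structItem--thread")]


def thread_ids(class_names):
    """Extract the trailing number from each js-threadListItem-N class."""
    pattern = re.compile(r"js-threadListItem-(\d+)")
    ids = []
    for name in class_names:
        match = pattern.search(name)
        if match:
            ids.append(match.group(1))
    return ids


names = thread_class_names(SAMPLE_HTML)
print(names)
print(thread_ids(names))
```

To run this against the live forum, the html argument could be the response body fetched with requests, as in the answer above. Whether the real page matches the sample markup is an assumption that should be checked.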

Regarding python - BeautifulSoup web scraping dentistry forum target titles, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/74889791/
