python - Scraping nested comments on Reddit with BeautifulSoup

This code fetches the listing pages. My problem is that I need to scrape the contents of the users' comments, not the number of comments. The content is nested inside the comments-count section, but I'm not sure how to access that link and then parse and scrape the user comments.

import time

import requests
from bs4 import BeautifulSoup

request_list = []
id_list = [0]

for i in range(0, 200, 25):
    response = requests.get("https://www.reddit.com/r/CryptoCurrency/?count=" + str(i) + "&after=" + str(id_list[-1]),
                            headers={'User-agent': 'No Bot'})
    soup = BeautifulSoup(response.content, 'lxml')
    request_list.append(soup)
    # The 'data-fullname' of the last post on the page becomes the 'after'
    # cursor for the next request.
    id_list.append(soup.find_all('div', attrs={'data-type': 'link'})[-1]['data-fullname'])
    print(i, id_list)
    if i % 100 == 0:
        time.sleep(1)

Below I tried to write a function that should access the nested comments, but I can't work out how.

def extract_comment_contents(request_list):
    comment_contents_list = []
    for i in request_list:
        # Bug: 'response' here is whatever the last request in the loop above
        # left behind; the items in request_list are already-parsed soup
        # objects, not responses, so this check is stale.
        if response.status_code == 200:
            # Bug: attrs matching is exact, so this only finds anchors whose
            # data-inbound-url is exactly '/r/CryptoCurrency/comments/',
            # which real comment permalinks never are.
            for each in i.find_all('a', attrs={'data-inbound-url': '/r/CryptoCurrency/comments/'}):
                comment_contents_list.append(each.text)
        else:
            print("Call failed at request ", i)
    return comment_contents_list



fetch_comment_contents_list = extract_comment_contents(request_list)

print(fetch_comment_contents_list)

Best Answer

For each thread, you need to send another request to fetch its comments page. The comments-page URLs can be found with soup.find_all('a', class_='bylink comments may-blank'), which returns all the a tags that link to the comments pages. Here is an example of getting into a comments page.

import requests
from bs4 import BeautifulSoup

# Reuse the same User-agent header as in the question; Reddit throttles the
# default requests user agent.
headers = {'User-agent': 'No Bot'}

r = requests.get('https://www.reddit.com/r/CryptoCurrency/?count=0&after=0', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

for comments_tag in soup.find_all('a', class_='bylink comments may-blank', href=True):
    url = comments_tag['href']
    r2 = requests.get(url, headers=headers)
    # Use a new name so we don't overwrite the listing-page soup mid-iteration.
    comments_soup = BeautifulSoup(r2.text, 'lxml')
    # Your job is to parse this soup object and get all the comments.
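A minimal sketch of that last step, assuming old Reddit's markup of the time, where each rendered comment body is a div with class "md" (that class name is an assumption about Reddit's HTML and may have changed):

def extract_comments(comments_soup):
    # Collect the text of every rendered comment body on the page. Note that
    # div.md also matches the self-post body at the top, so this is a rough cut.
    texts = []
    for body in comments_soup.find_all('div', class_='md'):
        texts.append(body.get_text(' ', strip=True))
    return texts

To restrict it to comments only, you could first narrow the search to the comments container, e.g. something like comments_soup.find('div', class_='commentarea'), if that container is present on the page.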

For more on python - scraping nested comments on Reddit with beautifulsoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49389636/
