python - BeautifulSoup find_all cannot get div data

Reposted · Author: 太空宇宙 · Updated: 2023-11-03 16:24:59

I'm trying to get HTML data from the website, but data_table comes back empty. When I traced through the code and tried to fetch the header data instead, it returned the HTML content.

import requests
from bs4 import BeautifulSoup
import html.parser
from html.parser import HTMLParser
import time
from random import randint
import sys
from IPython.display import clear_output
import pymysql

links = ['https://www.ptt.cc/bbs/Gossiping/index' + str(i+1) + '.html' for i in range(10)]
data_links = []

for link in links:
    res = requests.get(link)
    soup = BeautifulSoup(res.text.encode("utf-8"), "html.parser")
    data_table = soup.findAll("div", {"id": "r-ent"})
    print(data_table)

Best Answer

When you visit the page in a browser, you have to confirm that you are over 18 before you can see the actual content, so that confirmation page is what you are getting back. You need to send a POST to https://www.ptt.cc/ask/over18 with the data yes=yes and from="/bbs/Gossiping/index{the_number}.html". If you print the returned source, you can see the form:

<form action="/ask/over18" method="post">
<input type="hidden" name="from" value="/bbs/Gossiping/index1.html">
<div class="over18-button-container">
<button class="btn-big" type="submit" name="yes" value="yes">我同意,我已年滿十八歲<br><small>進入</small></button>
</div>
<div class="over18-button-container">
<button class="btn-big" type="submit" name="no" value="no">未滿十八歲或不同意本條款<br><small>離開</small></button>
</div>
</form>

Also, there is no element with the id r-ent on the page, only divs with that class:

import requests
from bs4 import BeautifulSoup

links = ['https://www.ptt.cc/bbs/Gossiping/index{}.html'.format(i) for i in range(1, 11)]
data_links = []
data = {"yes": "yes"}
head = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"}

for ind, link in enumerate(links, 1):
    with requests.Session() as s:
        data["from"] = "/bbs/Gossiping/index{}.html".format(ind)
        s.post("https://www.ptt.cc/ask/over18", data=data, headers=head)
        res = s.get(link, headers=head)
        soup = BeautifulSoup(res.text, "html.parser")
        data_divs = soup.select("div.r-ent")
        print(data_divs)

The code above gets all the divs with the r-ent class.

Posting once with a session should be enough, since the cookie is stored, so the following code should also work:

links = ['https://www.ptt.cc/bbs/Gossiping/index{}.html'.format(i) for i in range(1, 11)]
data_links = []
data = {"yes": "yes"}
head = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"}

with requests.Session() as s:
    data["from"] = "/bbs/Gossiping/index1.html"
    s.post("https://www.ptt.cc/ask/over18", data=data, headers=head)
    for link in links:
        res = s.get(link, headers=head)
        soup = BeautifulSoup(res.text, "html.parser")
        data_divs = soup.select("div.r-ent")
        print(data_divs)
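Once you have the div.r-ent elements, you usually want to pull the title, link, and author out of each one. The sketch below shows one way to do that with BeautifulSoup alone, using a hard-coded HTML snippet that mimics PTT's listing markup (the exact class names inside r-ent — title, meta, author — are assumptions based on the page structure, not part of the answer above):

```python
from bs4 import BeautifulSoup

# Illustrative snippet mimicking one PTT listing row (structure assumed).
html = """
<div class="r-ent">
  <div class="nrec"><span class="hl f3">12</span></div>
  <div class="title"><a href="/bbs/Gossiping/M.1467000000.A.ABC.html">[問卦] Example post</a></div>
  <div class="meta"><div class="author">someuser</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
posts = []
for div in soup.select("div.r-ent"):
    a = div.select_one("div.title a")  # deleted posts have no <a>, so guard against None
    if a is None:
        continue
    posts.append({
        "title": a.get_text(strip=True),
        "url": "https://www.ptt.cc" + a["href"],   # hrefs on the page are site-relative
        "author": div.select_one("div.meta div.author").get_text(strip=True),
    })

print(posts)
```

In the real loop you would feed `res.text` to BeautifulSoup instead of the hard-coded snippet and append each dict to data_links.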

Regarding "python - BeautifulSoup find_all cannot get div data", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/38071406/
