python - Unable to retrieve links and sub-links

Reposted. Author: 行者123. Updated: 2023-12-03 01:47:47

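I'm new to Python and Beautiful Soup. I need to scrape all the links from a web page so I can index them in Elasticsearch. I'm using the code below to get every link and sub-link inside the page, but it doesn't retrieve any of them.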

from bs4 import BeautifulSoup
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

urlFile = urllib2.urlopen("http://pubs.vmware.com/sddc-mgr-12/index.jsp#com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html")

urlHtml = urlFile.read()
urlFile.close()

soup = BeautifulSoup(urlHtml, "html.parser")
urlAll = soup.find_all("a")
for links in soup.find_all('a'):
    print(links.get('href'))

I can't get any links or sub-links: print() produces no output.

Any pointers would be appreciated.

Best answer

The data you want is loaded through an AJAX call.
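
A likely contributing factor is that everything after `#` in a URL is a fragment, which stays on the client and is not sent to the server, so the request in the question only asks for `index.jsp`. A minimal sketch, assuming Python 3 and using only the standard library, that shows the split:

```python
from urllib.parse import urldefrag

page_url = (
    "http://pubs.vmware.com/sddc-mgr-12/index.jsp"
    "#com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html"
)

# urldefrag separates the part that goes over HTTP from the fragment.
requested, fragment = urldefrag(page_url)

print(requested)  # the URL the server actually receives
print(fragment)   # kept client-side, never part of the HTTP request
```

The topic named in the fragment is therefore never part of the response that Beautiful Soup parses.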

Replace
http://pubs.vmware.com/sddc-mgr-12/index.jsp#com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html
with
http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment
and change the element type passed to find_all to node:

from bs4 import BeautifulSoup
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

urlFile = urllib2.urlopen("http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment")

urlHtml = urlFile.read()
urlFile.close()

soup = BeautifulSoup(urlHtml, "html.parser")
for links in soup.find_all('node'):
    print(links.get('href'))

Which outputs:
../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html
../topic/com.vmware.vcf.ovdeploy.doc_21/GUID-F2DCF1B2-4EF6-444E-80BA-8F529A6D0725.html
../topic/com.vmware.vcf.admin.doc_211/GUID-D5A44DAA-866D-47C9-B1FB-BF9761F97E36.html
../topic/com.vmware.ICbase/PDF/ic_pdf.html
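
The hrefs printed above are relative, so they probably need to be made absolute before being indexed. A minimal sketch, assuming Python 3 and that the links resolve against the tocfragment URL they were fetched from; `absolute_link` is a hypothetical helper name, not part of the original answer:

```python
from urllib.parse import urljoin

TOC_URL = "http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment"

def absolute_link(href, base=TOC_URL):
    """Resolve a relative href against the URL it was fetched from."""
    return urljoin(base, href)

# Two of the relative hrefs from the output above.
hrefs = [
    "../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html",
    "../topic/com.vmware.ICbase/PDF/ic_pdf.html",
]

for href in hrefs:
    print(absolute_link(href))
```

Here `..` steps out of `advanced/`, so the results land under `http://pubs.vmware.com/sddc-mgr-12/topic/`.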

Note that each time you click an item in the left panel, an AJAX call is fired to populate the list. For example:
http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment?toc=/com.vmware.evosddc.via.doc_211/toc.xml

Take this particular URL fragment: com.vmware.evosddc.via.doc_211. You need to extract that part from the first output in order to request the second output, and so on.

Example:
soup = BeautifulSoup(urlHtml, "html.parser")
for links in soup.find_all('node'):
    child_url = links.get('href').replace("../topic/", "")
    child = urllib2.urlopen("http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment?toc=/" + child_url[0:child_url.index("/")])
    print(child.read())
    #print (links.get('href'))

Which outputs:
<?xml version="1.0" encoding="UTF-8"?>
<tree_data>
<node
path="0"
title="VIA User&apos;s Guide"
id="/com.vmware.evosddc.via.doc_211/toc.xml"
href="../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html"
image="toc_closed">
</node>

...
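
In the response above, the `id` attribute has the same value as the `toc` parameter in the example request URL shown earlier. The sketch below relies on that observation, which is an assumption drawn from this single example. It parses a hand-written sample shaped like the response (closed so that it is well-formed) with the standard library's `xml.etree.ElementTree` rather than Beautiful Soup, and builds the follow-up URL for each node. `iter_nodes` and `child_toc_url` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

TOC_URL = "http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment"

# Hand-written sample modelled on the response above; not real server output.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tree_data>
<node
 path="0"
 title="VIA User&apos;s Guide"
 id="/com.vmware.evosddc.via.doc_211/toc.xml"
 href="../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html"
 image="toc_closed">
</node>
</tree_data>
"""

def iter_nodes(xml_bytes):
    """Yield (title, href, node_id) for every <node> element, at any depth."""
    root = ET.fromstring(xml_bytes)
    for node in root.iter("node"):
        yield node.get("title"), node.get("href"), node.get("id")

def child_toc_url(node_id):
    """Build the follow-up request URL, assuming the node id is the toc value."""
    return TOC_URL + "?toc=" + quote(node_id)

for title, href, node_id in iter_nodes(SAMPLE):
    print(title, href, child_toc_url(node_id))
```

To walk the whole tree, each URL returned by `child_toc_url` could be fetched and passed back through `iter_nodes`. Whether every node carries an `id` that works this way would need checking against the live responses.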

This post is based on a similar question found on Stack Overflow, "python - Unable to retrieve links and sub-links": https://stackoverflow.com/questions/42511164/
