Python 2.7 - Searching a web page for a specific URL using ajax


I need to retrieve a URL embedded in a web page. I tried the following code, but it does not find the URL of the main link (which points to a PDF).

import urllib2
from bs4 import BeautifulSoup

url = "http://www.cmc.gv.ao/sites/main/pt/Paginas/genericFileList.aspx?mid=9&smid=69&FilterField1=TipoConteudo_x003A_Code&FilterValue1=ENTREG"

conn = urllib2.urlopen(url)
html = conn.read()

soup = BeautifulSoup(html)
links = soup.find_all('a')

for tag in links:
    link = tag.get('href', None)
    if link is not None:
        print link

The URL I am looking for is the main link on the page:

http://www.cmc.gv.ao/sites/main/pt/Lists/CMC%20%20PublicaesFicheiros/Attachments/89/Lista%20de%20Institui%C3%A7%C3%B5es%20Registadas%20(actualizado%2024.10.16).pdf

The bs4 documentation says that the find_all() method looks through a tag's descendants (direct children, children of direct children, and so on) and retrieves all descendants that match your filters.
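That is true, but find_all() can only see tags that exist in the HTML string you hand to BeautifulSoup; anything injected later by JavaScript/ajax is invisible to it. A minimal sketch (the markup here is made up for illustration):

from bs4 import BeautifulSoup

# hypothetical markup: only one <a> tag exists in the static HTML;
# a link added later by JavaScript would never appear in this string
html = '<html><body><a href="/static.pdf">static</a></body></html>'

soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all('a'):
    print tag.get('href')  # prints only /static.pdf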

How can I get that URL from the web page?

Best Answer

The pdf path is retrieved with an ajax request; you need to do a bit of work to mimic that request:

import urllib2

from bs4 import BeautifulSoup
import re

url = "http://www.cmc.gv.ao/sites/main/pt/Paginas/genericFileList.aspx?mid=9&smid=69&FilterField1=TipoConteudo_x003A_Code&FilterValue1=ENTREG"

conn = urllib2.urlopen(url)
html = conn.read()

# we need to pass in the getbyid value which we parse later
attach = "http://www.cmc.gv.ao/sites/main/pt/_api/web/lists/getbyid('{}')/items(89)/AttachmentFiles"

soup = BeautifulSoup(html)

# the getbyid value is contained inside a script tag; this regex pulls what we need from it
patt = re.compile(r'ctx.editFormUrl\s+=\s+"(.*?)"')

# find that script.
scr = soup.find("script", text=re.compile("ctx.editFormUrl"))

# line we are getting looks like ctx.editFormUrl = "http://www.cmc.gv.ao/sites/main/pt/_layouts/15/listform.aspx?PageType=6&ListId=%7BC0527FB1%2D00D9%2D4BCF%2D8FFC%2DDFCAA9E9E51D%7D";
# we need the ListId

ctx = patt.search(scr.text).group(1)

# pull the ListId out of that url and drop it into the attach template
soup2 = BeautifulSoup(urllib2.urlopen(attach.format(ctx.rsplit("=")[-1])).read())

# ^^ returns xml; we need to find the pdf path in it, which starts with /sites/main/pt/List
pdf_path = soup2.find(text=re.compile("^/sites/main/pt/List"))
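To make the parsing step concrete, here is what that regex and rsplit produce when run against the sample editFormUrl line quoted in the comment above (a standalone sketch):

import re

line = 'ctx.editFormUrl = "http://www.cmc.gv.ao/sites/main/pt/_layouts/15/listform.aspx?PageType=6&ListId=%7BC0527FB1%2D00D9%2D4BCF%2D8FFC%2DDFCAA9E9E51D%7D";'
patt = re.compile(r'ctx.editFormUrl\s+=\s+"(.*?)"')
ctx = patt.search(line).group(1)

# everything after the last "=" is the url-encoded ListId
print ctx.rsplit("=")[-1]  # %7BC0527FB1%2D00D9%2D4BCF%2D8FFC%2DDFCAA9E9E51D%7D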

Then you need to join that path to the base url:

from urlparse import urljoin
# join our parsed path to the base
full_url = urljoin("http://www.cmc.gv.ao", pdf_path)
print(full_url)
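urljoin simply glues an absolute path onto the scheme and host of the base (a quick sketch with a shortened, hypothetical path):

from urlparse import urljoin

# hypothetical shortened path, just to show the joining behaviour
print urljoin("http://www.cmc.gv.ao", "/sites/main/pt/Lists/doc.pdf")
# -> http://www.cmc.gv.ao/sites/main/pt/Lists/doc.pdf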

You also need to quote and encode the path, since it contains spaces and non-ascii characters:

from urllib import quote
from urlparse import urljoin

# handle non-ascii and encode
full_url = urljoin("http://www.cmc.gv.ao", quote(pdf_path.encode("utf-8")))
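For instance, urllib.quote turns the accented characters in the file name into the percent-escapes you can see in the final url (a sketch with a shortened, hypothetical path):

from urllib import quote

# shortened, hypothetical path for illustration; the real one is longer
pdf_path = u"/sites/main/pt/Lists/Institui\u00e7\u00f5es.pdf"

print quote(pdf_path.encode("utf-8"))
# -> /sites/main/pt/Lists/Institui%C3%A7%C3%B5es.pdf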

Finally, to write the file:

from urlparse import urljoin
from urllib import quote
from os.path import basename

full_url = urljoin("http://www.cmc.gv.ao", quote(pdf_path.encode("utf-8")))

# save the pdf under its original file name
with open(basename(pdf_path.encode("utf-8")), "wb") as f:
    f.writelines(urllib2.urlopen(full_url))

That will give you a pdf file named Lista de Instituições Registadas (actualizado 24.10.16).pdf.

If you use requests, it does a lot of that work for you:

import requests
from bs4 import BeautifulSoup
import re
from urlparse import urljoin
from os.path import basename

url = "http://www.cmc.gv.ao/sites/main/pt/Paginas/genericFileList.aspx?mid=9&smid=69&FilterField1=TipoConteudo_x003A_Code&FilterValue1=ENTREG"

conn = requests.get(url)
html = conn.content
attach = "http://www.cmc.gv.ao/sites/main/pt/_api/web/lists/getbyid('{}')/items(89)/AttachmentFiles"
soup = BeautifulSoup(html)
patt = re.compile(r'ctx.editFormUrl\s+=\s+"(.*?)"')
scr = soup.find("script", text=re.compile("ctx.editFormUrl"))

ctx = patt.search(scr.text).group(1)

soup2 = BeautifulSoup(requests.get(attach.format(ctx.rsplit("=")[-1])).content)

pdf_path = soup2.find(text=re.compile("/sites/main/pt/List"))

full_url = urljoin("http://www.cmc.gv.ao", pdf_path.encode("utf-8"))

with open(basename(pdf_path.encode("utf-8")), "wb") as f:
    f.writelines(requests.get(full_url))
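For a large pdf it may be safer to stream the download rather than buffering the whole response; a sketch using requests' documented stream/iter_content API, continuing with the full_url and pdf_path variables from the snippet above:

import requests
from os.path import basename

# full_url and pdf_path come from the previous snippet
resp = requests.get(full_url, stream=True)

with open(basename(pdf_path.encode("utf-8")), "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        if chunk:  # skip keep-alive chunks
            f.write(chunk)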

Regarding Python 2.7 - searching a web page for a specific URL using ajax, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40218463/
