
python - How can I get PDF links from a website with a Python script

Reposted. Author: 行者123. Updated: 2023-11-28 20:52:49

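I often need to download PDFs from websites, but sometimes they are not all on one page. The links are split across paginated pages, and I have to click through every page to get them.

I am learning Python, and I would like to write a script where I can enter a web URL and have it extract the PDF links from that site.

I am new to Python, so could anyone give me some guidance on how to do this?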

Best answer

This is pretty simple with urllib2, urlparse, and lxml. Since you are new to Python, I have commented things more verbosely:

# modules we're using (you'll need to download lxml);
# urllib2 and urlparse are Python 2 standard-library modules
import lxml.html, urllib2, urlparse

# the url of the page you want to scrape
base_url = 'http://www.renderx.com/demos/examples.html'

# fetch the page
res = urllib2.urlopen(base_url)

# parse the response into an xml tree
tree = lxml.html.fromstring(res.read())

# construct a namespace dictionary to pass to the xpath() call
# this lets us use regular expressions in the xpath
ns = {'re': 'http://exslt.org/regular-expressions'}

# iterate over all <a> tags whose href ends in ".pdf" (case-insensitive)
for node in tree.xpath(r'//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):

    # print the href, joining it to the base_url
    print urlparse.urljoin(base_url, node.attrib['href'])

Result:

http://www.renderx.com/files/demos/examples/Fund.pdf
http://www.renderx.com/files/demos/examples/FundII.pdf
http://www.renderx.com/files/demos/examples/FundIII.pdf
...
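
The answer's code targets Python 2 and a single page. As a rough Python 3 sketch of the same idea, the block below swaps lxml for the standard library's `html.parser` and applies the same case-insensitive `.pdf` suffix check to each href. To touch on the pagination mentioned in the question, it also loops over a list of page URLs. The names `extract_pdf_links`, `collect_pdf_links`, `fetch`, and `fetch_page` are illustrative and not from the original answer. How the list of page URLs is obtained is left open, because the question does not say how the pages are linked. Decoding every page as UTF-8 is likewise an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value.strip())


def extract_pdf_links(html, base_url):
    """Return absolute URLs for all hrefs in `html` ending in ".pdf" (case-insensitive)."""
    collector = _LinkCollector()
    collector.feed(html)
    return [
        urljoin(base_url, href)  # resolve relative links against the page URL
        for href in collector.hrefs
        if href.lower().endswith(".pdf")
    ]


def fetch(url):
    """Download a page as text (assumption: UTF-8, undecodable bytes replaced)."""
    with urlopen(url) as res:
        return res.read().decode("utf-8", errors="replace")


def collect_pdf_links(page_urls, fetch_page=fetch):
    """Gather PDF links from several pages, skipping duplicates and keeping order."""
    found = []
    for url in page_urls:
        for link in extract_pdf_links(fetch_page(url), url):
            if link not in found:
                found.append(link)
    return found
```

For a single page, `collect_pdf_links([base_url])` with the `base_url` from the answer plays the role of the loop above. For a paginated site, pass one URL per page. The `fetch_page` parameter exists so the download step can be replaced, for example with a version that adds a delay between requests.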

Regarding "python - How can I get PDF links from a website with a Python script", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/6222911/
