
python - Downloading a file with MechanicalSoup

Reposted. Author: 太空宇宙. Last updated: 2023-11-04 04:21:51

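I want to download the Excel file on this ONS webpage in Python using the MechanicalSoup package. I have read the MechanicalSoup documentation. I searched extensively for an example on StackOverflow and elsewhere, without success.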

My attempt is:

# Install dependencies
# pip install requests
# pip install BeautifulSoup4
# pip install MechanicalSoup

# Import libraries
import mechanicalsoup
import urllib.request
import requests
from bs4 import BeautifulSoup

# Create a browser object that can collect cookies
browser = mechanicalsoup.StatefulBrowser()

browser.open("https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

browser.download_link("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

For the last line, I have also tried:

browser.download_link(link="https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna",file="c:/test/filename.xls")

Update, 25 January 2019: thanks to AKX's comment below, I have also tried

browser.download_link(re.escape("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"))

In each case I get the error:

mechanicalsoup.utils.LinkNotFoundError

But the link does exist. Try pasting it into your address bar to confirm:

https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna

What am I doing wrong?

Update 2, 25 January 2019: thanks to AKX's answer below, here is a complete MWE that solves my problem (posted for anyone who runs into the same difficulty later):

# Install dependencies
# pip install requests
# pip install BeautifulSoup4
# pip install MechanicalSoup

# Import libraries
import mechanicalsoup
import urllib.request
import requests
from bs4 import BeautifulSoup
import re

# Create a browser object that can collect cookies
browser = mechanicalsoup.StatefulBrowser()

browser.open("https://www.ons.gov.uk/economy/grossdomesticproductgdp/timeseries/l2kq/qna")

browser.download_link(link_text=".xls", file="c:/py/ONS_Data.xls")

Best answer

I haven't used MechanicalSoup, but looking at the documentation:

This function behaves similarly to follow_link()

The documentation for follow_link says (emphasis mine):

  • If link is a bs4.element.Tag (i.e. from a previous call to links() or find_link()), then follow the link.
  • If link doesn’t have a href-attribute or is None, treat link as a url_regex and look it up with find_link(). Any additional arguments specified are forwarded to this function.

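The question mark (among other characters) is a regular-expression (regex) metacharacter, so if you want to pass a URL containing such characters to follow_link/download_link, you need to escape them: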

import re
# ...
browser.download_link(re.escape("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"))
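To see why the unescaped URL cannot match, here is a minimal sketch that uses only the standard `re` module. It assumes the lookup is a regex search against each link's href, as the quoted documentation suggests; the helper name `matches_href` is hypothetical and is not part of MechanicalSoup.

```python
import re

URL = (
    "https://www.ons.gov.uk/generator?format=xls"
    "&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna"
)


def matches_href(pattern: str, href: str) -> bool:
    """Hypothetical helper: report whether a regex pattern is found in an href."""
    return re.search(pattern, href) is not None


# Unescaped: "r?" means "an optional r", so the pattern expects
# "generator" (or "generato") to be followed directly by "format",
# and the literal "?" in the href is never matched.
unescaped_matches = matches_href(URL, URL)

# Escaped: every metacharacter is treated literally.
escaped_matches = matches_href(re.escape(URL), URL)

print(unescaped_matches, escaped_matches)
```

Note that this only covers the pattern itself. If the href on the page is written differently from the pattern (for example as a relative path), an escaped absolute URL would still not be found.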

However, if the first page you visit doesn't contain that direct link, I'm not sure this will help. (Try it first, though.)

Alternatively, you can use the browser's underlying requests session to download the file directly:

resp = browser.session.get("https://www.ons.gov.uk/generator?format=xls&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna")
resp.raise_for_status() # raise an exception for 404, etc.
with open('filename.xls', 'wb') as outf:
    outf.write(resp.content)
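For larger files, one variation is to stream the body in chunks instead of holding all of `resp.content` in memory. The sketch below is an assumption-laden illustration rather than part of the original answer: the helper name `save_stream` is hypothetical, and it only relies on the response exposing `iter_content(chunk_size=...)`, as `requests.Response` does.

```python
def save_stream(resp, path, chunk_size=8192):
    """Hypothetical helper: write a streamed response body to `path`.

    `resp` is assumed to expose iter_content(chunk_size=...), as
    requests.Response does. Returns the number of bytes written.
    """
    written = 0
    with open(path, "wb") as outf:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                written += outf.write(chunk)
    return written


# Usage with the browser from above (needs network access, so commented out):
# resp = browser.session.get(
#     "https://www.ons.gov.uk/generator?format=xls"
#     "&uri=/economy/grossdomesticproductgdp/timeseries/l2kq/qna",
#     stream=True,  # defer downloading the body until it is iterated
# )
# resp.raise_for_status()
# save_stream(resp, "filename.xls")
```

For a small spreadsheet like this one, writing `resp.content` in one go is simpler and perfectly adequate.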

On python - Downloading a file with MechanicalSoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54352162/
