I want to be able to retrieve a pdf from a public Box link with Python, but I'm not quite sure how to do this. Here is an example of the kind of pdf I'd like to download: https://fnn.app.box.com/s/ho73v0idqauzda1r477kj8g8okh72lje
I can click the download button, or click a button to get a printable link in the browser, but I can't find that link anywhere in the page's source html. Is there a way to find this link programmatically? Perhaps via selenium or requests, or even the Box API?
Any help would be much appreciated!
Best answer
Here is the code to get the pdf download link:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def get_html(url, timeout=15):
    '''Returns the html of url.
    Usually html = urlopen(url) is enough, but sometimes it doesn't work.
    Instead of urllib.request you can use any other method to get the
    html code of a url, like urllib or urllib2 (just search it online),
    but urllib.request comes with the Python installation.'''
    html = ''
    try:
        html = urlopen(url, None, timeout)
    except:
        url = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        try:
            html = urlopen(url, None, timeout)
        except:
            pass
    return html

def get_soup(html):
    '''Returns the soup of the html code.
    Beautiful Soup is a Python library for pulling data out of HTML
    and XML files. It works with your favorite parser to provide idiomatic
    ways of navigating, searching, and modifying the parse tree. It
    commonly saves programmers hours or days of work.
    More at https://www.crummy.com/software/BeautifulSoup/bs4/doc/'''
    soup = BeautifulSoup(html, "lxml")
    ## if "lxml" doesn't work, you can use any of these options instead:
    ## soup = BeautifulSoup(html, "html.parser")
    ## soup = BeautifulSoup(html, "lxml-xml")
    ## soup = BeautifulSoup(html, "xml")
    ## soup = BeautifulSoup(markup, "html5lib")
    return soup

def get_data_file_id(html):
    '''Returns the data-file-id found in the html code.'''
    ## to scrape the website I suggest using BeautifulSoup;
    ## you could also do it manually with html.read(), which gives you the
    ## html code as a string, and then do some string searching
    soup = get_soup(html)
    ## the part of the html code we are interested in is:
    ## <div class="preview" data-module="preview" data-file-id="69950302561" data-file-version-id="">
    ## we want to extract this data-file-id
    ## first we find the div it is located in
    classifier = {"class": 'preview'}   ## classifier specifies the div we are looking for
    div = soup.find('div', classifier)  ## we get the div which has class 'preview'
    ## now we can easily get the data-file-id by using
    data_file_id = div.get('data-file-id')
    return data_file_id

## you can install BeautifulSoup from:
## on Windows http://www.lfd.uci.edu/~gohlke/pythonlibs/
## or from https://pypi.python.org/pypi/beautifulsoup4/4.4.1
## the official page is https://www.crummy.com/software/BeautifulSoup/
## if you don't want to use BeautifulSoup, you would do something like this:
##
## html_str = str(html.read())
## search_for = 'div class="preview" data-module="preview" data-file-id="'
## start = html_str.find(search_for) + len(search_for)
## end = html_str.find('"', start)
## data_file_id = html_str[start : end]
##
## it may seem easier than using BeautifulSoup, but the problem is that
## if there is one extra space in search_for, or the order of the div attributes
## is different, or the sign " is used instead of ' (or vice versa), this string
## searching won't work while BeautifulSoup still will, so I recommend BeautifulSoup

def get_url_id(url):
    '''Returns the url_id, which is the last part of the url.'''
    reverse_url = url[::-1]
    start = len(url) - reverse_url.find('/')  # start is the position right after the last '/' in url
    url_id = url[start:]
    return url_id

def get_download_url(url_id, data_file_id):
    '''Returns the download_url.'''
    start = 'https://fnn.app.box.com/index.php?rm=box_download_shared_file&shared_name='
    download_url = start + url_id + '&file_id=f_' + data_file_id
    return download_url

## url = 'https://fnn.app.box.com/s/ho73v0idqauzda1r477kj8g8okh72lje'  ## the example from the question
url = 'https://fnn.app.box.com/s/n74mnmrwyrmtiooqwppqjkrd1hhf3t3j'
html = get_html(url)
data_file_id = get_data_file_id(html)  ## we need data_file_id to create the download url
url_id = get_url_id(url)               ## we need url_id to create the download url
download_url = get_download_url(url_id, data_file_id)
## this actually isn't the real download url
## you can get the real url by using:
## real_download_url = get_html(download_url).geturl()
## but you will get a really long url; for your example it would be
## https://dl.boxcloud.com/d/1/4vx9ZWYeeQikW0KHUuO4okRjjQv3t6VGFTbMkh7weWQQc_tInOFR_1L_FuqVFovLqiycOLUDHu4o2U5EdZjmwnSmVuByY5DhpjmmdlizjaVjk6RMBbLcVhSt0ewtusDNL5tA8aiUKD1iIDlWCnXHJlcVzBc4aH3BXIEU65Ki1KdfZIlG7_jl8wuwP4MQG_yFw2sLWVDZbddJ50NLo2ElBthxy4EMSJ1auyvAWOp6ai2S4WPdqUDZ04PjOeCxQhvo3ufkt3og6Uw_s6oVVPryPUO3Pb2M4-Li5x9Cki882-WzjWUkBAPJwscVxTbDbu1b8GrR9P-5lv2I_DC4uPPamXb07f3Kp2kSJDVyy9rKbs16ATF3Wi2pOMMszMm0DVSg9SFfC6CCI0ISrkXZjEhWa_HIBuv_ptfQUUdJOMm9RmteDTstW37WgCCjT2Z22eFAfXVsFTOZBiaFVmicVAFkpB7QHyVkrfxdqpCcySEmt-KOxyjQOykx1HiC_WB2-aEFtEkCBHPX8BsG7tm10KRbSwzeGbp5YN1TJLxNlDzYZ1wVIKcD7AeoAzTjq0Brr8du0Vf67laJLuBVcZKBUhFNYM54UuOgL9USQDj8hpl5ew-W__VqYuOnAFOS18KVUTDsLODYcgLMzAylYg5pp-2IF1ipPXlbBOJgwNpYgUY0Bmnl6HaorNaRpmLVQflhs0h6wAXc7DqSNHhSnq5I_YbiQxM3pV8K8IWvpejYy3xKED5PM9HR_Sr1dnO0HtaL5PgfKcuiRCdCJjpk766LO0iNiRSWKHQ9lmdgA-AUHbQMMywLvW71rhIEea_jQ84elZdK1tK19zqPAAJ0sgT7LwdKCsT781sA90R4sRU07H825R5I3O1ygrdD-3pPArMf9bfrYyVmiZfI_yE_XiQ0OMXV9y13daMh65XkwETMAgWYwhs6RoTo3Kaa57hJjFT111lQVhjmLQF9AeqwXb0AB-Hu2AhN7tmvryRm7N2YLu6IMGLipsabJQnmp3mWqULh18gerlve9ZsOj0UyjsfGD4I0I6OhoOILsgI1k0yn8QEaVusHnKgXAtmi_JwXLN2hnP9YP20WjBLJ/download
## and we don't really care about the real download url, so I will just use download_url
print(download_url)
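If you'd rather avoid the BeautifulSoup dependency, the same data-file-id extraction can be sketched with the standard library's html.parser module (a minimal sketch, not part of the answer's code; FileIdParser is a hypothetical helper name, and the sample snippet is the one quoted in the answer's comments):

```python
from html.parser import HTMLParser

class FileIdParser(HTMLParser):
    '''Collects the data-file-id attribute of the <div class="preview"> element.
    Assumes the class attribute is exactly "preview", as in the quoted snippet.'''
    def __init__(self):
        super().__init__()
        self.file_id = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if tag == 'div' and attrs.get('class') == 'preview':
            self.file_id = attrs.get('data-file-id')

sample = '<div class="preview" data-module="preview" data-file-id="69950302561" data-file-version-id=""></div>'
parser = FileIdParser()
parser.feed(sample)
print(parser.file_id)  # 69950302561
```

Unlike the manual string search discussed above, this still works if attribute order or whitespace changes, though a real Box page may use several classes on that div, in which case the equality check would need loosening.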
I also wrote code that downloads the pdf:
from urllib.request import Request, urlopen

def get_html(url, timeout=15):
    '''Returns the html of url.
    Usually html = urlopen(url) is enough, but sometimes it doesn't work.
    Instead of urllib.request you can use any other method to get the
    html code of a url, like urllib or urllib2 (just search it online),
    but urllib.request comes with the Python installation.'''
    html = ''
    try:
        html = urlopen(url, None, timeout)
    except:
        url = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        try:
            html = urlopen(url, None, timeout)
        except:
            pass
    return html

def get_current_path():
    '''Returns the path of the folder in which the python program is saved.'''
    try:
        path = __file__
    except:
        try:
            import sys
            path = sys.argv[0]
        except:
            path = ''
    if path:
        if '\\' in path:
            path = path.replace('\\', '/')
        end = len(path) - path[::-1].find('/')
        path = path[:end]
    return path

def check_if_name_already_exists(name, path):
    '''Checks whether a pdf file with the same name already exists
    in the folder given by path.'''
    try:
        file = open(path + name + '.pdf', 'r')
        file.close()
        return True
    except:
        return False

def get_new_name(old_name, path):
    '''Asks the user to enter a new name for the file and returns it.'''
    print('File with name "{}" already exists.'.format(old_name))
    answer = input('Would you like to replace it (answer with "r")\nor create a new one (answer with "n")? ')
    while answer.lower() not in ('r', 'n'):  ## also rejects empty input
        print('Your answer is inconclusive.')
        print('Please answer again:')
        print('if you would like to replace the existing file, answer with "r"')
        print('if you would like to create a new one, answer with "n"')
        answer = input('Would you like to replace it (answer with "r")\nor create a new one (answer with "n")? ')
    if answer.lower() == 'n':
        new_name = input('Enter a new name for the file: ')
        if check_if_name_already_exists(new_name, path):
            return get_new_name(new_name, path)
        else:
            return new_name
    return old_name  ## answer was 'r': replace the existing file

def download_pdf(url, name='document1', path=None):
    '''Downloads a pdf file from its url.
    The required argument is the url of the pdf file;
    the optional arguments are a name for the saved pdf file
    and a path if you want to choose where the file is saved.
    The path must look like:
    'C:\\Users\\Computer name\\Desktop' or
    'C:/Users/Computer name/Desktop' '''
    # and not like
    # 'C:\Users\Computer name\Desktop'
    pdf = get_html(url)
    name = name.replace('.pdf', '')
    if path is None:
        path = get_current_path()
    if '\\' in path:
        path = path.replace('\\', '/')
    if path and path[-1] != '/':
        path += '/'
    if path:
        if check_if_name_already_exists(name, path):
            if name == 'document1':
                i = 2
                name = 'document' + str(i)
                while check_if_name_already_exists(name, path):
                    i += 1
                    name = 'document' + str(i)
            else:
                name = get_new_name(name, path)
        file = open(path + name + '.pdf', 'wb')
    else:
        file = open(name + '.pdf', 'wb')
    file.write(pdf.read())
    file.close()
    if path:
        print(name + '.pdf file downloaded into folder "{}".'.format(path))
    else:
        print(name + '.pdf file downloaded.')

download_url = 'https://fnn.app.box.com/index.php?rm=box_download_shared_file&shared_name=n74mnmrwyrmtiooqwppqjkrd1hhf3t3j&file_id=f_53868474893'
download_pdf(download_url)
Hope this helps; let me know if it works.
Regarding python - downloading from a public Box link, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38043729/