Here is an example of my dataframe
id pdf
1 https://ia802902.us.archive.org/10/items/EL103_L_1978_03_024_01_1_PF_03/EL103_L_1978_03_024_01_1_PF_03.pdf
2 https://ia801900.us.archive.org/31/items/EL103_L_1978_03_033_07_1_PF_05/EL103_L_1978_03_033_07_1_PF_05.pdf
3 https://ia802900.us.archive.org/35/items/EL105_L_1978_03_072_03_1_PF_05/EL105_L_1978_03_072_03_1_PF_05.pdf
I want to download every PDF listed in the column df["pdf"]. I tried the following code (source: https://www.geeksforgeeks.org/downloading-pdfs-with-python-using-requests-and-beautifulsoup/):
import requests
from bs4 import BeautifulSoup

for url in df["pdf"]:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a')
    i = 0
    for link in links:
        if ('.pdf' in link.get('href', [])):
            i += 1
            print("Downloading file: ", i)
            response = requests.get(link.get('href'))
            pdf = open("C:/myfolder"+str(i)+".pdf", 'wb')
            pdf.write(response.content)
            pdf.close()
            print("File ", i, " downloaded")
It starts running but it does not download any file. I would like to keep the original name of the pdf (for example: EL103_L_1978_03_024_01_1_PF_03.pdf). Any suggestion?
You can use this example to download the PDFs:
import requests

for pdf_url in df["pdf"]:
    file_name = pdf_url.split("/")[-1]
    with open(file_name, "wb") as f_out:
        print("Downloading", pdf_url)
        f_out.write(requests.get(pdf_url).content)
Prints:
Downloading https://ia802902.us.archive.org/10/items/EL103_L_1978_03_024_01_1_PF_03/EL103_L_1978_03_024_01_1_PF_03.pdf
Downloading https://ia801900.us.archive.org/31/items/EL103_L_1978_03_033_07_1_PF_05/EL103_L_1978_03_033_07_1_PF_05.pdf
Downloading https://ia802900.us.archive.org/35/items/EL105_L_1978_03_072_03_1_PF_05/EL105_L_1978_03_072_03_1_PF_05.pdf
and saves them as:
andrej@MyPC:~/app$ ls -alF *pdf
-rw-r--r-- 1 root root 792942 sep 10 22:54 EL103_L_1978_03_024_01_1_PF_03.pdf
-rw-r--r-- 1 root root 559170 sep 10 22:54 EL103_L_1978_03_033_07_1_PF_05.pdf
-rw-r--r-- 1 root root 935443 sep 10 22:54 EL105_L_1978_03_072_03_1_PF_05.pdf
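The code in the question most likely downloaded nothing because each URL in df["pdf"] already points at the PDF itself, so there is no HTML page with <a> tags for BeautifulSoup to find. If you want something a little more defensive than the loop above, here is a minimal sketch. The helper names filename_from_url and download_pdf, the 30-second timeout and the 8192-byte chunk size are my own choices for illustration, not part of the original answer.

```python
import os
from urllib.parse import unquote, urlsplit

import requests


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return os.path.basename(unquote(urlsplit(url).path))


def download_pdf(url, timeout=30):
    """Stream one PDF into the current directory and return its file name.

    The timeout value is an arbitrary assumption, adjust it as needed.
    """
    file_name = filename_from_url(url)
    with requests.get(url, stream=True, timeout=timeout) as response:
        # Raise on 4xx/5xx so an error page is never saved as a .pdf file.
        response.raise_for_status()
        with open(file_name, "wb") as f_out:
            # Write in chunks so large files are not held fully in memory.
            for chunk in response.iter_content(chunk_size=8192):
                f_out.write(chunk)
    return file_name
```

With the dataframe from the question you would call it as `for pdf_url in df["pdf"]: download_pdf(pdf_url)`.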
Thank you. I just made one small change: with open("direction of folder"+file_name, "wb") as f_out
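One thing to watch when prefixing a folder with plain string concatenation: if the folder string does not end with a separator, the folder name and the file name get glued together (the "C:/myfolder"+str(i) line in the question has this problem). A small sketch using pathlib avoids that. The helper name target_path is hypothetical, introduced here only for illustration.

```python
from pathlib import Path


def target_path(folder, pdf_url):
    """Join a folder with the file name taken from the end of the URL."""
    # Path's "/" operator inserts the separator, so the folder string
    # does not need a trailing slash.
    return Path(folder) / pdf_url.split("/")[-1]
```

The result can be passed straight to open(), for example `with open(target_path(folder, pdf_url), "wb") as f_out:`, where folder is whatever directory you want to save into. The directory has to exist first; `Path(folder).mkdir(parents=True, exist_ok=True)` creates it if needed.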