
python - 503 error when downloading Wikipedia dumps


I have the following script to download (and later process) Wikipedia pageview dumps. I get a 503 error for every page, even though the URLs are correct.

import argparse
import aiohttp
import asyncio
import async_timeout
import re

base_url = "http://dumps.wikimedia.org/other/pagecounts-raw/{year}/{year}-{month:02d}/pagecounts-{year}{month:02d}{day:02d}-{hour:02d}0000.gz"

async def downloadFile(semaphore, session, url):
    try:
        async with semaphore:
            with async_timeout.timeout(10):
                async with session.get(url) as remotefile:
                    if remotefile.status == 200:
                        data = await remotefile.read()
                        outfile = re.sub("/", "_", url[7:])
                        with open(outfile, 'wb') as fp:
                            print('Saving')
                            fp.write(data)
                    else:
                        print(remotefile.status)
                        return
    except Exception as e:
        print(e)
        return

async def aux(urls):
    sem = asyncio.Semaphore(10)
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            print(url)
            task = asyncio.ensure_future(downloadFile(sem, session, url))
            tasks.append(task)
        await asyncio.gather(*tasks)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--year", type=int, default=2016)
    parser.add_argument("--month", type=int, default=4)
    parser.add_argument("--temp_folder", type=str)
    args = parser.parse_args()

    urls = []

    for day in range(1, 32)[:3]:
        for hour in range(24)[:2]:
            urls.append(base_url.format(
                year=args.year, month=args.month, day=day, hour=hour))

    loop = asyncio.get_event_loop()
    asyncio.ensure_future(aux(urls))
    loop.run_until_complete(aux(urls))


if __name__ == "__main__":
    main()

The error I get is:

<ClientResponse(https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-04/pagecounts-20160402-000000.gz) [503 Service Temporarily Unavailable]>
<CIMultiDictProxy('Server': 'nginx/1.13.6', 'Date': 'Wed, 24 Oct 2018 21:27:58 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '213', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=106384710; includeSubDomains; preload')>

But this is really strange, because copy-pasting the same URL into my Chrome browser downloads the file just fine!
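
A quick way to compare with the browser is to fetch a single URL and print the status and headers the server returns. A minimal probe along these lines (not from the original post; the probe helper and the sample date in the URL are just for illustration, assuming the same aiohttp stack) shows whether the 503 comes back even for one isolated request:

import asyncio
import aiohttp

async def probe(url):
    # Fetch a single URL and print what the server answers; the body is not saved.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            print(resp.status)
            print(resp.headers)

loop = asyncio.get_event_loop()
loop.run_until_complete(probe(
    "https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-04/pagecounts-20160401-000000.gz"))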

Best Answer

I played around with the code, and I can say the following:

  • Wikipedia does not allow more than one simultaneous request per IP
  • a timeout of 10 seconds is too low for these URLs

To make your code work:

  • change asyncio.Semaphore(10) to asyncio.Semaphore(1)
  • change async_timeout.timeout(10) to async_timeout.timeout(120)
  • remove the asyncio.ensure_future(aux(urls)) line entirely; you don't need it because you already pass aux(urls) to run_until_complete, and keeping it schedules the whole download a second time, so every file is requested twice at once

The final version, which successfully downloads a single archive:

import argparse
import aiohttp
import asyncio
import async_timeout
import re

base_url = "http://dumps.wikimedia.org/other/pagecounts-raw/{year}/{year}-{month:02d}/pagecounts-{year}{month:02d}{day:02d}-{hour:02d}0000.gz"

async def downloadFile(semaphore, session, url):
    try:
        async with semaphore:
            with async_timeout.timeout(120):  # was 10: too low for these files
                # ssl=False disables certificate verification; the http:// URL
                # redirects to https://dumps.wikimedia.org
                async with session.get(url, ssl=False) as remotefile:
                    if remotefile.status == 200:
                        data = await remotefile.read()
                        outfile = re.sub("/", "_", url[7:])  # strip "http://", flatten path
                        with open(outfile, 'wb') as fp:
                            print('Saving')
                            fp.write(data)
                    else:
                        print('status:', remotefile.status)
                        return
    except Exception as e:
        print('exception:', type(e), str(e))
        return

async def aux(urls):
    sem = asyncio.Semaphore(1)  # was 10: one request at a time per IP
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            print('url:', url)
            task = asyncio.ensure_future(downloadFile(sem, session, url))
            tasks.append(task)
        await asyncio.gather(*tasks)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--year", type=int, default=2016)
    parser.add_argument("--month", type=int, default=4)
    parser.add_argument("--temp_folder", type=str)
    args = parser.parse_args()

    urls = []

    for day in range(1, 32)[:1]:      # only the first day ...
        for hour in range(24)[:1]:    # ... and the first hour: a single archive
            urls.append(base_url.format(
                year=args.year, month=args.month, day=day, hour=hour))

    loop = asyncio.get_event_loop()
    loop.run_until_complete(aux(urls))  # the duplicate ensure_future call is gone


if __name__ == "__main__":
    main()
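
Since the 503 means the server is throttling the client, another option besides lowering concurrency is to back off and retry throttled requests. A hedged sketch of that idea (not part of the accepted answer; fetch_with_retry and its retry parameters are hypothetical) could look like this:

import asyncio
import aiohttp

async def fetch_with_retry(session, url, retries=5):
    # Hypothetical helper: back off exponentially when the server answers 503.
    delay = 1
    for _ in range(retries):
        async with session.get(url) as resp:
            if resp.status == 200:
                return await resp.read()
            if resp.status == 503:
                # Throttled; wait, then try again with a doubled delay.
                await asyncio.sleep(delay)
                delay *= 2
                continue
            resp.raise_for_status()
    raise RuntimeError("still throttled after {} attempts: {}".format(retries, url))

Combined with Semaphore(1), this keeps a long batch running even if the server rejects an occasional request.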

Regarding python - 503 error when downloading Wikipedia dumps, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52978264/
