
python - Making parallel requests from a list comprehension


import requests
import time
from lxml import html

def parse_site():
    return str(memoryview(''.join([f'---! {link.text_content()} !---\n{parse_fandom(link.xpath(".//a/@href")[0])}\n' for link in
        html.fromstring(requests.get('https://archiveofourown.org/media').content).xpath('//*[@class="actions"]')]).encode('utf-8'))[:-1], 'utf-8')

def parse_fandom(url):
    return ''.join([' '.join(f'{item.text_content()} |—| {item.xpath(".//a/@href")[0]}'.split()) + '\n' for item in
        html.fromstring(requests.get(f'https://archiveofourown.org{url}').content).xpath('//*[contains(@class, "tags")]//li')])

if __name__ == '__main__':
    start_time = time.time()
    with open('test.txt', 'w+', encoding='utf-8') as f:
        f.write(parse_site())
    print("--- %s seconds ---" % (time.time() - start_time))
I'm web scraping this site for fan statistics, but connecting to the site with requests.get() can take 1-3 seconds, slowing the whole program down to 18-22 seconds. Somehow I want to make these requests on parallel threads, but modules like grequests require an allocated pool to do so, and I haven't figured out a way to create such a pool inside a list comprehension.
The order of the list doesn't matter to me, as long as there's a hierarchy between each category (parsed in parse_site()) and its child links (parse_fandom(url)). What I want to do is something like:
[parallel_parse_fandom(url), parallel_parse_fandom(url2), parallel_parse_fandom(url3)]

[<All links within this fandom>, parallel_parse_fandom(url2), <All links within this fandom>]

return [<All links within this fandom>, <All links within this fandom>, <All links within this fandom>]
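A minimal sketch of one way to get that shape (my suggestion, not code from the original post), assuming the parse_fandom above: executor.map runs the requests on a thread pool but yields results in input order, so each category stays paired with its child links. parallel_parse is a hypothetical helper name.

from concurrent.futures import ThreadPoolExecutor

def parallel_parse(urls, workers=12):
    # hypothetical helper: run parse_fandom over a thread pool;
    # executor.map returns results in the same order as `urls`
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(parse_fandom, urls))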
Solution based on @Aditya's answer:
import requests
import time
from lxml import html
from concurrent.futures import ThreadPoolExecutor, as_completed

def parse_site():
    with ThreadPoolExecutor(max_workers=12) as executor:
        results = []
        for result in as_completed([executor.submit(parse_fandom, url) for url in [[link.text_content(), link.xpath(".//a/@href")[0]] for link in
                html.fromstring(requests.get('https://archiveofourown.org/media').content).xpath('//*[@class="actions"]')]]):
            results.append(result)
        return str(memoryview(''.join(item.result() for item in results).encode('utf-8'))[:-1], 'utf-8')

def parse_fandom(data):
    return f'---! {data[0]} !---\n' + ''.join([' '.join(f'{item.text_content()} |—| {item.xpath(".//a/@href")[0]}'.split()) + '\n' for item in
        html.fromstring(requests.get(f'https://archiveofourown.org{data[1]}').content).xpath('//*[contains(@class, "tags")]//li')])

if __name__ == '__main__':
    with open('test.txt', 'w', encoding='utf-8') as f:
        f.write(parse_site())
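A small aside (my observation, not part of the question): as_completed yields futures in completion order, which is fine here because order doesn't matter to the asker. Also, the encode/memoryview round-trip only strips the trailing newline byte, so the last line of parse_site() could simply be:

return ''.join(item.result() for item in results)[:-1]  # drop the trailing '\n' directly on the str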

Best answer

You can try the approach below; it easily lets you make a huge number of requests in parallel, provided the server can handle it as well:

# it's just a wrapper around concurrent.futures ThreadPoolExecutor with a nice tqdm progress bar!
from tqdm.contrib.concurrent import thread_map

def chunk_list(lst, size):
    """
    From SO;
    Yield successive size-sized chunks from a list.
    """
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

for idx, my_chunk in enumerate(chunk_list(huge_list, size=2**12)):
    for response in thread_map(<which_func_to_call>, my_chunk, max_workers=your_cpu_cores + 6):
        # which_func_to_call -> wrap the returned response json obj in this, etc.
        # do something with the response now..
        # make sure to cache the chunk results as well
        ...
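For concreteness, a hypothetical usage sketch of thread_map (fetch and the httpbin URLs are stand-ins, not names from the answer above):

import requests
from tqdm.contrib.concurrent import thread_map

def fetch(url):
    # one GET per worker thread; thread_map drives the pool and the progress bar
    return requests.get(url, timeout=10).text

urls = [f'https://httpbin.org/get?i={i}' for i in range(32)]  # placeholder URLs
pages = thread_map(fetch, urls, max_workers=12)  # 32 response bodies, in input order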

On "python - Making parallel requests from a list comprehension", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/62559893/
