
python-3.x - MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x0000018F09F334A8>'


I get the following error when downloading files with multiprocessing. I am downloading Wikipedia page view dumps, which are published per hour, so this can involve a large number of downloads.

Any suggestions on why this error occurs and how to fix it? Thanks.

MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x0000018F09F334A8>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object",)'


import fnmatch
import requests
import urllib.request
from bs4 import BeautifulSoup
import multiprocessing as mp

def download_it(download_file):
    global path_to_save_document
    filename = download_file[download_file.rfind("/")+1:]
    save_file_w_submission_path = path_to_save_document + filename
    request = urllib.request.Request(download_file)
    response = urllib.request.urlopen(request)
    data_content = response.read()
    with open(save_file_w_submission_path, 'wb') as wf:
        wf.write(data_content)
    print(save_file_w_submission_path)

pattern = r'*200801*'
url_to_download = r'https://dumps.wikimedia.org/other/pagecounts-raw/'
path_to_save_document = r'D:\Users\Jonathan\Desktop\Wikipedia\\'

def main():
    global pattern
    global url_to_download
    r = requests.get(url_to_download)
    data = r.text
    soup = BeautifulSoup(data,features="lxml")

    list_of_href_year = []
    for i in range(2):
        if i == 0:
            for link in soup.find_all('a'):
                lien = link.get('href')
                if len(lien) == 4:
                    list_of_href_year.append(url_to_download + lien + '/')
        elif i == 1:
            list_of_href_months = []
            list_of_href_pageviews = []
            for loh in list_of_href_year:
                r = requests.get(loh)
                data = r.text
                soup = BeautifulSoup(data,features="lxml")
                for link in soup.find_all('a'):
                    lien = link.get('href')
                    if len(lien) == 7:
                        list_of_href_months.append(loh + lien + '/')
            if not list_of_href_months:
                continue
            for lohp in list_of_href_months:
                r = requests.get(lohp)
                data = r.text
                soup = BeautifulSoup(data,features="lxml")
                for link in soup.find_all('a'):
                    lien = link.get('href')
                    if "pagecounts" in lien:
                        list_of_href_pageviews.append(lohp + lien)

    matching_list_of_href = fnmatch.filter(list_of_href_pageviews, pattern)
    matching_list_of_href.sort()
    with mp.Pool(mp.cpu_count()) as p:
        print(p.map(download_it, matching_list_of_href))

if __name__ == '__main__':
    main()
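
The most likely cause of the error: when a download fails inside a worker process, urllib raises an HTTPError that keeps the open HTTP response (an _io.BufferedReader) attached to it; the Pool then tries to pickle that exception to send it back to the parent process, and pickling the reader fails with the MaybeEncodingError shown above. A minimal sketch of one way around it while staying with multiprocessing: catch errors inside the worker and return only plain, picklable values. The function name download_it_safe, the status tuples, and the empty urls list are illustrative, not taken from the question.

import multiprocessing as mp
import urllib.request
import urllib.error

def download_it_safe(download_file):
    '''Worker that never lets an exception (or any unpicklable object) cross the process boundary.'''
    filename = download_file[download_file.rfind("/") + 1:]
    try:
        with urllib.request.urlopen(download_file) as response:
            data_content = response.read()
        with open(filename, 'wb') as wf:
            wf.write(data_content)
        return ('ok', download_file)
    except urllib.error.URLError as e:
        # Return the error as a string; the exception object itself holds an
        # open response stream and cannot be pickled by the pool.
        return ('error', download_file, str(e))

if __name__ == '__main__':
    urls = []  # e.g. the sorted matching_list_of_href built in main()
    with mp.Pool(mp.cpu_count()) as p:
        print(p.map(download_it_safe, urls))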

Best answer

As Darkonaut suggested, I switched to multithreading instead.
Example:

# Modules used by the two methods below
import os
import time
import fnmatch
import urllib.request
import urllib.error
import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool

'''This function is used for the download the files using multi threading'''
def multithread_download_files_func(self, download_file):
    try:
        filename = download_file[download_file.rfind("/")+1:]
        save_file_w_submission_path = self.ptsf + filename
        '''Check if the download doesn't already exists. If not, proceed otherwise skip'''
        if not os.path.exists(save_file_w_submission_path):
            data_content = None
            try:
                '''Lets download the file'''
                request = urllib.request.Request(download_file)
                response = urllib.request.urlopen(request)
                data_content = response.read()
            except urllib.error.HTTPError:
                '''We will do a retry on the download if the server is temporarily unavailable'''
                retries = 1
                success = False
                while not success:
                    try:
                        '''Make another request if the previous one failed'''
                        response = urllib.request.urlopen(download_file)
                        data_content = response.read()
                        success = True
                    except Exception:
                        '''We will make the program wait a bit before sending another request to download the file'''
                        wait = retries * 5
                        time.sleep(wait)
                        retries += 1
            except Exception as e:
                print(str(e))
            '''If the response data is not empty, we will write as a new file and stored in the data lake folder'''
            if data_content:
                with open(save_file_w_submission_path, 'wb') as wf:
                    wf.write(data_content)
                print(self.present_extract_RC_from_RS + filename)
    except Exception as e:
        print('funct multithread_download_files_func' + str(e))

'''This function is used as a wrapper before using multi threading in order to download the files to be stored in the Data Lake'''
def download_files(self, filter_files, url_to_download, path_to_save_file):
    try:
        self.ptsf = path_to_save_file = path_to_save_file + 'Step 1 - Data Lake\Wikipedia Pagecounts\\'
        filter_files_df = filter_files
        self.filter_pattern = filter_files
        self.present_extract_RC_from_RS = 'WK Downloaded-> '

        if filter_files_df == '*':
            '''We will create a string of all the years concatenated together for later use in this program'''
            reddit_years = [2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018]
            filter_files_df = ''
            '''Go through the years from 2005 to 2018'''
            for idx, ry in enumerate(reddit_years):
                filter_files_df += '*' + str(ry) + '*'
                if (idx != len(reddit_years)-1):
                    filter_files_df += '&'

        download_filter = list([x.strip() for x in filter_files_df.split('&')])
        download_filter.sort()

        '''If folder doesn't exist, create one'''
        if not os.path.exists(os.path.dirname(self.ptsf)):
            os.makedirs(os.path.dirname(self.ptsf))

        '''We will get the website HTML elements using beautifulsoup library'''
        r = requests.get(url_to_download)
        data = r.text
        soup = BeautifulSoup(data,features="lxml")

        list_of_href_year = []
        for i in range(2):
            if i == 0:
                '''Lets get all href available on this particular page. The first page is the year page'''
                for link0 in soup.find_all('a'):
                    lien0 = link0.get('href')
                    '''We will check if the length is 4 which corresponds to a year'''
                    if len(lien0) == 4:
                        list_of_href_year.append(url_to_download + lien0 + '/')

            elif i == 1:
                list_of_href_months = []
                list_of_href_pageviews = []
                for loh in list_of_href_year:
                    r1 = requests.get(loh)
                    data1 = r1.text
                    '''Get the webpage HTML Tags'''
                    soup1 = BeautifulSoup(data1,features="lxml")
                    for link1 in soup1.find_all('a'):
                        lien1 = link1.get('href')
                        '''We will check if the length is 7 which corresponds to the year and month'''
                        if len(lien1) == 7:
                            list_of_href_months.append(loh + lien1 + '/')
                for lohm in list_of_href_months:
                    r2 = requests.get(lohm)
                    data2 = r2.text
                    '''Get the webpage HTML Tags'''
                    soup2 = BeautifulSoup(data2,features="lxml")
                    for link2 in soup2.find_all('a'):
                        lien2 = link2.get('href')
                        '''We will now get all href that contains pagecounts in their name. We will have the files based on Time per hour. So 24 hrs is 24 files
                        and per year is 24*365=8760 files in minimum'''
                        if "pagecounts" in lien2:
                            list_of_href_pageviews.append(lohm + lien2)

        existing_file_list = []
        for file in os.listdir(self.ptsf):
            filename = os.fsdecode(file)
            existing_file_list.append(filename)

        '''Filter the links'''
        matching_fnmatch_list = []
        if filter_files != '':
            for dfilter in download_filter:
                fnmatch_list = fnmatch.filter(list_of_href_pageviews, dfilter)
                i = 0
                for fnl in fnmatch_list:
                    '''Break for demo purpose only'''
                    if self.limit_record != 0:
                        if (i == self.limit_record) and (i != 0):
                            break
                    i += 1
                    matching_fnmatch_list.append(fnl)

        '''If the user stated a filter, we will try to remove the files which are outside that filter in the list'''
        to_remove = []
        for efl in existing_file_list:
            for mloh in matching_fnmatch_list:
                if efl in mloh:
                    to_remove.append(mloh)

        '''Lets remove the files which has been found outside the filter'''
        for tr in to_remove:
            matching_fnmatch_list.remove(tr)

        matching_fnmatch_list.sort()

        '''Multi Threading of 200'''
        p = ThreadPool(200)
        p.map(self.multithread_download_files_func, matching_fnmatch_list)
    except Exception as e:
        print('funct download_files' + str(e))
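
For context on why switching to multiprocessing.dummy makes the error disappear: its Pool runs the workers as threads inside the same process, so results and exceptions are never pickled and nothing like a BufferedReader ever has to be serialized. Below is a minimal standalone sketch of that threaded approach; the fetch helper, the pool size of 20, and the empty urls list are illustrative placeholders, not part of the answer above.

import os
import urllib.request
from multiprocessing.dummy import Pool as ThreadPool

def fetch(url):
    '''Download one pagecounts file into the current directory, skipping files that already exist.'''
    filename = url[url.rfind("/") + 1:]
    if os.path.exists(filename):
        return filename
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(filename, 'wb') as wf:
        wf.write(data)
    return filename

if __name__ == '__main__':
    urls = []  # e.g. the matching_fnmatch_list produced by download_files()
    # Threads share memory, so nothing is pickled between the pool and its workers.
    with ThreadPool(20) as pool:
        print(pool.map(fetch, urls))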

Regarding python-3.x - MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x0000018F09F334A8>', a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55131894/
