
python - Saving a gzip file inside a function applied to an RDD

Reposted. Author: 行者123. Updated: 2023-12-02 20:49:42

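I want to download a batch of gzip files in a distributed fashion. I built a list of all the file URLs and parallelized it with Spark. Using map on this RDD, I download each file. I then want to save it to HDFS so that I can reopen it and upload it to Amazon S3 with the boto library.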

As an example, here is my code. I am only trying to download the file and save it in my HDFS directory, but I get an error that comes from the path.

try:
    # For Python 3.0 and later
    from urllib.request import urlopen
except ImportError:
    # Fall back to Python 2's urllib2
    from urllib2 import urlopen

import gzip
import io


def dowload_and_save(x):
    response = urlopen(x)

    # Buffer the compressed bytes in memory (io.BytesIO works on Python 2 and 3)
    compressedFile = io.BytesIO()
    compressedFile.write(response.read())

    compressedFile.seek(0)

    decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
    # This is the line that fails: open() expects a local filesystem path,
    # not a WebHDFS URL
    with open('http://localhost:50070/webhdfs/user/root/ruben', 'w') as outfile:
        outfile.write(decompressedFile.read())


url_lists = [
    'https://dumps.wikimedia.org/other/pagecounts-raw/2007/2007-12/pagecounts-20071209-190000.gz',
    'https://dumps.wikimedia.org/other/pagecounts-raw/2007/2007-12/pagecounts-20071209-200000.gz',
]

url_lists_rdd = sc.parallelize(url_lists)

url_lists_rdd.map(dowload_and_save)
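
The path error comes from handing a WebHDFS URL to the built-in open(), which only understands local filesystem paths. The in-memory decompression step, by contrast, can be checked on its own. The following is a minimal sketch of that step, assuming the payload is already held as bytes; the helper name decompress_gzip_bytes and the sample record are hypothetical and chosen only for illustration.

```python
import gzip
import io


def decompress_gzip_bytes(payload):
    """Decompress gzip-compressed bytes entirely in memory (hypothetical helper)."""
    # io.BytesIO holds binary data on both Python 2 and 3,
    # unlike StringIO.StringIO, which exists only on Python 2.
    with gzip.GzipFile(fileobj=io.BytesIO(payload), mode='rb') as handle:
        return handle.read()


# Build a small gzip payload locally so the sketch needs no network access.
original = b"en Main_Page 42 1024\n"
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode='wb') as handle:
    handle.write(original)

restored = decompress_gzip_bytes(buffer.getvalue())
```

In the Spark job, the bytes returned by response.read() would take the place of buffer.getvalue().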

Best answer

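I found a solution: skip HDFS and upload each downloaded file to S3 directly from inside the mapped function.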

import boto
from boto.s3.key import Key
import requests
import os

os.environ['S3_USE_SIGV4'] = 'True'


def dowload_and_save(x):
    bucket_name = 'magnet-fwm'

    # Fill in your AWS credentials
    access_key = ''
    secret = ''

    r = requests.get(x)

    c = boto.connect_s3(access_key, secret, host='s3-eu-west-1.amazonaws.com')
    b = c.get_bucket(bucket_name, validate=False)

    if r.status_code == 200:
        # Upload the file; name the key after the URL's file name
        # so that each URL gets its own object in the bucket
        k = Key(b)
        k.key = os.path.basename(x)
        k.content_type = r.headers['content-type']
        k.set_contents_from_string(r.content)
    return 'a'


urls = [
    'https://dumps.wikimedia.org/other/pagecounts-raw/2007/2007-12/pagecounts-20071209-180000.gz',
    'https://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-01/pagecounts-20080101-050000.gz',
]

url_lists_rdd = sc.parallelize(urls)

# map() is lazy; collect() is the action that triggers the downloads and uploads
a = url_lists_rdd.map(dowload_and_save).collect()
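
The flow of the mapped function (fetch a URL, pick an object key, store the bytes) can be tried without Spark, S3, or network access. This is only a sketch under stated assumptions: transfer, key_name_from_url, fake_fetch, and fake_put are hypothetical names that are not part of the answer, and the two callables stand in for requests.get and the boto upload.

```python
import posixpath

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def key_name_from_url(url):
    """Use the last path segment of the URL as the object key (hypothetical helper)."""
    return posixpath.basename(urlparse(url).path)


def transfer(url, fetch, put):
    """Fetch one URL and store its body under a key derived from the URL.

    fetch(url) must return (status_code, body_bytes); put(key, body) stores
    the bytes. Both callables are injected stand-ins, assumed for illustration.
    """
    status, body = fetch(url)
    if status != 200:
        return (url, None, False)
    key = key_name_from_url(url)
    put(key, body)
    return (url, key, True)


# Offline stand-ins: an in-memory dict plays the role of the bucket.
stored = {}


def fake_fetch(url):
    return (200, b"payload for " + url.encode("ascii"))


def fake_put(key, body):
    stored[key] = body


urls = [
    'https://dumps.wikimedia.org/other/pagecounts-raw/2007/2007-12/pagecounts-20071209-180000.gz',
    'https://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-01/pagecounts-20080101-050000.gz',
]
results = [transfer(u, fake_fetch, fake_put) for u in urls]
```

Returning a (url, key, ok) tuple instead of a constant would let the driver see which transfers succeeded once the RDD is collected.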

For "python - Saving a gzip file inside a function applied to an RDD", the original question is on Stack Overflow: https://stackoverflow.com/questions/46399999/
