
python - Pydoop gets stuck on readline for HDFS files

Reposted. Author: 太空狗. Updated: 2023-10-29 21:52:38

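I am reading the first line of every file in a directory. Locally this works fine, but on EMR the test fails: it gets stuck at around the 200th to 300th file. In addition, ps -eLF shows the number of child threads growing to 3000, even though the output has only reached about the 200th file.

Is this some EMR bug related to a maximum number of bytes read? The pydoop version is pydoop==0.12.0.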

import os
import sys
import shutil
import codecs
import pydoop.hdfs as hdfs


def prepare_data(hdfs_folder):
    folder = "test_folder"
    copies_count = 700
    src_file = "file"

    # 1) create a local folder, replacing any previous one
    if os.path.exists(folder):
        shutil.rmtree(folder)
    os.makedirs(folder)

    # 2) create copies_count copies of src_file in the folder
    for x in range(0, copies_count):
        shutil.copyfile(src_file, folder + "/" + src_file + "_" + str(x))

    # 3) copy the folder to HDFS
    # hadoop fs -copyFromLocal test_folder/ /maaz/test_aa
    remove_command = "hadoop fs -rmr " + hdfs_folder
    print(remove_command)
    os.system(remove_command)
    command = "hadoop fs -copyFromLocal " + folder + " " + hdfs_folder
    print(command)
    os.system(command)


def main(hdfs_folder):
    try:
        conn_hdfs = hdfs.fs.hdfs()
        if conn_hdfs.exists(hdfs_folder):
            items_list = conn_hdfs.list_directory(hdfs_folder)
            for item in items_list:
                if not item["kind"] == "file":
                    continue
                file_name = item["name"]
                print("validating file : %s" % file_name)

                file_handle = None
                try:
                    file_handle = conn_hdfs.open_file(file_name)
                    file_line = file_handle.readline()
                    print(file_line)
                except Exception as exp:
                    print("####Exception '%s' in reading file %s" % (str(exp), file_name))
                finally:
                    # close only if open_file actually returned a handle
                    if file_handle is not None:
                        file_handle.close()

        conn_hdfs.close()

    except Exception as e:
        print("####Exception '%s' in validating files!" % str(e))


if __name__ == '__main__':
    hdfs_path = '/abc/xyz'
    prepare_data(hdfs_path)
    main(hdfs_path)

Best answer

I suggest using the subprocess module to read the first line, instead of pydoop's conn_hdfs.open_file:

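import subprocess

# This runs inside the "for item in items_list" loop, in place of open_file/readline.
cmd = 'hadoop fs -cat {f} | head -1'.format(f=file_name)
process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
if not stderr:
    # nothing was written to stderr: keep the first line of the output
    file_line = stdout.split('\n')[0]
else:
    print("####Exception '{e}' in reading file {f}".format(f=file_name, e=stderr))
    continue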
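
The same idea can also be sketched without a shell pipeline, which avoids quoting problems when a file name contains spaces or shell metacharacters. This is only a sketch under assumptions: it is written for Python 3, the helper names first_line and hdfs_first_line are hypothetical, and stderr is discarded rather than inspected. It reads one line from the child process and then stops it, so the rest of the file is not streamed.

```python
import subprocess
import sys


def first_line(argv):
    """Run argv (a list, no shell) and return the first line of its stdout."""
    process = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # assumption: error output is not needed
    )
    try:
        line = process.stdout.readline()
    finally:
        process.stdout.close()
        # Stop the child if it is still producing output, then reap it.
        if process.poll() is None:
            process.kill()
        process.wait()
    return line.decode("utf-8", "replace").rstrip("\r\n")


def hdfs_first_line(path):
    """Hypothetical wrapper: first line of an HDFS file via 'hadoop fs -cat'."""
    return first_line(["hadoop", "fs", "-cat", path])


if __name__ == "__main__":
    # Demo with a local command standing in for 'hadoop fs -cat <path>'.
    print(first_line([sys.executable, "-c", "print('first'); print('second')"]))
```

In the loop from the question, file_line = hdfs_first_line(file_name) would take the place of the open_file/readline block. An empty string is returned both for an empty file and when the command fails, so a caller that needs to tell these apart would have to check the exit status as well.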

Regarding "python - Pydoop gets stuck on readline for HDFS files", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28692535/
