
python - Pydoop on Amazon EMR

Reposted · Author: 太空狗 · Updated: 2023-10-29 20:54:39

How would I use Pydoop on Amazon EMR?

I tried googling the topic, to no avail: is it possible at all?

Best answer

I finally got it working. Everything happens on the master node... SSH into that node as the user hadoop.

You need a few packages:

sudo easy_install argparse importlib
sudo apt-get update
sudo apt-get install libboost-python-dev

Build everything:

wget http://apache.mirrors.pair.com/hadoop/common/hadoop-0.20.205.0/hadoop-0.20.205.0.tar.gz
wget http://sourceforge.net/projects/pydoop/files/Pydoop-0.6/pydoop-0.6.0.tar.gz
tar xvf hadoop-0.20.205.0.tar.gz
tar xvf pydoop-0.6.0.tar.gz

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export JVM_ARCH=64 # I assume that 32 works for 32-bit systems
export HADOOP_HOME=/home/hadoop
export HADOOP_CPP_SRC=/home/hadoop/hadoop-0.20.205.0/src/c++/
export HADOOP_VERSION=0.20.205
export HDFS_LINK=/home/hadoop/hadoop-0.20.205.0/src/c++/libhdfs/
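Rather than guessing the value for JVM_ARCH, the word size of the system can be checked from Python (a small sketch; it reports the pointer width of the running interpreter, which matches the OS architecture on a native build):

```python
import struct

# Pointer size in bits: 64 on a 64-bit interpreter, 32 on a 32-bit one.
jvm_arch = struct.calcsize("P") * 8
print(jvm_arch)
```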

cd ~/hadoop-0.20.205.0/src/c++/libhdfs
sh ./configure
make
make install
cd ../install
tar cvfz ~/libhdfs.tar.gz lib
sudo tar xvf ~/libhdfs.tar.gz -C /usr

cd ~/pydoop-0.6.0
python setup.py bdist
cp dist/pydoop-0.6.0.linux-x86_64.tar.gz ~/
sudo tar xvf ~/pydoop-0.6.0.linux-x86_64.tar.gz -C /

Save those two tarballs; later you can skip the build part and just do the following to install them (still need to figure out how to do this as a bootstrap action so it installs on a multi-node cluster):

sudo tar xvf ~/libhdfs.tar.gz -C /usr
sudo tar xvf ~/pydoop-0.6.0.linux-x86_64.tar.gz -C /

I was then able to run the sample program using the full-fledged Hadoop API (after fixing a bug in the constructor so that it calls super(WordCountMapper, self)):

#!/usr/bin/python

import pydoop.pipes as pp

class WordCountMapper(pp.Mapper):

    def __init__(self, context):
        super(WordCountMapper, self).__init__(context)
        context.setStatus("initializing")
        self.input_words = context.getCounter("WORDCOUNT", "INPUT_WORDS")

    def map(self, context):
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")
        context.incrementCounter(self.input_words, len(words))

class WordCountReducer(pp.Reducer):

    def reduce(self, context):
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))

pp.runTask(pp.Factory(WordCountMapper, WordCountReducer))
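The MapReduce logic itself can be sanity-checked locally without Pydoop or a cluster by simulating the map/shuffle/reduce flow in plain Python (the function names below are illustrative, not part of Pydoop):

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, "1") pairs, mirroring WordCountMapper.map.
    for line in lines:
        for w in line.split():
            yield w, "1"

def shuffle(pairs):
    # Group values by key, as the Hadoop framework does between phases.
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

def reduce_phase(grouped):
    # Sum the counts per word, mirroring WordCountReducer.reduce.
    return {k: str(sum(int(v) for v in vs)) for k, vs in grouped.items()}

counts = reduce_phase(shuffle(map_phase(["hello world", "hello pydoop"])))
print(counts)
```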

I uploaded the program to a bucket and called it run. Then I used the following conf.xml:

<?xml version="1.0"?>
<configuration>

  <property>
    <name>hadoop.pipes.executable</name>
    <value>s3://<my bucket>/run</value>
  </property>

  <property>
    <name>mapred.job.name</name>
    <value>myjobname</value>
  </property>

  <property>
    <name>hadoop.pipes.java.recordreader</name>
    <value>true</value>
  </property>

  <property>
    <name>hadoop.pipes.java.recordwriter</name>
    <value>true</value>
  </property>

</configuration>

Finally, I used the following command line:

hadoop pipes -conf conf.xml -input s3://elasticmapreduce/samples/wordcount/input -output s3://tmp.nou/asdf
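If you generate jobs programmatically, a conf.xml like the one above can be built with the standard library; a sketch (the bucket path and job name here are placeholders, not values from the original post):

```python
import xml.etree.ElementTree as ET

def pipes_conf(executable, job_name):
    # Build a Hadoop Pipes configuration equivalent to the conf.xml above.
    props = {
        "hadoop.pipes.executable": executable,
        "mapred.job.name": job_name,
        "hadoop.pipes.java.recordreader": "true",
        "hadoop.pipes.java.recordwriter": "true",
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

xml_text = pipes_conf("s3://mybucket/run", "myjobname")
print(xml_text)
```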

Regarding python - Pydoop on Amazon EMR, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/10730311/
