
hadoop - How do I load a Kafka topic into HDFS?


I am using the Hortonworks sandbox.
Create the topic:

./kafka-topics.sh --create --zookeeper 10.25.3.207:2181 --replication-factor 1 --partitions 1 --topic lognew  
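A quick sanity check before producing (reusing the same sandbox addresses as above) is to describe the topic and confirm it exists:

./kafka-topics.sh --describe --zookeeper 10.25.3.207:2181 --topic lognew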

Tail the Apache access log and pipe it into the console producer:

tail -f  /var/log/httpd/access_log |./kafka-console-producer.sh --broker-list 10.25.3.207:6667 --topic lognew  

Start a consumer in another terminal (from the Kafka bin directory):

./kafka-console-consumer.sh --zookeeper 10.25.3.207:2181 --topic lognew --from-beginning  

The Apache access log entries are sent to the Kafka topic "lognew".

I need to store them in HDFS.
Any ideas or suggestions on how to do this would be appreciated.

Thanks in advance.
深沉

Best Answer

We use Camus.

Camus is a simple MapReduce job developed by LinkedIn to load data from Kafka into HDFS. It is capable of incrementally copying data from Kafka into HDFS such that every run of the MapReduce job picks up where the previous run left off. At LinkedIn, Camus is used to load billions of messages per day from Kafka into HDFS.
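A Camus run is an ordinary MapReduce job driven by a properties file. Below is a minimal sketch, not a drop-in configuration: the jar name and all HDFS paths are placeholders, the broker address reuses the sandbox from the question, and the property names follow the Camus README.

hadoop jar camus-example-<version>-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P camus.properties

And a camus.properties along these lines:

# Broker and topic to pull (sandbox broker from the question)
kafka.brokers=10.25.3.207:6667
kafka.whitelist.topics=lognew
# Where Camus writes data and keeps its execution/offset state in HDFS (placeholder paths)
etl.destination.path=/user/camus/topics
etl.execution.base.path=/user/camus/exec
etl.execution.history.path=/user/camus/exec/history
# Treat each message as a plain string, which suits raw access-log lines
camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.StringMessageDecoder
etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.StringRecordWriterProvider

Each run picks up from the offsets recorded under the execution history path, which is how Camus achieves the incremental copying described above.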

It seems to have been superseded by Gobblin, though:

Gobblin is a universal data ingestion framework for extracting, transforming, and loading large volume of data from a variety of data sources, e.g., databases, rest APIs, FTP/SFTP servers, filers, etc., onto Hadoop. Gobblin handles the common routine tasks required for all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, data publishing, etc. Gobblin ingests data from different data sources in the same execution framework, and manages metadata of different sources all in one place. This, combined with other features such as auto scalability, fault tolerance, data quality assurance, extensibility, and the ability of handling data model evolution, makes Gobblin an easy-to-use, self-serving, and efficient data ingestion framework.
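With Gobblin, the equivalent is a .pull job file handed to the Gobblin launcher. The following is a trimmed sketch modeled on the Kafka-HDFS ingestion example in the Gobblin documentation; the package names assume an Apache Gobblin release, and the broker, topic, and output directory are taken from this question or are placeholders.

# lognew-kafka-hdfs.pull -- pull the lognew topic from Kafka into HDFS
job.name=LognewKafkaToHdfs
job.group=KafkaIngestion
job.description=Pull the lognew topic from Kafka into HDFS

# Broker and topic (sandbox broker from the question); start from the earliest offset
kafka.brokers=10.25.3.207:6667
topic.whitelist=lognew
bootstrap.with.offset=earliest

source.class=org.apache.gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=org.apache.gobblin.extract.kafka

# Write plain text files, one directory per topic (placeholder output path)
writer.builder.class=org.apache.gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=txt

data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=/user/gobblin/lognew

On a Hadoop cluster this would typically run in MapReduce mode via the launcher script shipped with the Gobblin distribution (gobblin-mapreduce.sh in older releases), with Gobblin tracking offsets in its state store between runs.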

Regarding hadoop - How do I load a Kafka topic into HDFS?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33864443/
