
hadoop - Flume Twitter Stream rolling small files in HDFS

Reposted. Author: 行者123. Updated: 2023-12-02 21:59:27

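I think I have tried every combination of changes to the configuration file. I also read somewhere that this might be caused by my replication factor of 3, so I changed it to 1. I am using Cloudera Manager on AWS.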

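In HDFS the files are all under 20 KB, and I am trying to reach at least 40-50 MB. Interestingly, the same configuration file writes files of about 60 MB on a virtual machine I am using (Hadoop and tools pre-installed). My configuration file is below; any ideas?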

# The configuration file needs to define the sources, 
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'TwitterAgent'

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.keywords = apple, grapes, fruits, strawberry, mango, pear
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://123.456.789.us-west-2.compute.amazonaws.com:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000

Best answer

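If rollInterval, batchSize, rollSize, and rollCount have no effect, the remaining candidate looks like hdfs.callTimeout.

This fits with the reports that reducing the replication factor may be a solution.

Reducing the replication factor shortens HDFS operation time, and according to the Flume User Guide the default value of callTimeout is 10000 milliseconds.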
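
The suggestion above could be tried with a fragment like the following. It is only a sketch: the agent and sink names are taken from the question, hdfs.callTimeout is the HDFS sink property named in this answer, and the value 60000 is an arbitrary illustrative choice, not a recommendation from the answer or from the Flume documentation.

```properties
# Sketch only: raise the HDFS sink's call timeout above its 10000 ms default.
# 60000 is an assumed, illustrative value; tune it for the actual cluster.
TwitterAgent.sinks.HDFS.hdfs.callTimeout = 60000
```

After changing the value, restart the agent and compare the sizes of newly written files under the hdfs.path directory to see whether it made a difference.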

Other leads:

  • How-to: Do Apache Flume Performance Tuning (Part 1)
  • How can I force Flume-NG to process the backlog of events after a sink failed?
  • Using an HDFS Sink and rollInterval in Flume-ng to batch up 90 seconds of log information

Regarding "hadoop - Flume Twitter Stream rolling small files in HDFS", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/25745751/
