
flume - Loading big files into HDFS using Flume (spool directory)


We copied a 150 MB csv file into Flume's spool directory, and when it was loaded into HDFS the file was split into smaller files of around 80 KB each. Is there a way to load the file with Flume without it being split into smaller pieces? Each small file generates additional metadata inside the namenode, so we need to avoid producing many of them.

My Flume configuration looks like this:

# Initialize agent's source, channel and sink
agent.sources = TwitterExampleDir
agent.channels = memoryChannel
agent.sinks = flumeHDFS

# Setting the source to spool directory where the file exists
agent.sources.TwitterExampleDir.type = spooldir
agent.sources.TwitterExampleDir.spoolDir = /usr/local/flume/live

# Setting the channel to memory
agent.channels.memoryChannel.type = memory
# Max number of events stored in the memory channel
agent.channels.memoryChannel.capacity = 10000
# agent.channels.memoryChannel.batchSize = 15000
# Max number of events per channel transaction
# (property names are case-sensitive: transactionCapacity, not transactioncapacity)
agent.channels.memoryChannel.transactionCapacity = 1000000

# Setting the sink to HDFS
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = hdfs://info3s7:54310/spool5
agent.sinks.flumeHDFS.hdfs.fileType = DataStream

# Write format can be text or writable
agent.sinks.flumeHDFS.hdfs.writeFormat = Text

# use a single csv file at a time
agent.sinks.flumeHDFS.hdfs.maxOpenFiles = 1

# rollover file based on maximum size of 10 MB
agent.sinks.flumeHDFS.hdfs.rollCount=0
agent.sinks.flumeHDFS.hdfs.rollInterval=2000
agent.sinks.flumeHDFS.hdfs.rollSize = 0
agent.sinks.flumeHDFS.hdfs.batchSize = 1000000

# never rollover based on the number of events
agent.sinks.flumeHDFS.hdfs.rollCount = 0

# rollover file based on max time of 1 min
#agent.sinks.flumeHDFS.hdfs.rollInterval = 0
# agent.sinks.flumeHDFS.hdfs.idleTimeout = 600

# Connect source and sink with channel
agent.sources.TwitterExampleDir.channels = memoryChannel
agent.sinks.flumeHDFS.channel = memoryChannel

Best Answer

What you want is this:

# rollover file based on maximum size of 10 MB
agent.sinks.flumeHDFS.hdfs.rollCount = 0
agent.sinks.flumeHDFS.hdfs.rollInterval = 0
agent.sinks.flumeHDFS.hdfs.rollSize = 10000000
agent.sinks.flumeHDFS.hdfs.batchSize = 10000
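With these settings, the 150 MB file from the question should arrive in HDFS as roughly fifteen files of about 10 MB each, instead of hundreds of 80 KB pieces.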

From the flume documentation:
hdfs.rollSize: File size to trigger roll, in bytes (0: never roll based on file size)
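The neighboring entries in the same documentation table cover the other two roll triggers:

hdfs.rollInterval: Number of seconds to wait before rolling current file (0: never roll based on time interval)
hdfs.rollCount: Number of events written to file before it rolled (0: never roll based on number of events)

With rollInterval and rollCount both set to 0, rollSize is the only remaining trigger, so files are closed only once they reach the configured size.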

In your example, you use a rollInterval of 2000, which will roll the file over after 2000 seconds, resulting in small files.

Also note that batchSize reflects the number of events before the file is flushed to HDFS, not necessarily the number of events before the file is closed and a new one created. You will want to set it to a value small enough that writing a large file does not time out, but large enough to avoid the overhead of making many small requests to HDFS.
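As an illustration, here is a hedged sketch building on the sink above; the exact numbers are assumptions to tune for your event size and cluster, not values from the documentation. It pairs a moderate batch size with the idleTimeout option that is commented out in your config:

# flush a batch to HDFS every 10,000 events: large enough to
# amortize request overhead, small enough to avoid write timeouts
agent.sinks.flumeHDFS.hdfs.batchSize = 10000

# close a file that has received no events for 60 seconds (assumed
# value), so the tail of a spooled csv is not left open indefinitely
agent.sinks.flumeHDFS.hdfs.idleTimeout = 60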

Regarding flume - loading big files into HDFS using Flume (spool directory), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22527089/
