
hadoop - Flume takes a long time to copy data to HDFS when rolling by file size


I have a use case where I want to copy remote files to HDFS using Flume. I also want the copied files to align with the HDFS block size (128 MB/256 MB). The total size of the remote data is 33 GB.

I am using an Avro source and sink to copy the remote data into HDFS, and on the sink side I roll the files by size (128/256 MB). But copying a file from the remote machine and storing it in HDFS as a 128/256 MB file takes Flume about 2 minutes per file on average.
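(For reference, the rollSize in the HDFS sink below, 130023424 bytes, works out to 124 MiB, i.e. 124 × 1024 × 1024, presumably chosen to keep each rolled file slightly under the 128 MB HDFS block size so that it fits within a single block.)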

Flume configuration on the remote machine (spooling directory source, file channel, Avro sink):

### Agent1 - Spooling Directory Source and File Channel, Avro Sink  ###
# Name the components on this agent
Agent1.sources = spooldir-source
Agent1.channels = file-channel
Agent1.sinks = avro-sink

# Describe/configure Source
Agent1.sources.spooldir-source.type = spooldir
Agent1.sources.spooldir-source.spoolDir = /home/Benchmarking_Simulation/test


# Describe the sink
Agent1.sinks.avro-sink.type = avro
# IP address of the destination machine (inline comments are not valid in Flume properties files, so this note lives on its own line)
Agent1.sinks.avro-sink.hostname = xx.xx.xx.xx
Agent1.sinks.avro-sink.port = 50000

#Use a channel which buffers events in file
Agent1.channels.file-channel.type = file
Agent1.channels.file-channel.checkpointDir = /home/Flume_CheckPoint_Dir/
Agent1.channels.file-channel.dataDirs = /home/Flume_Data_Dir/
Agent1.channels.file-channel.capacity = 10000000
Agent1.channels.file-channel.transactionCapacity = 50000

# Bind the source and sink to the channel
Agent1.sources.spooldir-source.channels = file-channel
Agent1.sinks.avro-sink.channel = file-channel

Flume configuration on the machine running HDFS (Avro source, file channel, HDFS sink):

### Agent1 - Avro Source and File Channel, HDFS Sink  ###
# Name the components on this agent
Agent1.sources = avro-source1
Agent1.channels = file-channel1
Agent1.sinks = hdfs-sink1

# Describe/configure Source
Agent1.sources.avro-source1.type = avro
Agent1.sources.avro-source1.bind = xx.xx.xx.xx
Agent1.sources.avro-source1.port = 50000

# Describe the sink
Agent1.sinks.hdfs-sink1.type = hdfs
Agent1.sinks.hdfs-sink1.hdfs.path = /user/Benchmarking_data/multiple_agent_parallel_1
Agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0
Agent1.sinks.hdfs-sink1.hdfs.rollSize = 130023424
Agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
Agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
Agent1.sinks.hdfs-sink1.hdfs.batchSize = 50000
Agent1.sinks.hdfs-sink1.hdfs.txnEventMax = 40000
Agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize = 1000
Agent1.sinks.hdfs-sink1.hdfs.appendTimeout = 10000
Agent1.sinks.hdfs-sink1.hdfs.callTimeout = 200000


#Use a channel which buffers events in file
Agent1.channels.file-channel1.type = file
Agent1.channels.file-channel1.checkpointDir = /home/Flume_Check_Point_Dir
Agent1.channels.file-channel1.dataDirs = /home/Flume_Data_Dir
Agent1.channels.file-channel1.capacity = 100000000
Agent1.channels.file-channel1.transactionCapacity = 100000


# Bind the source and sink to the channel
Agent1.sources.avro-source1.channels = file-channel1
Agent1.sinks.hdfs-sink1.channel = file-channel1
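For reference, each agent would typically be started with the standard flume-ng launcher; the property file names here are illustrative, not taken from the question:

$ bin/flume-ng agent --conf conf --conf-file agent1-remote.properties --name Agent1
$ bin/flume-ng agent --conf conf --conf-file agent1-hdfs.properties --name Agent1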

The network bandwidth between the two machines is 686 Mbps.
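Rough arithmetic: 686 Mbps is about 85 MB/s, so a 128 MB file should cross the wire in roughly 1.5 seconds. Two minutes per file is only about 1 MB/s of effective throughput, which suggests the bottleneck is in the Flume pipeline itself rather than the network.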

Can someone help me figure out whether something is wrong with this configuration, or suggest an alternative configuration, so that the copy does not take so long?

Best Answer

Both agents use a file channel, so the data is written to disk twice before it is ever written to HDFS. You could try a memory channel on each agent and see whether performance improves.
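As a minimal sketch of that suggestion, the HDFS-side agent could swap its file channel for a memory channel as below; the capacity values are illustrative, and the usual trade-off applies: events buffered in a memory channel are lost if the agent process dies.

Agent1.channels = memory-channel1

# Memory channel: buffers events in RAM instead of on disk
Agent1.channels.memory-channel1.type = memory
Agent1.channels.memory-channel1.capacity = 1000000
Agent1.channels.memory-channel1.transactionCapacity = 100000

# Rebind the existing source and sink to the new channel
Agent1.sources.avro-source1.channels = memory-channel1
Agent1.sinks.hdfs-sink1.channel = memory-channel1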

Regarding "hadoop - Flume takes a long time to copy data to HDFS when rolling by file size", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40527476/
