I'm trying to use Flume-ng to capture 90 seconds' worth of log information and write it to a file in HDFS. I have Flume watching the log file via an exec source with tail, but it creates a new file every 5 seconds instead of every 90 seconds.
My flume.conf is as follows:
# example.conf: A single-node Flume configuration
# Name the components on this agent
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
# Describe/configure source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -f /home/cloudera/LogCreator/fortune_log.log
# Describe sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://localhost/flume/logtest/
agent1.sinks.sink1.hdfs.filePrefix = LogCreateTest
# this parameter seems to be getting overridden
agent1.sinks.sink1.hdfs.rollInterval=90
agent1.sinks.sink1.hdfs.rollSize=0
agent1.sinks.sink1.hdfs.hdfs.rollCount = 0
# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
13/01/03 09:43:02 INFO properties.PropertiesFileConfigurationProvider: Reloading configuration file:/etc/flume-ng/conf/flume.conf
13/01/03 09:43:02 INFO conf.FlumeConfiguration: Processing:sink1
13/01/03 09:43:02 INFO conf.FlumeConfiguration: Processing:sink1
13/01/03 09:43:02 INFO conf.FlumeConfiguration: Processing:sink1
13/01/03 09:43:02 INFO conf.FlumeConfiguration: Processing:sink1
13/01/03 09:43:02 INFO conf.FlumeConfiguration: Processing:sink1
13/01/03 09:43:02 INFO conf.FlumeConfiguration: Processing:sink1
13/01/03 09:43:02 INFO conf.FlumeConfiguration: Processing:sink1
13/01/03 09:43:02 INFO conf.FlumeConfiguration: Added sinks: sink1 Agent: agent1
13/01/03 09:43:03 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [agent1]
13/01/03 09:43:03 INFO properties.PropertiesFileConfigurationProvider: Creating channels
13/01/03 09:43:03 INFO instrumentation.MonitoredCounterGroup: Monitoried counter group for type: CHANNEL, name: channel1, registered successfully.
13/01/03 09:43:03 INFO properties.PropertiesFileConfigurationProvider: created channel channel1
13/01/03 09:43:03 INFO sink.DefaultSinkFactory: Creating instance of sink: sink1, type: hdfs
13/01/03 09:43:03 INFO hdfs.HDFSEventSink: Hadoop Security enabled: false
13/01/03 09:43:03 INFO instrumentation.MonitoredCounterGroup: Monitoried counter group for type: SINK, name: sink1, registered successfully.
13/01/03 09:43:03 INFO nodemanager.DefaultLogicalNodeManager: Starting new configuration:{ sourceRunners:{source1=EventDrivenSourceRunner: { source:org.apache.flume.source.ExecSource{name:source1,state:IDLE} }} sinkRunners:{sink1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@1a50ca0c counterGroup:{ name:null counters:{} } }} channels:{channel1=org.apache.flume.channel.MemoryChannel{name: channel1}} }
13/01/03 09:43:03 INFO nodemanager.DefaultLogicalNodeManager: Starting Channel channel1
13/01/03 09:43:03 INFO instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: channel1 started
13/01/03 09:43:03 INFO nodemanager.DefaultLogicalNodeManager: Starting Sink sink1
13/01/03 09:43:03 INFO nodemanager.DefaultLogicalNodeManager: Starting Source source1
13/01/03 09:43:03 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: sink1 started
13/01/03 09:43:03 INFO source.ExecSource: Exec source starting with command:tail -f /home/cloudera/LogCreator/fortune_log.log
13/01/03 09:43:07 INFO hdfs.BucketWriter: Creating hdfs://localhost/flume/logtest//LogCreateTest.1357224186506.tmp
13/01/03 09:43:08 INFO hdfs.BucketWriter: Renaming hdfs://localhost/flume/logtest/LogCreateTest.1357224186506.tmp to hdfs://localhost/flume/logtest/LogCreateTest.1357224186506
13/01/03 09:43:08 INFO hdfs.BucketWriter: Creating hdfs://localhost/flume/logtest//LogCreateTest.1357224186507.tmp
13/01/03 09:43:12 INFO hdfs.BucketWriter: Renaming hdfs://localhost/flume/logtest/LogCreateTest.1357224186507.tmp to hdfs://localhost/flume/logtest/LogCreateTest.1357224186507
13/01/03 09:43:12 INFO hdfs.BucketWriter: Creating hdfs://localhost/flume/logtest//LogCreateTest.1357224186508.tmp
13/01/03 09:43:12 INFO hdfs.BucketWriter: Renaming hdfs://localhost/flume/logtest/LogCreateTest.1357224186508.tmp to hdfs://localhost/flume/logtest/LogCreateTest.1357224186508
13/01/03 09:43:12 INFO hdfs.BucketWriter: Creating hdfs://localhost/flume/logtest//LogCreateTest.1357224186509.tmp
13/01/03 09:43:18 INFO hdfs.BucketWriter: Renaming hdfs://localhost/flume/logtest/LogCreateTest.1357224186509.tmp to hdfs://localhost/flume/logtest/LogCreateTest.1357224186509
13/01/03 09:43:18 INFO hdfs.BucketWriter: Creating hdfs://localhost/flume/logtest//LogCreateTest.1357224186510.tmp
13/01/03 09:43:18 INFO hdfs.BucketWriter: Renaming hdfs://localhost/flume/logtest/LogCreateTest.1357224186510.tmp to hdfs://localhost/flume/logtest/LogCreateTest.1357224186510
Best answer
Rewrite the configuration file and spell out a more complete set of parameters. The example below rolls a file after 10,000 records or 10 minutes, whichever comes first. I also switched from a memory channel to a file channel to improve the reliability of the data flow. The frequent rolls in your original configuration are most likely the HDFS sink's defaults taking effect: rollCount defaults to 10 events and rollSize to 1024 bytes, and the doubled prefix in agent1.sinks.sink1.hdfs.hdfs.rollCount means your intended rollCount = 0 was never actually applied.
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
# Describe/configure source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -f /home/cloudera/LogCreator/fortune_log.log
# Describe sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://localhost/flume/logtest/
agent1.sinks.sink1.hdfs.filePrefix = LogCreateTest
# Number of seconds to wait before rolling current file (0 = never roll based on time interval)
agent1.sinks.sink1.hdfs.rollInterval = 600
# File size to trigger roll, in bytes (0: never roll based on file size)
agent1.sinks.sink1.hdfs.rollSize = 0
# Number of events written to file before it is rolled (0 = never roll based on number of events)
agent1.sinks.sink1.hdfs.rollCount = 10000
# Number of events written to file before they are flushed to HDFS
agent1.sinks.sink1.hdfs.batchSize = 10000
agent1.sinks.sink1.hdfs.txnEventMax = 40000
# -- Compression codec. one of following : gzip, bzip2, lzo, snappy
# hdfs.codeC = gzip
#format: currently SequenceFile, DataStream or CompressedStream
#(1)DataStream will not compress output file and please don't set codeC
#(2)CompressedStream requires set hdfs.codeC with an available codeC
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.maxOpenFiles=50
# hdfs.writeFormat -- "Text" or "Writable"
agent1.sinks.sink1.hdfs.appendTimeout = 10000
agent1.sinks.sink1.hdfs.callTimeout = 10000
# Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)
agent1.sinks.sink1.hdfs.threadsPoolSize=100
# Number of threads per HDFS sink for scheduling timed file rolling
agent1.sinks.sink1.hdfs.rollTimerPoolSize = 1
# hdfs.kerberosPrincipal Kerberos user principal for accessing secure HDFS
# hdfs.kerberosKeytab Kerberos keytab for accessing secure HDFS
# hdfs.round false Should the timestamp be rounded down (if true, affects all time based escape sequences except %t)
# hdfs.roundValue 1 Rounded down to the highest multiple of this (in the unit configured using hdfs.roundUnit), less than current time.
# hdfs.roundUnit second The unit of the round down value - second, minute or hour.
# serializer TEXT Other possible options include AVRO_EVENT or the fully-qualified class name of an implementation of the EventSerializer.Builder interface.
# serializer.*
# Use a channel which buffers events to a file
# -- The component type name, needs to be FILE.
agent1.channels.channel1.type = FILE
# checkpointDir ~/.flume/file-channel/checkpoint The directory where checkpoint file will be stored
# dataDirs ~/.flume/file-channel/data The directory where log files will be stored
# The maximum size of transaction supported by the channel
agent1.channels.channel1.transactionCapacity = 1000000
# Amount of time (in millis) between checkpoints
agent1.channels.channel1.checkpointInterval = 30000
# Max size (in bytes) of a single log file
agent1.channels.channel1.maxFileSize = 2146435071
# Maximum capacity of the channel
agent1.channels.channel1.capacity = 10000000
#keep-alive 3 Amount of time (in sec) to wait for a put operation
#write-timeout 3 Amount of time (in sec) to wait for a write operation
# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
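For reference, a minimal sketch of how the agent could then be launched, assuming the configuration is saved as /etc/flume-ng/conf/flume.conf (the path shown in the log output above):
# start agent1 in the foreground and log to the console
flume-ng agent --conf /etc/flume-ng/conf --conf-file /etc/flume-ng/conf/flume.conf --name agent1 -Dflume.root.logger=INFO,console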
Regarding "hdfs - Using the HDFS Sink and rollInterval in Flume-ng to batch 90 seconds of log information", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/14141287/