• Big Data Development: Flume in Practice


    1. Using netcat as the source and logger as the sink

    1.1 Configuration file
    # flume-netcat.conf: a single-node Flume configuration

    # Name the components of agent a1
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

    # Configure source r1 of agent a1
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # Configure sink k1 of agent a1
    a1.sinks.k1.type = logger

    # Configure channel c1 of agent a1; the channel buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

    # Bind the source and the sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    

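    This configuration file defines an agent named a1. It has a source that listens for data on port 44444 of the local machine, a channel that buffers events, and a sink that writes event data to the console. The file names each component and sets its type and other properties. A single configuration file may define several agents, so when starting Flume you pass the name of the agent that this process should run.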
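
The naming convention is easier to see when the declaration lines are pulled out of the file. The following is a small sketch in Python, not part of Flume: the function name `list_components` and the `SAMPLE` text are made up for illustration, and the parser only understands the simple `key = value` lines used in this article.

```python
from collections import defaultdict

def list_components(conf_text):
    """Return {agent: {"sources": [...], "sinks": [...], "channels": [...]}}."""
    agents = defaultdict(dict)
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        parts = key.split(".")
        # Only declaration lines have the form <agent>.<kind> = name1 name2 ...
        if len(parts) == 2 and parts[1] in ("sources", "sinks", "channels"):
            agents[parts[0]][parts[1]] = value.split()
    return dict(agents)

# Hypothetical excerpt of the configuration above.
SAMPLE = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

print(list_components(SAMPLE))
# {'a1': {'sources': ['r1'], 'sinks': ['k1'], 'channels': ['c1']}}
```

The top-level key (`a1` here) is the agent name that has to be passed on the command line when the agent is started.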

    1.2 Starting the agent (logging to the console)
    ./bin/flume-ng agent --conf conf --conf-file ./conf/flume-netcat.conf --name a1 -Dflume.root.logger=INFO,console
    
    1.3 Connecting to the port
    [root@master ~]# telnet localhost 44444
    Trying ::1...
    telnet: connect to address ::1: Connection refused
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    
    1.4 Testing
    [root@master ~]# telnet localhost 44444
    Trying ::1...
    telnet: connect to address ::1: Connection refused
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    hello
    OK
    word
    OK
    dzw
    OK
    ttt
    OK
    haddop^H
    OK
    spark
    OK
    flume
    OK
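
The telnet session above can also be scripted. Below is a minimal sketch of a Python client, assuming the agent from 1.1 is listening on localhost:44444 and acknowledges every line with `OK`, as it does in the transcript. The helper names `encode_line` and `send_lines` are my own, not part of any Flume tooling.

```python
import socket

def encode_line(text):
    """Encode one event the way the netcat source reads it: text ended by a newline."""
    return text.encode("utf-8") + b"\n"

def send_lines(lines, host="localhost", port=44444, timeout=5.0):
    """Send each line and return the acknowledgement received for it."""
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        reader = conn.makefile("r", encoding="utf-8", newline="\n")
        for text in lines:
            conn.sendall(encode_line(text))
            # The source answers one line per event ("OK" in the transcript above).
            replies.append(reader.readline().strip())
    return replies
```

With the agent running, `send_lines(["hello", "spark", "flume"])` should return one acknowledgement per line. Because this client ends each line with `\n` only, the event bodies would not carry the carriage return that telnet adds.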
    

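    The Flume terminal logs the content of each event it receives. The trailing 0D in every body is the carriage return that telnet sends at the end of each line; the logger prints non-printable bytes as a dot.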

    2021-01-19 16:05:27,669 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 65 6C 6C 6F 0D                               hello. }
    2021-01-19 16:05:29,842 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 77 6F 72 64 0D                                  word. }
    2021-01-19 16:05:38,846 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 64 7A 77 0D                                     dzw. }
    2021-01-19 16:14:24,955 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 74 74 74 0D                                     ttt. }
    2021-01-19 16:19:43,018 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 64 64 6F 70 08 0D                         haddop.. }
    2021-01-19 16:19:52,022 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 73 70 61 72 6B 0D                               spark. }
    2021-01-19 16:19:53,289 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 66 6C 75 6D 65 0D                               flume. }
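
To read the `body:` column of these log lines, the hex dump can be turned back into bytes. A short sketch follows; `decode_body` is a name chosen here for illustration, and the two dumps are copied from the log output above.

```python
def decode_body(hex_dump):
    """Turn the space-separated hex bytes printed by the logger sink into bytes."""
    return bytes.fromhex(hex_dump)

for dump in ("68 65 6C 6C 6F 0D", "68 61 64 64 6F 70 08 0D"):
    body = decode_body(dump)
    # Strip the carriage return left over from telnet before decoding the text.
    print(repr(body), "->", repr(body.rstrip(b"\r").decode("ascii")))
# b'hello\r' -> 'hello'
# b'haddop\x08\r' -> 'haddop\x08'
```

The second line shows why `haddop^H` appears as `haddop..` in the log: the backspace typed in telnet was sent as byte 08, followed by the carriage return 0D.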
    

    2. Using netcat as the source and logger as the sink, dropping events that consist only of digits

    2.1 Configuration file
    # Name the components of agent a1
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

    # Configure source r1 of agent a1
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # Interceptor on the source: drop events whose body matches the regex (digits only)
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = regex_filter
    a1.sources.r1.interceptors.i1.regex = ^[0-9]*$
    a1.sources.r1.interceptors.i1.excludeEvents = true

    # Configure sink k1 of agent a1
    a1.sinks.k1.type = logger

    # Configure channel c1 of agent a1; the channel buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

    # Bind the source and the sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    

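    Compared with section 1, the only addition is the interceptor block with the regex rule. With excludeEvents = true, events whose body matches ^[0-9]*$ (digits only) are dropped, and every other event passes through, including bodies that mix digits and letters.

    2.2 Starting the agent and connecting

    Same as in section 1.

    2.3 Testing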
    [root@master ~]# telnet localhost 44444
    Trying ::1...
    telnet: connect to address ::1: Connection refused
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    liuyichang
    OK
    1234
    OK
    hand
    OK
    1199
    OK
    hahahaah
    OK
    1
    OK
    2
    OK
    3
    OK
    4dididi
    OK
    12wd34
    OK
    Connection closed by foreign host.
    

    Check the output. Only the lines that are not purely numeric reach the logger:

    2021-01-19 17:29:16,832 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 6C 69 75 79 69 63 68 61 6E 67 0D                liuyichang. }
    2021-01-19 17:29:31,836 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 6E 64 0D                                  hand. }
    2021-01-19 17:30:49,868 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 68 61 68 61 68 61 61 68 0D                      hahahaah. }
    2021-01-19 17:30:53,870 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 34 64 69 64 69 64 69 0D                         4dididi. }
    2021-01-19 17:31:09,362 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{} body: 31 32 77 64 33 34 0D                            12wd34. }
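
The result above is what the regex predicts. Here is a rough Python model of the filter, with `apply_regex_filter` as a made-up helper name. Two assumptions to keep in mind: Flume evaluates the pattern with Java's regex engine, whose `$` also tolerates the trailing carriage return that telnet leaves in the body, whereas Python's `$` does not, so this sketch works on the text without the line ending.

```python
import re

DIGITS_ONLY = re.compile(r"^[0-9]*$")

def apply_regex_filter(bodies, pattern=DIGITS_ONLY, exclude_events=True):
    """Keep or drop bodies depending on whether the pattern matches."""
    kept = []
    for body in bodies:
        matched = pattern.search(body) is not None
        # excludeEvents = true drops matches; false drops everything else.
        if matched != exclude_events:
            kept.append(body)
    return kept

# The lines typed in the telnet session of 2.3.
typed = ["liuyichang", "1234", "hand", "1199", "hahahaah",
         "1", "2", "3", "4dididi", "12wd34"]
print(apply_regex_filter(typed))
# ['liuyichang', 'hand', 'hahahaah', '4dididi', '12wd34']
```

Since `*` also matches zero digits, an empty line matches the pattern too and would be dropped along with the numeric lines.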
    

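    3. Using netcat as the source and writing to HDFS with the sink

    3.1 Configuration file
    # Name the components of agent a1
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1
    # Configure source r1 of agent a1
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    # Configure sink k1 of agent a1
    #a1.sinks.k1.type = logger
    a1.sinks.k1.type = hdfs
    # HDFS path to write to
    a1.sinks.k1.hdfs.path = hdfs:/flume
    # Prefix of the generated file names
    a1.sinks.k1.hdfs.filePrefix = events
    # Round the event timestamp down (here to 10 minutes) when resolving time escapes
    # such as %H%M in hdfs.path; the path above has no escapes, so this has no visible effect
    a1.sinks.k1.hdfs.round = true
    a1.sinks.k1.hdfs.roundValue = 10
    # Unit of the rounding value: minutes
    a1.sinks.k1.hdfs.roundUnit = minute
    # Roll to a new file every 60 seconds (the property name is rollInterval;
    # when it is not set, the default of 30 seconds applies)
    a1.sinks.k1.hdfs.rollInterval = 60
    a1.sinks.k1.hdfs.fileType = DataStream
    # Configure channel c1 of agent a1; the channel buffers events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

    # Bind the source and the sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1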
    
    3.2 Starting the agent and connecting

    Start the agent:

    ./bin/flume-ng agent --conf conf --conf-file ./conf/flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,console
    

    Connect to the port:

    telnet localhost 44444
    
    3.3 Testing
    3.3.1 Checking HDFS before the test
    [root@master ~]# hadoop fs -ls / 
    Found 10 items
    -rw-r--r--   2 root supergroup       1005 2020-12-07 14:57 /core-site.xml
    drwxr-xr-x   - root supergroup          0 2020-12-13 17:41 /data
    drwxr-xr-x   - root supergroup          0 2020-12-08 11:30 /dzw
    drwxr-xr-x   - root supergroup          0 2020-12-14 18:06 /hadoop
    drwxr-xr-x   - root supergroup          0 2020-12-29 17:59 /mr_wc
    drwxr-xr-x   - root supergroup          0 2020-12-29 17:57 /output
    drwxr-xr-x   - root supergroup          0 2020-12-21 15:34 /prodata
    drwxr-xr-x   - root supergroup          0 2020-12-08 11:30 /test
    drwx-wx-wx   - root supergroup          0 2020-12-14 21:43 /tmp
    drwxr-xr-x   - root supergroup          0 2020-12-25 11:40 /user
    

    At this point there is no /flume directory.

    3.3.2 Sending test input
    [root@master apache-flume-1.6.0-bin]# telnet localhost 44444
    Trying ::1...
    telnet: connect to address ::1: Connection refused
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    qwq
    OK
    qqdeqd
    OK
    stupid
    OK
    liuyichang
    OK
    100086
    OK
    sichuan
    OK
    China
    OK
    panda
    OK
    
    3.3.3 Checking the HDFS output files
    [root@slave1 ~]# hadoop fs -ls /
    Found 11 items
    -rw-r--r--   2 root supergroup       1005 2020-12-07 14:57 /core-site.xml
    drwxr-xr-x   - root supergroup          0 2020-12-13 17:41 /data
    drwxr-xr-x   - root supergroup          0 2020-12-08 11:30 /dzw
    drwxr-xr-x   - root supergroup          0 2021-01-20 16:26 /flume
    drwxr-xr-x   - root supergroup          0 2020-12-14 18:06 /hadoop
    drwxr-xr-x   - root supergroup          0 2020-12-29 17:59 /mr_wc
    drwxr-xr-x   - root supergroup          0 2020-12-29 17:57 /output
    drwxr-xr-x   - root supergroup          0 2020-12-21 15:34 /prodata
    drwxr-xr-x   - root supergroup          0 2020-12-08 11:30 /test
    drwx-wx-wx   - root supergroup          0 2020-12-14 21:43 /tmp
    drwxr-xr-x   - root supergroup          0 2020-12-25 11:40 /user
    

    Running Flume has automatically created the /flume directory in HDFS.

    [root@slave1 ~]# hadoop fs -ls /flume
    Found 1 items
    -rw-r--r--   2 root supergroup         13 2021-01-20 16:26 /flume/events.1611131189758.tmp
    [root@slave1 ~]# hadoop fs -ls /flume
    Found 1 items
    -rw-r--r--   2 root supergroup         13 2021-01-20 16:26 /flume/events.1611131189758.tmp
    [root@slave1 ~]# hadoop fs -ls /flume
    Found 2 items
    -rw-r--r--   2 root supergroup         21 2021-01-20 16:27 /flume/events.1611131189758
    -rw-r--r--   2 root supergroup         12 2021-01-20 16:27 /flume/events.1611131231774.tmp
    [root@slave1 ~]# hadoop fs -ls /flume
    Found 3 items
    -rw-r--r--   2 root supergroup         21 2021-01-20 16:27 /flume/events.1611131189758
    -rw-r--r--   2 root supergroup         29 2021-01-20 16:27 /flume/events.1611131231774
    -rw-r--r--   2 root supergroup         14 2021-01-20 16:27 /flume/events.1611131262116.tmp
    [root@slave1 ~]# hadoop fs -ls /flume
    Found 3 items
    -rw-r--r--   2 root supergroup         21 2021-01-20 16:27 /flume/events.1611131189758
    -rw-r--r--   2 root supergroup         29 2021-01-20 16:27 /flume/events.1611131231774
    -rw-r--r--   2 root supergroup         14 2021-01-20 16:28 /flume/events.1611131262116
    [root@slave1 ~]# hadoop fs -ls /flume/events.1611131189758   
    -rw-r--r--   2 root supergroup         21 2021-01-20 16:27 /flume/events.1611131189758
    [root@slave1 ~]# hadoop fs -cat /flume/events.1611131189758
    qwq
    qqdeqd
    stupid
    

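    The input that was sent can be read back from the files under /flume.
    Note: why .tmp files appear
    While the HDFS sink is still writing to a file, the file carries a .tmp suffix. When the roll interval elapses, the sink closes the file and renames it without the suffix; events that arrive afterwards go into a new .tmp file.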
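
The listing above also shows how often the files were rolled. The sketch below assumes that the number after the prefix is a Unix timestamp in milliseconds and that the cluster clock is UTC+8; both are inferred from the `hadoop fs -ls` times rather than stated anywhere, and `suffix_to_time` is a name invented for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
UTC8 = timezone(timedelta(hours=8))  # assumed local time zone of the cluster

def suffix_to_time(file_name, tz=UTC8):
    """Interpret the number after the last dot as milliseconds since the epoch.

    Expects a finished file name, i.e. without the .tmp suffix.
    """
    millis = int(file_name.rsplit(".", 1)[1])
    return (EPOCH + timedelta(milliseconds=millis)).astimezone(tz)

names = ["events.1611131189758", "events.1611131231774", "events.1611131262116"]
times = [suffix_to_time(n) for n in names]
for name, t in zip(names, times):
    print(name, t.strftime("%H:%M:%S"))
# events.1611131189758 16:26:29
# events.1611131231774 16:27:11
# events.1611131262116 16:27:42

# Seconds between the creation of consecutive files.
print([round((b - a).total_seconds()) for a, b in zip(times, times[1:])])
# [42, 30]
```

Under these assumptions the decoded times agree with the 16:26 to 16:28 modification times in the listing.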

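    How can Flume be configured to avoid producing too many small files?
    a. Limit the size of each file. The value is in bytes and must be a plain integer; 200 MB is 200 * 1024 * 1024 = 209715200:
    a1.sinks.k1.hdfs.rollSize = 209715200
    b. Limit the number of events stored in each file:
    a1.sinks.k1.hdfs.rollCount = 10000
    The three triggers (rollInterval, rollSize, rollCount) work independently, and a file is rolled as soon as any one of them fires. Setting a trigger to 0 disables it, so rollInterval also has to be raised or set to 0 for the size and count limits to take effect.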
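
How the three roll settings interact can be sketched as a small decision function. This is a simplified model for illustration, not Flume's implementation: `roll_reason` is a hypothetical helper, its defaults mirror the documented HDFS sink defaults (30 seconds, 1024 bytes, 10 events), and a value of 0 switches a trigger off.

```python
def roll_reason(elapsed_s, bytes_written, events_written,
                roll_interval=30, roll_size=1024, roll_count=10):
    """Return which threshold would close the current file, or None."""
    if roll_count > 0 and events_written >= roll_count:
        return "rollCount"
    if roll_size > 0 and bytes_written >= roll_size:
        return "rollSize"
    if roll_interval > 0 and elapsed_s >= roll_interval:
        return "rollInterval"
    return None

# rollSize is given in bytes: 200 MiB written out as an integer.
ROLL_SIZE_200_MIB = 200 * 1024 * 1024
print(ROLL_SIZE_200_MIB)  # 209715200

# Three short events after 30 s with default settings: only the timer fires.
print(roll_reason(30, 21, 3))  # rollInterval

# Size-only rolling: the other two triggers are switched off with 0,
# so a 5 MB file stays open however long it has been written to.
print(roll_reason(3600, 5_000_000, 50_000, roll_interval=0,
                  roll_size=ROLL_SIZE_200_MIB, roll_count=0))  # None
```

The second call corresponds to the first file in 3.3.3: 21 bytes and three events are far below the size and count limits, so the timer alone produced that small file.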

    4. Using HTTP as the source and logger as the sink

    4.1 Configuration file
    # Name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

    # Configure the source
    a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
    a1.sources.r1.bind = master
    a1.sources.r1.port = 50020

    # Configure the sink
    a1.sinks.k1.type = logger

    # Configure the channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

    # Bind the source and the sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
    
    4.2 Starting the agent (logging to the console)
    ./bin/flume-ng agent --conf conf --conf-file ./conf/flume-http.conf --name a1 -Dflume.root.logger=INFO,console
    
    4.3 Sending a test HTTP request
    [root@master ~]# curl -X POST -d '[{"headers" : {"timestamp" : "434324343","host" : "random_host.example.com"},"body" : "random_body"},{"headers" : {"namenode" : "namenode.example.com","datanode" : "random_datanode.example.com"},"body" : "liuyichang"}]' master:50020
    
    4.4 Checking the result
    2021-01-20 17:20:26,958 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] 
    Event: { headers:{namenode=namenode.example.com, datanode=random_datanode.example.com} 
    body: 6C 69 75 79 69 63 68 61 6E 67                   liuyichang }
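
The curl request from 4.3 can be reproduced in Python. The sketch below builds the same JSON array (a list of objects, each with a `headers` map and a `body` string) and posts it with the standard library. The helper names `build_payload` and `post_events` are mine, and the `Content-Type: application/json` header is my choice; the curl command above sends curl's default form content type instead.

```python
import json
import urllib.request

def build_payload(events):
    """Serialize (headers, body) pairs into the JSON array the request carries."""
    return json.dumps(
        [{"headers": dict(headers), "body": body} for headers, body in events]
    ).encode("utf-8")

def post_events(url, events, timeout=5.0):
    """POST the events and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=build_payload(events),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status

# The two events from the curl command in 4.3.
EVENTS = [
    ({"timestamp": "434324343", "host": "random_host.example.com"}, "random_body"),
    ({"namenode": "namenode.example.com",
      "datanode": "random_datanode.example.com"}, "liuyichang"),
]
print(build_payload(EVENTS).decode("utf-8"))
```

With the agent from 4.1 running, `post_events("http://master:50020", EVENTS)` would send both events in one request; each event should then appear as its own `Event: { headers:{...} body: ... }` line in the agent's console.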
    
  • Original article: https://blog.csdn.net/m0_67394002/article/details/126565635