flume学习笔记（二） flum事务和部署架构解析

总字数5.6k 预计阅读25分钟

1 Flume事务

一提到事务，首先就想到的是关系型数据库中的事务，事务一个典型的特征就是将一批操作做成原子性的，要么都成功，要么都失败。

flume事务就是保证fllume能安全正常的运行，保证服务的可靠性，安全性。

在Flume中一共有两个事务:

Put事务：在Source到Channel之间
Take事务：Channel到Sink之间

从Source到Channel过程中，数据在Flume中会被封装成Event对象，也就是一批Event，把这批Event放到一个事务中，把这个事务也就是这批event一次性的放入Channel中。同理，Take事务的时候，也是把这一批event组成的事务统一拿出来到sink放到HDFS上。

事务具体流程如下图所示：

1.1 Put事务流程

事务开始的时候会调用一个doPut 方法，doPut方法将一批数据放在putList中;
- putList在向Channel发送数据之前先检查Channel的容量能否放得下，如果放不下一个都不放，只能doRollback;
- 数据批的大小取决于配置参数batch size的值;
- putList的大小取决于配置Channel的参数transaction capacity的大小，该参数大小就体现在putList上;(Channel的另一个参数capacity指的是Channel的容量);
数据顺利的放到putList之后，Flume中的 Take 事务
Take事务同样也有takeList，HDFS sink配置有一个batch size，这个参数决定Sink从Channel 取数据的时候一次取多少个，所以该batch size得小于takeList的大小，而takeList的大小取决于 transaction capacity 的大小，同样是channel中的参数。其中putlist，takelist容量可以通过下面配置：a3.channels.c3.transactionCapacity = 100

1.2 Take事务流程:

事务开始后

doTake方法会将channel中的event剪切到takeList中。如果后面接的是HDFS Sink的话，在把Channel中的event剪切到takeList中的同时也往写入HDFS的IO缓冲流中放一份event(数据写入HDFS是先写入IO缓冲流然后flush到HDFS);
当takeList中存放了batch size 数量的event之后，就会调用doCommit方法，doCommit方法会做两个操作:
- 针对HDFS Sink，手动调用IO流的flush方法，将IO流缓冲区的数据写入到HDFS磁盘中;
- 清空takeList中的数据
flush到HDFS的时候组容易出问题。flush到HDFS的时候，可能由于网络原因超时导致数据传输失败，这个时候调用doRollback方法来进行回滚，回滚的时候由于takeList中还有备份数据，所以将takeList中的数据原封不动地还给channel，这时候就完成了事务的回滚。
但是，如果flush到HDFS的时候，数据flush了一半之后出问题了，这意味着已经有一半的数据已经发送到HDFS上面了，现在出了问题，同样需要调用doRollback方法来进行回滚，回滚并没有“一半”之说，它只会把整个takeList中的数据返回给 channel，然后继续进行数据的读写。这样开启下一个事务的时候容易造成数据重复的问题。接下来可以调用doCommit方法，把putList中所有的Event放到 Channel 中，成功放完之后就清空putList;

在doCommit提交之后，事务在向Channel存放数据的过程中，事务容易出问题。如Sink取数据慢，而Source放数据速度快，容易造成Channel中数据的积压，如果putList中的数据放不进去，会如何呢?

此时会调用 doRollback 方法，doRollback方法会进行两项操作：将putList清空; 抛出 ChannelException异常。source会捕捉到doRollback抛出的异常，然后source就将刚才的一批数据重新采集，然后重新开始一个新的事务，这就是事务的回滚。

2 Flume Agent内部原理

1637065765693

重要组件：

1）ChannelSelector

ChannelSelector的作用就是选出Event将要被发往哪个Channel。其共有两种类型，分别是Replicating（复制）和Multiplexing（多路复用）。

ReplicatingSelector会将同一个Event发往所有的Channel，Multiplexing会根据相应的原则，将不同的Event发往不同的Channel。

2）SinkProcessor

SinkProcessor共有三种类型，分别是DefaultSinkProcessor、LoadBalancingSinkProcessor和FailoverSinkProcessor

DefaultSinkProcessor对应的是单个的Sink，LoadBalancingSinkProcessor和FailoverSinkProcessor对应的是Sink Group，

LoadBalancingSinkProcessor可以实现负载均衡的功能：利用一定算法将channel均衡的分配到sink上；

FailoverSinkProcessor可以错误恢复的功能：三个中只有一个是Active的sink,如果当前acticve的sink故障了，另外两台通过选举的方式上位，成为新的sink（选举的指标是配置文件中a1.sinkgroups.g1.processor.priority.k1）。

3 Flume拓扑结构

在实际应用中，根据不同的场景，可能部署多个flume实现功能，很多个flume使用的搭配结果如下：

3.1 简单串联

这种模式是将多个flume顺序连接起来了，从最初的source开始到最终sink传送的目的存储系统。此模式不建议桥接过多的flume数量， flume数量过多不仅会影响传输速率，而且一旦传输过程中某个节点flume宕机，会影响整个传输系统。

3.2 复制和多路复用

Flume支持将事件流向一个或者多个目的地。这种模式可以将相同数据复制到多个channel中，或者将不同数据分发到不同的channel中，sink可以选择传送到不同的目的地。

3.3 负载均衡和故障转移

Flume支持使用将多个sink逻辑上分到一个sink组，sink组配合不同的SinkProcessor可以实现负载均衡和错误恢复的功能。

3.4 聚合

这种模式是我们最常见的，也非常实用，日常web应用通常分布在上百个服务器，大者甚至上千个、上万个服务器。产生的日志，处理起来也非常麻烦。用flume的这种组合方式能很好的解决这一问题，每台服务器部署一个flume采集日志，传送到一个集中收集日志的flume，再由此flume上传到hdfs、hive、hbase等，进行日志分析。

4 Flume开发案例

4.1 复制

1）案例需求

使用Flume-1监控文件变动，Flume-1将变动内容传递给Flume-2，Flume-2负责存储到HDFS。同时Flume-1将变动内容传递给Flume-3，Flume-3负责输出到Local FileSystem。

2）需求分析：

具体类型选择如下图所示：

3）实现步骤：

（1）准备工作

在/opt/module/flume/job目录下创建group1文件夹

[molly@hadoop102 job]$ cd group1/

在/opt/module/datas/目录下创建flume3文件夹

[molly@hadoop102 datas]$ mkdir flume3

（2）创建flume-file-flume.conf

配置1个接收日志文件的source和两个channel、两个sink，分别输送给flume-flume-hdfs和flume-flume-dir。

编辑配置文件

[molly@hadoop102 group1]$ vim flume-file-flume.conf

添加如下内容

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# 将数据流复制给所有channel
a1.sources.r1.selector.type = replicating
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
# sink端的avro是一个数据发送者
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102 
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

（3）创建flume-flume-hdfs.conf

配置上级Flume输出的Source，输出是到HDFS的Sink。

编辑配置文件

[molly@hadoop102 group1]$ vim flume-flume-hdfs.conf

添加如下内容

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
# source端的avro是一个数据接收服务
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop102:8020/flume2/%Y%m%d/%H
#上传文件的前缀
a2.sinks.k1.hdfs.filePrefix = flume2-
#是否按照时间滚动文件夹
a2.sinks.k1.hdfs.round = true
#多少时间单位创建一个新的文件夹
a2.sinks.k1.hdfs.roundValue = 1
#重新定义时间单位
a2.sinks.k1.hdfs.roundUnit = hour
#是否使用本地时间戳
a2.sinks.k1.hdfs.useLocalTimeStamp = true
#积攒多少个Event才flush到HDFS一次
a2.sinks.k1.hdfs.batchSize = 100
#设置文件类型，可支持压缩
a2.sinks.k1.hdfs.fileType = DataStream
#多久生成一个新的文件
a2.sinks.k1.hdfs.rollInterval = 600
#设置每个文件的滚动大小大概是128M
a2.sinks.k1.hdfs.rollSize = 134217700
#文件的滚动与Event数量无关
a2.sinks.k1.hdfs.rollCount = 0
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

（4）创建flume-flume-dir.conf

配置上级Flume输出的Source，输出是到本地目录的Sink。

编辑配置文件

[molly@hadoop102 group1]$ vim flume-flume-dir.conf

添加如下内容

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/data/flume3
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

提示：输出的本地目录必须是已经存在的目录，如果该目录不存在，并不会创建新的目录。

（5）执行配置文件

分别启动对应的flume进程：flume-flume-dir，flume-flume-hdfs，flume-file-flume。

[molly@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group1/flume-flume-dir.conf

[molly@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group1/flume-flume-hdfs.conf

[molly@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group1/flume-file-flume.conf

（6）启动Hadoop和Hive

[molly@hadoop102 hadoop-2.7.2]$ sbin/start-dfs.sh
[molly@hadoop103 hadoop-2.7.2]$ sbin/start-yarn.sh
[molly@hadoop102 hive]$ bin/hive
hive (default)>

（7）检查HDFS上数据

（8）检查/opt/module/datas/flume3目录中数据

[molly@hadoop102 flume3]$ ll
-rw-rw-r--. 1 molly molly 5942 5月 22 00:09 1526918887550-3

4.2 多路复用+自定义Interceptor

1)案例需求

使用Flume采集服务器本地日志，需要按照日志类型的不同，将不同种类的日志发往不同的分析系统。

需求分析

在实际的开发中，一台服务器产生的日志类型可能有很多种，不同类型的日志可能需要发送到不同的分析系统。此时会用到Flume拓扑结构中的Multiplexing结构，Multiplexing的原理是，根据event中Header的某个key的值，将不同的event发送到不同的Channel中，所以我们需要自定义一个Interceptor，为不同类型的event的Header中的key赋予不同的值。

需求1 架构：在该案例中，我们以端口数据模拟日志，以数字（单个）和字母（单个）模拟不同类型的日志，我们需要自定义interceptor区分数字和字母，将其分别发往不同的分析系统（Channel）。

1637118002481

需求2架构：

1637068504415

需求1实现步骤

主要分为两大步：

第一使用拦截器对数据进行处理：对不同数据添加不同的head

第二：使用选择器将不同head的数据分发到不同的channel

（1）创建一个maven项目，并引入以下依赖。

<dependency>
   <groupId>org.apache.flume</groupId>
   <artifactId>flume-ng-core</artifactId>
   <version>1.9.0</version>
 </dependency>

（2）定义CustomInterceptor类并实现Interceptor接口。将写好的代码打包jar，并放到flume的lib目录（/opt/module/flume）下。

package com.atguigu.flume.interceptor;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.util.List;
public class CustomInterceptor implements Interceptor &#123;
    @Override
    public void initialize() &#123;

    &#125;

    @Override
    public Event intercept(Event event) &#123;

        byte[] body = event.getBody();
        if (body[0] < 'z' && body[0] > 'a') &#123;
            event.getHeaders().put("type", "letter");
        &#125; else if (body[0] > '0' && body[0] < '9') &#123;
            event.getHeaders().put("type", "number");
        &#125;
        return event;

    &#125;

    @Override
    public List<Event> intercept(List<Event> events) &#123;
        for (Event event : events) &#123;
            intercept(event);
        &#125;
        return events;
    &#125;

    @Override
    public void close() &#123;

    &#125;

    public static class Builder implements Interceptor.Builder &#123;

        @Override
        public Interceptor build() &#123;
            return new CustomInterceptor();
        &#125;

        @Override
        public void configure(Context context) &#123;
        &#125;
    &#125;
&#125;

（3）编辑flume配置文件

为hadoop102上的Flume1配置1个netcat source，1个sink group（2个avro sink），并配置相应的ChannelSelector和interceptor。根据下面的配置，head里是letter就放c1,header是number就放c2。

a1.sources.r1.selector.mapping.letter = c1

a1.sources.r1.selector.mapping.number = c2

完整配置如下：

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.flume.interceptor.CustomInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.letter = c1
a1.sources.r1.selector.mapping.number = c2
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141

a1.sinks.k2.type=avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4242

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100


# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

为hadoop103上的Flume4配置一个avro source和一个logger sink。

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 4141

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1

为hadoop104上的Flume3配置一个avro source和一个logger sink。

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 4242

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1

（4）分别在hadoop102，hadoop103，hadoop104上启动flume进程，注意先后顺序。

（5）在hadoop102使用netcat向localhost:44444发送字母和数字。

（6）观察hadoop103和hadoop104打印的日志。

4.3 负载均衡

负载均衡片处理器提供在多个Sink之间负载平衡的能力。实现支持通过round_robin（轮询）或者random（随机）参数来实现负载分发

默认情况下使用round_robin，但可以通过配置覆盖这个默认值。还可以通过集成AbstractSinkSelector类来实现用户自己的选择机制。

当被调用的时候，这选择器通过配置的选择规则选择下一个sink来调用。

1）案例需求

使用Flume1监控一个端口，将监控到的内容通过轮询或者随机的方式给到flume2和flume3。Flume2和Flume3将内容打印到控制台。这个时候需要使用LoadBalancingSinkProcessor。

2）架构分析：

具体架构选择见下图。

1637067365266

3）实现步骤

（1）准备工作

在/opt/module/flume/job目录下创建group3文件夹

[molly@hadoop102 job]$ cd group3/

（2）创建flume-netcat-flume.conf：flume1的配置

配置1个netcat source和1个channel、1个sink group（2个sink），分别输送给flume-flume-console2和flume-flume-console3。

编辑配置文件

#负载均衡
#配置Agent a1的组件
a1.sources=r1
a1.channels=c1
a1.sinks=s1 s2

#配置a1的source
a1.sources.r1.type=netcat
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=3333

##配置a1的channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.trancactionCapacity=100

#配置a1的sink
a1.sinks.s1.type=avro
a1.sinks.s1.hostname=node2
a1.sinks.s1.port=8888

a1.sinks.s2.type=avro
a1.sinks.s2.hostname=node3
a1.sinks.s2.port=8888

#配置sink组以及sink处理器运行
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = s1 s2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
#参数可选round_robin（轮询）或者random（随机）
a1.sinkgroups.g1.processor.selector = round_robin

#绑定
a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1
a1.sinks.s2.channel=c1

（3）flume2和flume3的配置：两个配置是一样的.flume-flume-console2，flume-flume-console1

# 配置Agent a1的组件
a1.sources=r1
a1.channels=c1
a1.sinks=s1
# 配置a1的source
a1.sources.r1.type=avro
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=8888
# 配置a1的channel
a1.channels.c1.type=memory
# 配置a1的sink
a1.sinks.s1.type=logger
# 绑定
a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

（4）执行配置文件

分别开启对应配置文件：flume-flume-console2，flume-flume-console1，flume-netcat-flume。

[molly@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
[molly@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
[molly@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group2/flume-netcat-flume.conf

（6）使用netcat工具向本机的44444端口发送内容

$ nc localhost 44444

（7）查看Flume2及Flume3的控制台打印日志

4.4 故障转移

1）案例需求

使用Flume1监控一个端口，其sink组中的sink分别对接Flume2和Flume3，采用FailoverSinkProcessor，实现故障转移的功能。

2）需求分析

1637067455399

1637116934865 3）实现步骤

（1）准备工作

在/opt/module/flume/job目录下创建group2文件夹

[molly@hadoop102 job]$ cd group2/

（2）创建flume-netcat-flume.conf

配置1个netcat source和1个channel、1个sink group（2个sink），分别输送给flume-flume-console1和flume-flume-console2。

编辑配置文件

[molly@hadoop102 group2]$ vim flume-netcat-flume.conf
#添加如下内容
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#设置sinkgroups processor为failover
a1.sinkgroups.g1.processor.type = failover
#两个sink，k1与k2,其中2个优先级是5和10
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
#failover time的上限可以通过maxpenalty 属性来进行设置默认10s。
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

（3）创建flume-flume-console1.conf

配置上级Flume输出的Source，输出是到本地控制台。

编辑配置文件

[molly@hadoop102 group2]$ vim flume-flume-console1.conf

#添加如下内容
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = logger
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

（4）创建flume-flume-console2.conf

配置上级Flume输出的Source，输出是到本地控制台。

编辑配置文件

[molly@hadoop102 group2]$ vim flume-flume-console2.conf
添加如下内容
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

（5）执行配置文件

分别开启对应配置文件：flume-flume-console2，flume-flume-console1，flume-netcat-flume。

[molly@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
[molly@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
[molly@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group2/flume-netcat-flume.conf

（6）使用netcat工具向本机的44444端口发送内容

$ nc localhost 44444

（7）查看Flume2及Flume3的控制台打印日志

因为k1的优先级是5，K2是10因此当K2正常运行的时候，是发送到K2的。

（8）将Flume2 kill，观察Flume3的控制台打印情况。

我们发现源代理发生事件到K2失败，然后他将K2放入到failover list（故障列表）

因为K1还是正常运行的，因此这个时候他会接收到数据。

注：使用jps -ml查看Flume进程。

4.5聚合

1）案例需求：

hadoop102上的Flume-1监控文件/opt/module/group.log，

hadoop103上的Flume-2监控某一个端口的数据流，

Flume-1与Flume-2将数据发送给hadoop104上的Flume-3，Flume-3将最终数据打印到控制台。

2）需求分析

3）实现步骤：

（1）准备工作

分发Flume

[molly@hadoop102 module]$ xsync flume

在hadoop102、hadoop103以及hadoop104的/opt/module/flume/job目录下创建一个group3文件夹。

[molly@hadoop102 job]$ mkdir group3

[molly@hadoop103 job]$ mkdir group3

[molly@hadoop104 job]$ mkdir group3

（2）创建flume1-logger-flume.conf

配置Source用于监控hive.log文件，配置Sink输出数据到下一级Flume。

在hadoop102上编辑配置文件

[molly@hadoop102 group3]$ vim flume1-logger-flume.conf

添加如下内容

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = exec

a1.sources.r1.command = tail -F /opt/module/group.log

a1.sources.r1.shell = /bin/bash -c

# Describe the sink

a1.sinks.k1.type = avro

a1.sinks.k1.hostname = hadoop104

a1.sinks.k1.port = 4141

# Describe the channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

（3）创建flume2-netcat-flume.conf

配置Source监控端口44444数据流，配置Sink数据到下一级Flume：

在hadoop103上编辑配置文件

[molly@hadoop102 group3]$ vim flume2-netcat-flume.conf

添加如下内容

# Name the components on this agent

a2.sources = r1

a2.sinks = k1

a2.channels = c1

# Describe/configure the source

a2.sources.r1.type = netcat

a2.sources.r1.bind = hadoop103

a2.sources.r1.port = 44444

# Describe the sink

a2.sinks.k1.type = avro

a2.sinks.k1.hostname = hadoop104

a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory

a2.channels.c1.type = memory

a2.channels.c1.capacity = 1000

a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a2.sources.r1.channels = c1

a2.sinks.k1.channel = c1

（4）创建flume3-flume-logger.conf

配置source用于接收flume1与flume2发送过来的数据流，最终合并后sink到控制台。

在hadoop104上编辑配置文件

[molly@hadoop104 group3]$ touch flume3-flume-logger.conf

[molly@hadoop104 group3]$ vim flume3-flume-logger.conf

添加如下内容

# Name the components on this agent

a3.sources = r1

a3.sinks = k1

a3.channels = c1

# Describe/configure the source

a3.sources.r1.type = avro

a3.sources.r1.bind = hadoop104

a3.sources.r1.port = 4141

# Describe the sink

a3.sinks.k1.type = logger

# Describe the channel

a3.channels.c1.type = memory

a3.channels.c1.capacity = 1000

a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a3.sources.r1.channels = c1

a3.sinks.k1.channel = c1

（5）执行配置文件

分别开启对应配置文件：flume3-flume-logger.conf，flume2-netcat-flume.conf，flume1-logger-flume.conf。

[molly@hadoop104 flume]$ bin/flume-ng agent –conf conf/ –name a3 –conf-file job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console

[molly@hadoop102 flume]$ bin/flume-ng agent –conf conf/ –name a2 –conf-file job/group3/flume1-logger-flume.conf

[molly@hadoop103 flume]$ bin/flume-ng agent –conf conf/ –name a1 –conf-file job/group3/flume2-netcat-flume.conf

（6）在hadoop103上向/opt/module目录下的group.log追加内容

[molly@hadoop103 module]$ echo ‘hello’ > group.log

（7）在hadoop102上向44444端口发送数据

[molly@hadoop102 flume]$ telnet hadoop102 44444

（8）检查hadoop104上数据

本文标题:flume学习笔记（二） flum事务和部署架构解析

文章作者:m01ly

发布时间:2020-11-15, 16:46:51

最后更新:2021-11-17, 14:30:23

原始链接:https://m01ly.github.io/2020/11/15/bigdata-flume2-framework/

许可协议: "署名-非商用-相同方式共享 4.0" 转载请保留原文链接及作者。