Hive Study Notes (3): Partitioned and Bucketed Tables in Hive


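When big data is queried without any organization, a query has to scan all of the data, which is slow, so planning how the data is stored matters a great deal. For example, the data can be partitioned by date (that is, stored in separate directories), or bucketing can be used to split the data into separate files. (Note that the partitioning discussed here is a different concept from the partitioning used with sorting in the previous post: that one applies at query time to data that is already stored, whereas a partitioned table determines how the data is laid out when it is stored.)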

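1 Partitioned tables

Each partition of a partitioned table corresponds to a separate directory on the HDFS file system, and that directory holds all the data files of the partition. Partitioning in Hive simply means splitting into directories: a large dataset is divided into smaller ones according to business needs. At query time, an expression in the WHERE clause selects only the partitions the query needs, which makes such queries much more efficient.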
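The idea that a partition is just a directory, and that a filter on the partition column decides which directories are read, can be sketched in a few lines of Python. This is a toy model rather than Hive's actual implementation; the warehouse root and the helper names `partition_dir` and `prune` are assumptions made for illustration.

```python
# Toy model: each partition of a table is one directory named key=value.
# The warehouse root below is an assumed example path, not read from Hive.
WAREHOUSE = "/user/hive/warehouse"


def partition_dir(table, spec):
    """Build the directory for a partition spec such as {"day": "20200401"}."""
    parts = "/".join(f"{key}={value}" for key, value in spec.items())
    return f"{WAREHOUSE}/{table}/{parts}"


def prune(table, days, wanted_day):
    """Return only the directories a filter like day='...' would need to read."""
    return [partition_dir(table, {"day": d}) for d in days if d == wanted_day]


all_days = ["20200401", "20200402", "20200403"]
to_scan = prune("dept_partition", all_days, "20200401")
print(to_scan)
```

Only one of the three directories ends up in `to_scan`, which is the reason a query restricted to one partition touches far less data than a full scan.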

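1.1 Basic operations on partitioned tables

1) Motivation for partitioned tables (logs need to be managed by date; simulated here with department data)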

dept_20200401.log
dept_20200402.log
dept_20200403.log
……

2) Syntax for creating a partitioned table: partitioned by specifies that the table is partitioned on the day field

hive (default)> create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';

3) Loading data into the partitioned table

(1) Data preparation (fields are separated by tabs)

dept_20200401.log
10 ACCOUNTING 1700
20 RESEARCH  1800
dept_20200402.log
30 SALES  1900
40 OPERATIONS 1700
dept_20200403.log
50 TEST  2000
60 DEV 1900
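Because the table is declared with fields terminated by '\t', the sample files need real tab characters between fields. The following Python sketch writes the three files from the listing above with tab separators; the temporary output directory is an assumption standing in for the data directory used in the load commands, and `write_samples` is a hypothetical helper.

```python
import os
import tempfile

# Rows copied from the listing above; each file holds the data for one day.
SAMPLE = {
    "dept_20200401.log": [(10, "ACCOUNTING", 1700), (20, "RESEARCH", 1800)],
    "dept_20200402.log": [(30, "SALES", 1900), (40, "OPERATIONS", 1700)],
    "dept_20200403.log": [(50, "TEST", 2000), (60, "DEV", 1900)],
}


def write_samples(target_dir):
    """Write each file with tab-separated fields, one row per line."""
    paths = []
    for name, rows in SAMPLE.items():
        path = os.path.join(target_dir, name)
        with open(path, "w", encoding="utf-8") as handle:
            for row in rows:
                handle.write("\t".join(str(field) for field in row) + "\n")
        paths.append(path)
    return paths


# A temporary directory stands in for the real data directory here.
out_dir = tempfile.mkdtemp()
written = write_samples(out_dir)
print(written)
```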

(2) Load the data; partition(day='20200402') after the table name dept_partition specifies the target partition

hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition partition(day='20200401');
hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200402.log' into table dept_partition partition(day='20200402');

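Note: when loading data into a partitioned table, the partition must be specified.

The files in HDFS are as follows:

[Screenshot of the HDFS directory listing; image not available]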

4) Querying data in the partitioned table

Single-partition query, filtering by day

hive (default)> select * from dept_partition where day='20200401';

Multi-partition query

hive (default)> select * from dept_partition where day='20200401'
       union
       select * from dept_partition where day='20200402'
       union
       select * from dept_partition where day='20200403';
hive (default)> select * from dept_partition where day='20200401' or
        day='20200402' or day='20200403' ;

5) Adding partitions

Add a single partition

hive (default)> alter table dept_partition add partition(day='20200404') ;

Add several partitions at once

hive (default)> alter table dept_partition add partition(day='20200405') partition(day='20200406');

6) Dropping partitions

Drop a single partition

hive (default)> alter table dept_partition drop partition (day='20200406');

Drop several partitions at once

hive (default)> alter table dept_partition drop partition (day='20200404'), partition(day='20200405');

7) List the partitions of the partitioned table

hive> show partitions dept_partition;

8) View the structure of the partitioned table

hive> desc formatted dept_partition;

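1.2 Two-level partitions

Question: if the log volume of a single day is also large, how can the data be split further?

1) Create a two-level partitioned table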

hive (default)> create table dept_partition2(
               deptno int, dname string, loc string
               )
               partitioned by (day string, hour string)
               row format delimited fields terminated by '\t';

2) Loading data the normal way

(1) Load data into the two-level partitioned table

hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401', hour='12');

(2) Query the partition data

hive (default)> select * from dept_partition2 where day='20200401' and hour='12';

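3) Three ways to upload data directly into a partition directory and associate it with the partitioned table

(1) Method 1: upload the data first, then repair the partitions, after which the data can be queried

Upload the data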

hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
hive (default)> dfs -put /opt/module/hive/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;

Query the data (the newly uploaded data is not found)

hive (default)> select * from dept_partition2 where day='20200401' and hour='13';

Run the partition repair command

hive> msck repair table dept_partition2;

Query the data again

hive (default)> select * from dept_partition2 where day='20200401' and hour='13';
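The first query finds nothing because the directory exists on HDFS but the partition is not yet registered in the table's metadata; the repair step reconciles the two. The Python sketch below models that reconciliation with plain sets. It is a simplified illustration under that assumption, not Hive's actual code, and `parse_partition` and `find_unregistered` are hypothetical helpers.

```python
def parse_partition(relative_dir):
    """Turn 'day=20200401/hour=13' into (('day', '20200401'), ('hour', '13'))."""
    return tuple(tuple(part.split("=", 1)) for part in relative_dir.split("/"))


def find_unregistered(dirs_on_hdfs, registered):
    """Partitions that have a directory but no metadata entry yet."""
    return [parse_partition(d) for d in dirs_on_hdfs
            if parse_partition(d) not in registered]


# Directories under the table location (hour=13 was uploaded by hand).
dirs_on_hdfs = ["day=20200401/hour=12", "day=20200401/hour=13"]
# Partitions the metadata already knows about (created by load data).
registered = {parse_partition("day=20200401/hour=12")}

missing = find_unregistered(dirs_on_hdfs, registered)
print(missing)
```

In this model, the hour=13 partition is the only entry in `missing`, and registering it is what makes the second query return the uploaded rows.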

(2) Method 2: upload the data, then add the partition manually

Upload the data

hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
hive (default)> dfs -put /opt/module/hive/datas/dept_20200401.log /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;

Add the partition

hive (default)> alter table dept_partition2 add partition(day='20200401',hour='14');

Query the data

hive (default)> select * from dept_partition2 where day='20200401' and hour='14';

(3) Method 3: create the directory, then load data into the partition

Create the directory

hive (default)> dfs -mkdir -p /user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;

Load the data

hive (default)> load data local inpath '/opt/module/hive/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401',hour='15');

Query the data

hive (default)> select * from dept_partition2 where day='20200401' and hour='15';

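1.3 Dynamic partitioning

In the examples above, every partition was specified by hand, but in real applications with large data volumes it is not feasible to assign each record to a partition manually, so Hive offers a feature called dynamic partitioning. In a relational database, when data is inserted into a partitioned table, the database automatically places each row in the right partition based on the value of the partition column. Hive provides a similar mechanism, dynamic partitioning (Dynamic Partition), except that it requires some configuration.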

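1) Parameter settings for dynamic partitioning (items 3 to 6 can be left at their default values)

(0) Check whether dynamic partitioning is enabled (default true, enabled)

hive (default)> set hive.exec.dynamic.partition;

(1) Enable dynamic partitioning (default true, enabled)

hive (default)> set hive.exec.dynamic.partition=true;

(2) Set non-strict mode (the dynamic partition mode defaults to strict, which requires at least one partition column to be static; nonstrict allows all partition columns to be dynamic)

hive (default)> set hive.exec.dynamic.partition.mode=nonstrict;

(3) The maximum total number of dynamic partitions that can be created across all nodes executing the MR job. Default 1000

hive (default)> set hive.exec.max.dynamic.partitions=1000;

(4) The maximum number of dynamic partitions that can be created on each node executing the MR job. This parameter has to be set according to the actual data. For example, if the source data covers one year, so that the day field has 365 distinct values, the parameter must be set to more than 365; with the default value of 100 the job fails with an error.

hive (default)> set hive.exec.max.dynamic.partitions.pernode=100;

(5) The maximum number of HDFS files that can be created in the whole MR job. Default 100000

hive (default)> set hive.exec.max.created.files=100000;

(6) Whether to throw an exception when an empty partition is generated. Usually does not need to be set. Default false

hive (default)> set hive.error.on.empty.partition=false;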

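2) Worked example

Requirement: insert the department data into the matching partitions of the target table dept_partition_dy according to time (the day field).

(1) Create the target partitioned table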

hive (default)> create table dept_partition_dy(id int, name string,loc string) partitioned by (day string) row format delimited fields terminated by '\t';

(2) Configure dynamic partitioning

hive (default)> set hive.exec.dynamic.partition.mode = nonstrict;

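(3) Insert the data

Method 1: load local data into Hive

hive (default)> load data local inpath '/opt/module/hive-3.4.2/datas/dept_partition_dy.txt' into table dept_partition_dy;

At this point you can see that a MapReduce job is launched, because computation is involved. Be aware that the load may fail with an error saying that dept_partition_dy.txt cannot be found: since computation is involved, YARN schedules the job, and the task does not necessarily run on the current machine hadoop102; it may run on hadoop103 instead, where the local file does not exist. To avoid this error, it is best to upload the data to HDFS first and then load it from there.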

hive (default)> load data inpath '/dept_partition_dy.txt' into table dept_partition_dy;

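Method 2: insert the data with insert

Because load in older Hive versions does not launch a MapReduce job, we can instead first load the data into a plain table and then move it into the partitioned table with an insert ... select query.

(a) First load the data into the plain table dept_dy (a non-partitioned table, created beforehand, whose last column is day)

hive (default)> load data local inpath '/opt/module/hive-3.4.2/datas/dept_partition_dy.txt' into table dept_dy;

(b) Use an insert ... select query to move the data into the partitioned table dept_partition_dy; the insert operation launches a MapReduce job

hive (default)> insert into table dept_partition_dy partition(day) select * from dept_dy;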

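(4) Check the partitions and the data of the target partitioned table

hive (default)> select * from dept_partition_dy;
hive (default)> show partitions dept_partition_dy;

Question: how does the target partitioned table match rows to the partition field?

The inserted data must contain the partition field, that is, the day field mentioned above, so that each row can be placed in the matching partition according to the day value in the data file. For example, the imported file looks like this:

[Screenshot of the contents of the imported data file; image not available]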
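The matching can be pictured as grouping rows by the value of the partition field, which comes last in each row. The Python sketch below is a toy model of that routing, not Hive's implementation; the sample rows are made up for illustration, and the `max_partitions_per_node` check only mirrors the idea behind hive.exec.max.dynamic.partitions.pernode.

```python
from collections import defaultdict


def route_rows(lines, max_partitions_per_node=100):
    """Group tab-separated rows by their last field, the partition value."""
    partitions = defaultdict(list)
    for line in lines:
        *columns, day = line.rstrip("\n").split("\t")
        partitions[day].append(columns)
    # Mirrors the idea behind hive.exec.max.dynamic.partitions.pernode.
    if len(partitions) > max_partitions_per_node:
        raise ValueError("too many dynamic partitions")
    return dict(partitions)


# Made-up rows for illustration: id, name, loc, then day as the last field.
rows = [
    "10\tACCOUNTING\t1700\t20200401",
    "20\tRESEARCH\t1800\t20200401",
    "30\tSALES\t1900\t20200402",
]
result = route_rows(rows)
print(sorted(result))
```

Each key of `result` plays the role of one day=... partition, and the remaining columns of each row are what would be stored inside that partition.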

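2 Bucketed tables

Partitioning offers a convenient way to isolate data and optimize queries. However, not every dataset can be partitioned sensibly. For a table or a partition, Hive can organize the data further into buckets, a finer-grained division of the data.

Bucketing is another technique for breaking a dataset into more manageable parts.

Partitioning applies to the storage path of the data; bucketing applies to the data files.

2.1 Creating a bucketed table

(1) Data preparation

(2) Create the bucketed table, bucketing on the id field into 4 buckets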

create table stu_bucket(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

(3) View the table structure

hive (default)> desc formatted stu_bucket;

(4) Load data into the bucketed table using load

hive (default)> load data inpath  '/student.txt' into table stu_bucket;

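(5) Check whether the bucketed table was split into 4 buckets

[Screenshot of the 4 bucket files in HDFS; image not available]

(6) Query the bucketed data

hive (default)> select * from stu_bucket;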

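(7) Bucketing rule:

The result shows that Hive hashes the value of the bucketing column and takes the remainder after dividing by the number of buckets to decide which bucket a record is stored in.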
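The hash-then-remainder rule can be sketched in Python as follows. The sketch assumes that the hash of an int key is the int itself and that the hash is masked to a non-negative value before the remainder is taken; the hash function Hive really applies may differ between versions, so treat this as an illustration of the rule rather than a reproduction of Hive's behavior. The ids are hypothetical.

```python
NUM_BUCKETS = 4  # matches "into 4 buckets" in the table definition


def bucket_for(id_value, num_buckets=NUM_BUCKETS):
    """Bucket index for an int key, assuming the hash of an int is the int itself."""
    hash_code = id_value  # assumption: identity hash for int columns
    # Mask to a non-negative 31-bit value before taking the remainder.
    return (hash_code & 0x7FFFFFFF) % num_buckets


# Hypothetical student ids, used only to show the spread across buckets.
ids = [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008]
assignment = {i: bucket_for(i) for i in ids}
print(assignment)
```

Under these assumptions, consecutive ids cycle through the four buckets, so each bucket file receives roughly a quarter of the rows.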

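2.2 Things to note when working with bucketed tables

(1) Set the number of reducers to -1 so that the job decides how many reducers to use, or set it to a value greater than or equal to the number of buckets

(2) Load data into the bucketed table from HDFS, to avoid the problem of a local file not being found

(3) Do not use local mode

2.3 Loading data into a bucketed table with insert

hive (default)> insert into table stu_bucket select * from student_insert;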

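Title: Hive Study Notes (3): Partitioned and Bucketed Tables in Hive

Author: m01ly

Published: 2020-11-15, 15:45:51

Last updated: 2021-11-16, 14:28:29

Original link: https://m01ly.github.io/2020/11/15/bigdata-hive3/

License: "Attribution-NonCommercial-ShareAlike 4.0". Please keep the original link and author when reposting.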