
Hive is a data warehouse infrastructure tool for processing structured data on Hadoop. It sits on top of Hadoop, belongs to the big data ecosystem, and makes querying and analysis easy. It provides a simple SQL-like query language (HiveQL) and translates SQL statements into MapReduce jobs for execution.
Without Hive: user -> MapReduce -> Hadoop data (the user has to write complex MapReduce programs)
With Hive: user -> HQL (SQL) -> Hive -> MapReduce -> Hadoop data (the user only needs to know SQL)
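As a concrete illustration of the difference, a word count that would otherwise require a hand-written MapReduce program shrinks to a single HiveQL statement (the docs table and word column here are hypothetical, just for the sketch):
select word, count(1) from docs group by word;
Hive compiles that one statement into MapReduce jobs behind the scenes, which is exactly the translation shown in the flow above.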
The Hive installation and configuration process follows.
Before installing Hive you need JDK and Hadoop; if you want to use MySQL to store the metadata, you also need MySQL installed.
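As a quick sanity check before starting, each prerequisite can be verified from the shell (assuming all three are already installed and on the PATH):
java -version
hadoop version
mysql --version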
Start Hadoop first; you can refer to my article "Hadoop Getting Started: Setting Up a Single-Node Cluster".
1. Download the package. I recommend the free Tsinghua University mirror; it is really fast:
https://mirrors.tuna.tsinghua.edu.cn/apache/hive/

I downloaded the latest version, hive-3.1.3. On Linux it is easiest to fetch it with a command:
wget http://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
2. Extract the archive
tar zxvf apache-hive-3.1.3-bin.tar.gz
3. Edit the global environment file /etc/profile, or set per-user environment variables. Since editing the global file requires privileges I don't have, the per-user setup is shown below.
Step 1: open the user environment file: vim ~/.bashrc
Append the following at the end:
export HIVE_HOME=/home/apache-hive-3.1.3-bin
export PATH=$PATH:$HIVE_HOME/bin
Step 2: run source ~/.bashrc to make the variables take effect
Step 3: run the export command to check that the new variables are present
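You can also verify directly; if HIVE_HOME points at the extracted directory, hive --version should print the Hive version:
echo $HIVE_HOME
hive --version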
4. Edit the Hive configuration files
Go into the directory that holds the configuration files, then add the core Hive configuration in a new file named hive-site.xml:
cd apache-hive-3.1.3-bin/conf
vim hive-site.xml
The key part is the MySQL connection information, used to store the metadata:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/test?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
    <!-- If MySQL and Hive are not on the same node, replace localhost with the right IP -->
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>
The following configuration is optional; it specifies the HDFS directory where the Hive warehouse data is stored:
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/hive/warehouse</value>
    <description>hive default warehouse, change it if necessary</description>
  </property>
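If you set a custom warehouse path like this, the directory should exist in HDFS and be group-writable; a minimal sketch, assuming HDFS is up and /hive/warehouse is the value from the property above:
hdfs dfs -mkdir -p /hive/warehouse
hdfs dfs -chmod g+w /hive/warehouse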
Next, configure hive-env.sh:
cp hive-env.sh.template hive-env.sh
vim hive-env.sh
# Hadoop installation directory
HADOOP_HOME=/home/hadoop-3.3.4
5. Add the MySQL driver jar to the lib directory under the Hive root
1. Download the MySQL driver package
wget --no-check-certificate https://mirrors.tuna.tsinghua.edu.cn/mysql/downloads/Connector-J/mysql-connector-java-5.1.49.tar.gz
2. Extract it
tar zxvf mysql-connector-java-5.1.49.tar.gz
3. Copy the MySQL driver jar into Hive's lib directory
cp ~/mysql-connector-java-5.1.49/mysql-connector-java-5.1.49-bin.jar /home/apache-hive-3.1.3-bin/lib
6. Initialize the metastore schema (note: the MySQL database named in the JDBC URL must exist before this step; createDatabaseIfNotExist=true should create it automatically, or see the manual sketch after the command below)
schematool -dbType mysql -initSchema
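If you prefer to create the metastore database by hand, a minimal sketch in the MySQL client; the database name test and user root are the values from hive-site.xml above, so substitute your own:
mysql -u root -p
mysql> create database test;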

7. Run Hive
From the bin directory, run:
./hive
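Since $HIVE_HOME/bin was added to PATH earlier, simply running hive from any directory should work as well:
hive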

Basic usage
1. Create a database named dw
hive> create database dw;
OK
Time taken: 0.131 seconds
After it is created, you can see it in Hadoop's web UI:

2. Switch to the new database dw
hive> use dw;
OK
Time taken: 0.03 seconds
3. Check which database is currently in use
hive> select current_database();
OK
dw
Time taken: 0.184 seconds, Fetched: 1 row(s)
4. Create a table
hive> create table `erp_user`(id int, login_name string);
OK
Time taken: 0.409 seconds
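As a quick check, describe lists the new table's columns, much like in a regular SQL shell (shown as a sketch; the exact output format varies by version):
hive> describe erp_user;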
5. Insert data
hive> insert into erp_user values(1, 'test');
Query ID = tomtop2149_20230201110137_eff737fe-12d8-4de8-a273-6212aaa3210d
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1675147712868_0001, Tracking URL = http://localhost:8088/proxy/application_1675147712868_0001/
Kill Command = /home/tomtop2149/hadoop-3.3.4/bin/mapred job -kill job_1675147712868_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2023-02-01 11:01:47,161 Stage-1 map = 0%, reduce = 0%
2023-02-01 11:01:52,339 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.87 sec
2023-02-01 11:01:56,473 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.11 sec
MapReduce Total cumulative CPU time: 5 seconds 110 msec
Ended Job = job_1675147712868_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://ip:9000/hive/warehouse/dw.db/erp_user/.hive-staging_hive_2023-02-01_11-01-37_552_7185416133968794842-1/-ext-10000
Loading data to table dw.erp_user
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.11 sec HDFS Read: 15482 HDFS Write: 237 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 110 msec
OK
Time taken: 21.67 seconds
6. Query the data
hive> select * from erp_user;
OK
1 test
Time taken: 0.149 seconds, Fetched: 1 row(s)
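Because Hive table data lives as plain files in HDFS, the inserted row can also be inspected directly under the warehouse directory (the path comes from the hive.metastore.warehouse.dir setting above):
hdfs dfs -ls /hive/warehouse/dw.db/erp_user
hdfs dfs -cat /hive/warehouse/dw.db/erp_user/*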
That wraps up the basic installation; I'll keep exploring its usage in follow-up posts...