Setting Up Hadoop + Hive with Docker

Z技术 · April 30, 2021

Original article: 基于Docker搭建Hadoop+Hive - Zhihu

To go along with the production Hadoop deployment, I set up a local test environment with Docker (mainly to save effort). Starting from an existing Hadoop image on Aliyun, I installed the Hive components on top of it, referring to these two articles:

克里斯:基于 Docker 构建 Hadoop 平台
docker上从零开始搭建hadoop和hive环境

Hadoop and Hive have version-compatibility constraints, so check the compatibility information on the official site before installing:

http://hive.apache.org/downloads.html

The versions used here are:

  • Docker 19.03.8
  • JDK 1.8
  • Hadoop 3.2.0
  • Hive 3.1.2
  • MySQL 8.0.18
  • mysql-connector-java-5.1.49.jar
  • hive_jdbc_2.5.15.1040

Hadoop Part:

I. Pull the image

    docker pull registry.cn-hangzhou.aliyuncs.com/hadoop_test/hadoop_base

II. Run the containers

Inside the container, the workers file lists three machines: Master, Slave1 and Slave2.

To locate the workers file, check /etc/profile; the environment variables configured there point to the Hadoop installation directory.

    # Spin up a throwaway container just to inspect the configuration
    docker run -it --name hadoop-test registry.cn-hangzhou.aliyuncs.com/hadoop_test/hadoop_base
    # Check the environment variable settings
    vim /etc/profile
    # Check the workers file
    vim /usr/local/hadoop/etc/hadoop/workers
  • Create a dedicated internal network for Hadoop
    # Use a fixed subnet
    docker network create --driver=bridge --subnet=172.19.0.0/16 hadoop
  • Create the Master container and map its ports
    Port 10000 is the hiveserver2 port; a local client will connect to Hive through it with beeline later on. If other components need to be installed, map their ports up front as well, since adding port mappings to an already running container is a hassle.
    docker run -it --network hadoop -h Master --name Master -p 9870:9870 -p 8088:8088 -p 10000:10000 registry.cn-hangzhou.aliyuncs.com/hadoop_test/hadoop_base bash
  • Create the Slave1 container
    docker run -it --network hadoop -h Slave1 --name Slave1 registry.cn-hangzhou.aliyuncs.com/hadoop_test/hadoop_base bash
  • Create the Slave2 container
    docker run -it --network hadoop -h Slave2 --name Slave2 registry.cn-hangzhou.aliyuncs.com/hadoop_test/hadoop_base bash
  • On all three machines, edit the hosts file with vim /etc/hosts (a quick connectivity check follows this list)
    172.19.0.4 Master
    172.19.0.3 Slave1
    172.19.0.2 Slave2
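
To confirm the three containers can reach each other by hostname after the hosts files are edited, a quick check from Master (assuming ping is installed in the image, which it may not be in a minimal build):

    root@Master:/# ping -c 1 Slave1
    root@Master:/# ping -c 1 Slave2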

III. Start Hadoop

Although the Hadoop paths are already configured in the system variables inside the container, you have to run source /etc/profile each time you enter the container before they take effect.
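
As an optional convenience (my own addition, assuming bash is root's login shell in this image), the source command can be appended to /root/.bashrc so it runs automatically on every login:

    echo 'source /etc/profile' >> /root/.bashrc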

  • Enter Master and start Hadoop; format HDFS first
    # Enter the Master container
    docker exec -it Master bash
    # Once inside, format HDFS
    root@Master:/# hadoop namenode -format
  • Start everything, HDFS and YARN included
    root@Master:/usr/local/hadoop/sbin# ./start-all.sh

The services come up, and the monitoring pages can be viewed locally via the host machine's IP on ports 8088 and 9870.

    Starting namenodes on [Master]
    Master: Warning: Permanently added 'master,172.19.0.4' (ECDSA) to the list of known hosts.
    Starting datanodes
    Slave1: Warning: Permanently added 'slave1,172.19.0.3' (ECDSA) to the list of known hosts.
    Slave2: Warning: Permanently added 'slave2,172.19.0.2' (ECDSA) to the list of known hosts.
    Slave1: WARNING: /usr/local/hadoop/logs does not exist. Creating.
    Slave2: WARNING: /usr/local/hadoop/logs does not exist. Creating.
    Starting secondary namenodes [Master]
    Starting resourcemanager
    Starting nodemanagers
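
Since ports 9870 and 8088 were mapped when the Master container was created, the web UIs can be opened from the host directly, for example:

    http://localhost:9870    # HDFS NameNode web UI
    http://localhost:8088    # YARN ResourceManager web UI

Replace localhost with the host machine's IP when browsing from another machine.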

Check the status of the distributed filesystem:

    root@Master:/usr/local/hadoop/sbin# hdfs dfsadmin -report

IV. Run the built-in WordCount example

Use the license file as the text to be counted:

    root@Master:/usr/local/hadoop# cat LICENSE.txt > file1.txt
    root@Master:/usr/local/hadoop# ls
    LICENSE.txt NOTICE.txt README.txt bin etc file1.txt include lib libexec logs sbin share

Create an input directory in HDFS:

    root@Master:/usr/local/hadoop# hadoop fs -mkdir /input

Upload file1.txt to HDFS:

    root@Master:/usr/local/hadoop# hadoop fs -put file1.txt /input
    2020-09-14 11:02:01,183 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false

List the contents of the input directory in HDFS:

    root@Master:/usr/local/hadoop# hadoop fs -ls /input
    Found 1 items
    -rw-r--r-- 2 root supergroup 150569 2020-09-14 11:02 /input/file1.txt

Run the wordcount example program:

    root@Master:/usr/local/hadoop# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /input /output

List the contents of the /output directory in HDFS:

    root@Master:/usr/local/hadoop# hadoop fs -ls /output
    Found 2 items
    -rw-r--r-- 2 root supergroup 0 2020-09-14 11:09 /output/_SUCCESS
    -rw-r--r-- 2 root supergroup 35324 2020-09-14 11:09 /output/part-r-00000

View the contents of part-r-00000, which holds the result of the job:

    root@Master:/usr/local/hadoop# hadoop fs -cat /output/part-r-00000

That wraps up the Hadoop part.

Hive Part:

Download Hive and upload it into the container.

Download page: Index of /apache/hive/hive-3.1.2

    https://mirror-hk.koddos.net/apache/hive/hive-3.1.2/

I. Extract the installation package

    # Copy the package into the Master container
    docker cp apache-hive-3.1.2-bin.tar.gz Master:/usr/local
    # Enter the container
    docker exec -it Master bash
    cd /usr/local/
    # Extract the package
    tar xvf apache-hive-3.1.2-bin.tar.gz

II. Modify the configuration file

    root@Master:/usr/local/apache-hive-3.1.2-bin/conf# cp hive-default.xml.template hive-site.xml
    root@Master:/usr/local/apache-hive-3.1.2-bin/conf# vim hive-site.xml

Add the following properties at the very beginning (inside <configuration>), so that the ${system:java.io.tmpdir} and ${system:user.name} placeholders used elsewhere in the file resolve properly:

    <property>
        <name>system:java.io.tmpdir</name>
        <value>/tmp/hive/java</value>
    </property>
    <property>
        <name>system:user.name</name>
        <value>${user.name}</value>
    </property>

III. Configure the Hive environment variables

    vim /etc/profile

    # Append at the end of the file
    export HIVE_HOME="/usr/local/apache-hive-3.1.2-bin"
    export PATH=$PATH:$HIVE_HOME/bin

After editing, run source /etc/profile for the changes to take effect:

    source /etc/profile
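
A quick way to confirm the variables took effect (my own addition; echo $HIVE_HOME works just as well):

    root@Master:/usr/local# which hive
    # expected output: /usr/local/apache-hive-3.1.2-bin/bin/hive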

IV. Configure MySQL as the metastore database

1. Pull the MySQL image and create the container:

    # Pull the image
    docker pull mysql:8.0.18
    # Create the container
    docker run --name mysql_hive -p 4306:3306 --net hadoop --ip 172.19.0.5 -v /root/mysql:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=abc123456 -d mysql:8.0.18
    # Enter the container
    docker exec -it mysql_hive bash
    # Enter mysql
    mysql -uroot -p
    # The password was set to abc123456 when the container was created
    # Create the hive database
    create database hive;
    # Allow remote connections
    ALTER USER 'root'@'%' IDENTIFIED WITH mysql_native_password BY 'abc123456';

Note: container IPs are assigned much like Windows DHCP addresses; without a fixed IP, a container may come back with a different address after a restart, so containers hosting databases should be pinned to a fixed IP.
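
To double-check which address a container actually received (a generic Docker check run on the host, not part of the original steps):

    # list the containers attached to the hadoop network together with their IPs
    docker network inspect hadoop
    # or print the IP of a single container, e.g. the MySQL one
    docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' mysql_hive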

2. Go back to the Master container and update the database-related configuration

    docker exec -it Master bash
    vim /usr/local/apache-hive-3.1.2-bin/conf/hive-site.xml

Search for the relevant keys and set the database URL, driver, username and password; the URL must match the address the MySQL container was given above.

    <!-- Note: in hive-site.xml the URL parameters are separated with &amp;, and newer MySQL versions
         expect SSL by default, so it is switched off here with useSSL=false -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>abc123456</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://172.19.0.5:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>

3. Upload the MySQL driver to Hive's lib directory

    # The driver was already uploaded to /usr/local in the container together with the Hive package
    root@Master:/usr/local# cp mysql-connector-java-5.1.49.jar /usr/local/apache-hive-3.1.2-bin/lib

4. Jar adjustments

A few files under Hive's lib directory need to be changed, otherwise initializing the metastore will fail.

    # Only one slf4j binding may exist between Hadoop and Hive; delete the one on the Hive side
    root@Master:/usr/local/apache-hive-3.1.2-bin/lib# rm log4j-slf4j-impl-2.10.0.jar

    # For guava, delete the older of the two copies and replace it with the newer one; here Hive's copy is removed and Hadoop's is copied in
    root@Master:/usr/local/hadoop/share/hadoop/common/lib# cp guava-27.0-jre.jar /usr/local/apache-hive-3.1.2-bin/lib
    root@Master:/usr/local/hadoop/share/hadoop/common/lib# rm /usr/local/apache-hive-3.1.2-bin/lib/guava-19.0.jar

    # Remove the illegal character on line 3225 of hive-site.xml
    root@Master: vim /usr/local/apache-hive-3.1.2-bin/conf/hive-site.xml
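
For reference (my own note, not from the original article): in the stock 3.1.2 template the offending character is typically the &#8; entity embedded in the description of hive.txn.xlock.iow, and the exact line number can vary slightly. vim can jump straight to the line:

    vim +3225 /usr/local/apache-hive-3.1.2-bin/conf/hive-site.xml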

V. Initialize the metastore

    root@Master:/usr/local/apache-hive-3.1.2-bin/bin# schematool -initSchema -dbType mysql

On success it prints:

    Metastore connection URL: jdbc:mysql://172.19.0.5:3306/hive?createDatabaseIfNotExist=true&useSSL=false
    Metastore Connection Driver : com.mysql.jdbc.Driver
    Metastore connection User: root
    Starting metastore schema initialization to 3.1.0
    Initialization script hive-schema-3.1.0.mysql.sql
    Initialization script completed
    schemaTool completed

VI. Verify

  • First create a data file under /usr/local
    cd /usr/local
    vim test.txt

with the following contents:

    1,jack
    2,hel
    3,nack
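
Equivalently, the file can be written without an interactive editor (a small convenience of my own):

    printf '1,jack\n2,hel\n3,nack\n' > /usr/local/test.txt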

  • Enter the Hive interactive shell
    root@Master:/usr/local# hive
    Hive Session ID = 7bec2ab6-e06d-4dff-8d53-a64611875aeb

    Logging initialized using configuration in jar:file:/usr/local/apache-hive-3.1.2-bin/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
    Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Hive Session ID = 5cdee915-5c95-4834-bd3b-bec6f3d90e5b
    hive> create table test(
        > id int
        > ,name string
        > )
        > row format delimited
        > fields terminated by ',';
    OK
    Time taken: 1.453 seconds
    hive> load data local inpath '/usr/local/test.txt' into table test;
    Loading data to table default.test
    OK
    Time taken: 0.63 seconds
    hive> select * from test;
    OK
    1 jack
    2 hel
    3 nack
    Time taken: 1.611 seconds, Fetched: 3 row(s)

Hive installation is complete.

Starting Hiveserver2

I. Modify some Hadoop permission settings

    root@Master:/usr/local# vim /usr/local/hadoop/etc/hadoop/core-site.xml
  • Add the following properties
    <property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>
  • Restart HDFS:
    root@Master:/usr/local/hadoop/sbin# ./stop-dfs.sh
    Stopping namenodes on [Master]
    Stopping datanodes
    Stopping secondary namenodes [Master]
    root@Master:/usr/local/hadoop/sbin# ./start-dfs.sh
    Starting namenodes on [Master]
    Starting datanodes
    Starting secondary namenodes [Master]

II. Start hiveserver2 in the background

    root@Master:/usr/local/hadoop/sbin# nohup hiveserver2 >/dev/null 2>/dev/null &
    [2] 7713
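
hiveserver2 can take a little while to come up; if port 10000 never opens, check the Hive log, which (assuming the default log location for the root user) can be followed with:

    tail -f /tmp/root/hive.log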

III. Verify

Connect with beeline and run a query.

Check that port 10000 is listening, start beeline, connect with the !connect command, and confirm the query returns results:

    !connect jdbc:hive2://localhost:10000/default
    root@Master:/usr/local/hadoop/sbin# netstat -ntulp |grep 10000
    tcp 0 0 0.0.0.0:10000 0.0.0.0:* LISTEN 7388/java
    root@Master:/usr/local/hadoop/sbin# beeline
    Beeline version 3.1.2 by Apache Hive
    beeline> !connect jdbc:hive2://localhost:10000/default
    Connecting to jdbc:hive2://localhost:10000/default
    Enter username for jdbc:hive2://localhost:10000/default: root
    Enter password for jdbc:hive2://localhost:10000/default: *********
    Connected to: Apache Hive (version 3.1.2)
    Driver: Hive JDBC (version 3.1.2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    0: jdbc:hive2://localhost:10000/default> select * from test;
    INFO : Compiling command(queryId=root_20200915075948_0672fa16-435e-449c-9fcd-f71fcdf6841c): select * from test
    INFO : Concurrency mode is disabled, not creating a lock manager
    INFO : Semantic Analysis Completed (retrial = false)
    INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:test.id, type:int, comment:null), FieldSchema(name:test.name, type:string, comment:null)], properties:null)
    INFO : Completed compiling command(queryId=root_20200915075948_0672fa16-435e-449c-9fcd-f71fcdf6841c); Time taken: 2.584 seconds
    INFO : Concurrency mode is disabled, not creating a lock manager
    INFO : Executing command(queryId=root_20200915075948_0672fa16-435e-449c-9fcd-f71fcdf6841c): select * from test
    INFO : Completed executing command(queryId=root_20200915075948_0672fa16-435e-449c-9fcd-f71fcdf6841c); Time taken: 0.004 seconds
    INFO : OK
    INFO : Concurrency mode is disabled, not creating a lock manager
    +----------+------------+
    | test.id | test.name |
    +----------+------------+
    | 1 | jack |
    | 2 | hel |
    | 3 | nack |
    +----------+------------+
    3 rows selected (3.194 seconds)
    0: jdbc:hive2://localhost:10000/default>

Hiveserver2 configuration is complete.

Connecting to Hive from the local machine with SQL Developer

The tool is a free download from Oracle's website; the Hive JDBC driver is downloaded from Cloudera's website and registered under the tool's third-party driver settings.

Oracle SQL Developer download


Download Hive JDBC Driver 2.5.15

The IP is the host machine's IP, and the port is the 10000 port mapped when the Master container was created at the beginning.
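
In other words, the connection the client ends up using is equivalent to a JDBC URL of the following form (with <host-ip> standing in for your machine's address):

    jdbc:hive2://<host-ip>:10000/default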

You should be able to see the test table created during the earlier verification step.

The image has been uploaded to Aliyun:

    registry.cn-shenzhen.aliyuncs.com/rango/myhadoop
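
It should be possible to pull it directly (the exact tag depends on what was pushed, so this may need adjusting):

    docker pull registry.cn-shenzhen.aliyuncs.com/rango/myhadoop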
