Deploying an Integrated Hadoop / ZooKeeper / Hive Environment on a CentOS Cluster
First I bought six Alibaba Cloud ECS instances, installed CentOS 6.5 64-bit, and finished setting up users, passwords and so on.
The cluster plan is as follows.
Hostname    Installed software                 Processes (from jps)
Client101   jdk, hadoop                        NameNode, DFSZKFailoverController
Client102   jdk, hadoop                        NameNode, DFSZKFailoverController
Client103   jdk, hadoop, hive, kettle, sqoop   ResourceManager
Client104   jdk, hadoop, zookeeper             DataNode, NodeManager, JournalNode, QuorumPeerMain
Client105   jdk, hadoop, zookeeper             DataNode, NodeManager, JournalNode, QuorumPeerMain
Client106   jdk, hadoop, zookeeper             DataNode, NodeManager, JournalNode, QuorumPeerMain
Prepare all the materials (JDK, Hadoop, ZooKeeper, Hive, and so on) and put them in the /download/ directory on every machine.
Next, sort out the IP address and hostname of each machine. An Alibaba Cloud ECS instance comes with a default hostname like iZxxxxZ; for convenience you can also assign your own hostname, e.g. ClientXXX. Put everything together into an IP table.
IP1  iZ11111Z
IP1  Client101
......
IPn  iZnnnnnZ
IPn  Client10n
All of these entries go into /etc/hosts.
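As a minimal sketch (the private IPs and the iZ names beyond the first are made-up placeholders; substitute the real values from your own IP table), /etc/hosts on every machine would look something like this:

192.168.0.101  iZ11111Z  Client101
192.168.0.102  iZ22222Z  Client102
192.168.0.103  iZ33333Z  Client103
192.168.0.104  iZ44444Z  Client104
192.168.0.105  iZ55555Z  Client105
192.168.0.106  iZ66666Z  Client106

Each line maps one IP to both its default ECS hostname and the friendlier ClientXXX alias; keeping them on separate lines, as in the table above, works just as well.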
Alibaba Cloud ECS instances usually don't have the iptables firewall enabled, but to be safe, turn it off anyway.
#Stop the firewall
service iptables status
service iptables stop
#Disable starting the firewall on boot
chkconfig iptables --list
chkconfig iptables off
After turning it off, reboot Linux.
reboot
Next, install the JDK on every machine. Assume the package is already in the /download/ directory; extract it into /usr/local.
tar -zxvf /download/jdk-7u71-linux-x64.gz -C /usr/local
Add Java to the environment variables
nano /etc/profile
Append to the end of the file
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Then reload the configuration
source /etc/profile
Verify that the JDK is configured correctly
java -version
Now it's time to install Hadoop. In principle, this needs to be done on every machine.
tar -zxvf /download/hadoop-2.2.0-64bit.tar.gz -C /usr/local
Add Hadoop to the environment variables
nano /etc/profile
Add a line before the export PATH line
export HADOOP_HOME=/usr/local/hadoop-2.2.0
Append to the end of the export PATH line
:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
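For reference, after both edits the relevant lines of /etc/profile should look roughly like this (assuming the JDK entry from the previous step is already in place):

export JAVA_HOME=/usr/local/jdk1.7.0_71
export HADOOP_HOME=/usr/local/hadoop-2.2.0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin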
Then activate the environment variable configuration
source /etc/profile
Verify that the Hadoop environment variables are configured correctly
hadoop version
Next, modify the configuration files.
First, go into the configuration directory.
cd /usr/local/hadoop-2.2.0/etc/hadoop
Modify hadoop-env.sh.
#On line 27, change it to
export JAVA_HOME=/usr/local/jdk1.7.0_71
Modify core-site.xml.
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://ns1</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop-2.2.0/tmp</value>
    </property>
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>crmleqee104:2181,crmleqee105:2181,crmleqee106:2181</value>
    </property>
</configuration>
Modify hdfs-site.xml.
<configuration>
    <property>
        <name>dfs.nameservices</name>
        <value>ns1</value>
    </property>
    <property>
        <name>dfs.ha.namenodes.ns1</name>
        <value>nn1,nn2</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns1.nn1</name>
        <value>crmleqee101:9000</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.ns1.nn1</name>
        <value>crmleqee101:50070</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns1.nn2</name>
        <value>crmleqee102:9000</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.ns1.nn2</name>
        <value>crmleqee102:50070</value>
    </property>
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://crmleqee104:8485;crmleqee105:8485;crmleqee106:8485/ns1</value>
    </property>
    <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/usr/local/hadoop-2.2.0/journal</value>
    </property>
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.client.failover.proxy.provider.ns1</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>
            sshfence
            shell(/bin/true)
        </value>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/root/.ssh/id_rsa</value>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.connect-timeout</name>
        <value>30000</value>
    </property>
</configuration>
Create the mapred-site.xml file.
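Hadoop 2.2.0 only ships a template for this file in etc/hadoop, so one way to create it (an assumption on my part; the post doesn't say how it was created) is to copy the template and then fill it with the content below:

cp mapred-site.xml.template mapred-site.xml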
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>crmleqee103:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>crmleqee103:19888</value>
    </property>
</configuration>
Modify yarn-site.xml.
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>crmleqee103</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>crmleqee103:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>crmleqee103:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>crmleqee103:8031</value>
    </property>
</configuration>
Everything above is identical across machines. There's also a bit of dark magic: configure one machine, then scp the whole configuration directory to the rest.
scp -r /usr/local/hadoop-2.2.0/etc/hadoop root@XXXX:/usr/local/hadoop-2.2.0/etc
Next come a few machine-specific configurations.
Configure the Hadoop cluster's slaves file
On [101], [102], and [103], set the content of the $HADOOP_HOME/etc/hadoop/slaves file to:
Client104
Client105
Client106
Then root needs passwordless SSH along the following relationships. The key steps are running ssh-keygen -t rsa on the originating machine to create the key pair, and ssh-copy-id [TARGET_CLIENT] to send the public key to each machine you want to log into (see the sketch after the list below).
[101] must be able to reach all other machines.
[102] must be able to reach all other machines.
[103] must be able to reach all the DataNodes, i.e. [104], [105], and [106].
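A minimal sketch of what that looks like on [101], using the Client hostnames from the plan above (adjust the host list on [102] and [103] according to the relationships just listed):

#On Client101: generate a key pair, then push the public key to every other machine
ssh-keygen -t rsa
for host in Client102 Client103 Client104 Client105 Client106; do
    ssh-copy-id root@$host
done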
Next, install ZooKeeper on the DataNodes, i.e. [104], [105], and [106].
Extract it.
tar -zxvf /download/zookeeper-3.4.5.tar.gz -C /usr/local/
Add ZooKeeper to the environment variables
nano /etc/profile
Then edit it: add a line before the export PATH line
export ZOOKEEPER_HOME=/usr/local/zookeeper-3.4.5
Append to the end of the export PATH line
:$ZOOKEEPER_HOME/bin
Reload the configuration
source /etc/profile
Modify the ZooKeeper configuration file
cd /usr/local/zookeeper-3.4.5/conf/
cp zoo_sample.cfg zoo.cfg
nano zoo.cfg
#Change this setting
dataDir=/usr/local/zookeeper-3.4.5/tmp
#Add at the end:
server.1=Client104:2888:3888
server.2=Client105:2888:3888
server.3=Client106:2888:3888
Create the tmp directory and the myid file
#Create the tmp directory
mkdir /usr/local/zookeeper-3.4.5/tmp
#Create an empty file
touch /usr/local/zookeeper-3.4.5/tmp/myid
Write the server id into the myid file on [104], [105], and [106] respectively.
On [104], run
echo 1 > /usr/local/zookeeper-3.4.5/tmp/myid
On [105], run
echo 2 > /usr/local/zookeeper-3.4.5/tmp/myid
On [106], run
echo 3 > /usr/local/zookeeper-3.4.5/tmp/myid
With ZooKeeper installed, it's time to think about initialization.
First start ZooKeeper on the DataNodes, i.e. on [104], [105], and [106]
cd /usr/local/zookeeper-3.4.5/bin/
./zkServer.sh start
#Check the status: there should be one leader and two followers
./zkServer.sh status
Then, from the master machine [101], start the JournalNodes.
cd /usr/local/hadoop-2.2.0
sbin/hadoop-daemons.sh start journalnode
After that, running jps on the DataNodes should show a JournalNode process, which is the sign of success.
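Roughly what you'd expect on each DataNode at this point (the PIDs are made up and purely illustrative; QuorumPeerMain is the ZooKeeper process started earlier):

jps
#Example output, something like:
#2345 QuorumPeerMain
#2890 JournalNode
#3012 Jps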
Next, initialize and start HDFS (these commands run on [101]).
#If "successfully formatted" appears in the last few lines, the format succeeded
hdfs namenode -format
#Sync the metadata to the standby NameNode
scp -r /usr/local/hadoop-2.2.0/tmp/ crmleqee102:/usr/local/hadoop-2.2.0/
#Format ZK
hdfs zkfc -formatZK
#Start HDFS
$HADOOP_HOME/sbin/start-dfs.sh
Start YARN and the JobHistory server on the ResourceManager machine [103].
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
Then run jps on every machine and check that the processes match the table at the very top.
Next, verify the cluster.
Verify HDFS HA
On [101], run:
#Write a file into HDFS
hadoop fs -put /etc/profile /profile
#Check that the write succeeded
hadoop fs -ls /
#Find the NameNode process, then kill it
jps
kill -9 <NameNode PID>
On [102], check whether the standby NameNode has taken over promptly to keep the filesystem safe
hadoop fs -ls /
If the /profile file shows up, the failover worked.
On [101], start the NameNode process again.
sbin/hadoop-daemon.sh start namenode
On [103]: this is the scary step that cost me two days. You need to run this Hadoop example successfully.
cd $HADOOP_HOME
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /profile /out
Hive only needs to be installed on a single node, the ResourceManager machine [103].
Extract it.
tar -zxvf /download/apache-hive-1.2.1-bin.tar.gz -C /usr/local/
Add Hive to the environment variables
nano /etc/profile
#Add a line before the export PATH line
export HIVE_HOME=/usr/local/apache-hive-1.2.1-bin
#Append to the end of the export PATH line
:$HIVE_HOME/bin
#Reload the configuration
source /etc/profile
Configure Hive.
Copy the MySQL Java connector 5.1.39 into $HIVE_HOME/lib/.
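A sketch, assuming the connector jar was downloaded into /download/ and carries the usual name (the exact path and filename are assumptions):

cp /download/mysql-connector-java-5.1.39-bin.jar $HIVE_HOME/lib/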
Go into the Hive configuration directory and create hive-site.xml.
nano $HIVE_HOME/conf/hive-site.xml
The file content looks roughly like this.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements. See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License. You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://rm-bp11eg6wdm3s1621c.mysql.rds.aliyuncs.com:3306/hive?useUnicode=true&amp;characterEncoding=UTF-8</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>crm_crm</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>XXX</value>
        <description>password to use against metastore database</description>
    </property>
    <!-- hive.server2 settings -->
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>10.47.56.170</value> <!-- the host where Hive is installed -->
    </property>
    <property>
        <name>hive.server2.authentication</name>
        <value>NONE</value>
    </property>
    <!-- optimization -->
    <property>
        <name>mapred.reduce.tasks</name>
        <value>-1</value>
        <description>The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what should be the number of reducers.</description>
    </property>
    <property>
        <name>hive.exec.reducers.bytes.per.reducer</name>
        <value>1000000000</value>
        <description>Size per reducer. The default is 1G, i.e. if the input size is 10G, it will use 10 reducers.</description>
    </property>
    <property>
        <name>hive.exec.reducers.max</name>
        <value>999</value>
        <description>Max number of reducers that will be used. If the one specified in the configuration parameter mapred.reduce.tasks is negative, Hive will use this as the max number of reducers when automatically determining the number of reducers.</description>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
        <description>Whether to print the names of the columns in query output.</description>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
        <description>Whether to include the current database in the Hive prompt.</description>
    </property>
    <property>
        <name>hive.enforce.bucketing</name>
        <value>true</value>
        <description>Whether bucketing is enforced. If true, while inserting into the table, bucketing is enforced.</description>
    </property>
    <property>
        <name>hive.enforce.sorting</name>
        <value>true</value>
        <description>Whether sorting is enforced. If true, while inserting into the table, sorting is enforced.</description>
    </property>
    <property>
        <name>hive.optimize.bucketingsorting</name>
        <value>true</value>
        <description>If hive.enforce.bucketing or hive.enforce.sorting is true, don't create a reducer for enforcing bucketing/sorting for queries of the form: insert overwrite table T2 select * from T1; where T1 and T2 are bucketed/sorted by the same keys into the same number of buckets.</description>
    </property>
    <property>
        <name>hive.exec.dynamic.partition</name>
        <value>true</value>
        <description>Whether or not to allow dynamic partitions in DML/DDL.</description>
    </property>
    <property>
        <name>hive.exec.dynamic.partition.mode</name>
        <value>nonstrict</value>
        <description>In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions.</description>
    </property>
    <property>
        <name>hive.exec.max.dynamic.partitions</name>
        <value>100000</value>
        <description>Maximum number of dynamic partitions allowed to be created in total.</description>
    </property>
    <property>
        <name>hive.exec.parallel</name>
        <value>true</value>
        <description>Whether to execute jobs in parallel</description>
    </property>
    <property>
        <name>hive.exec.parallel.thread.number</name>
        <value>8</value>
        <description>How many jobs at most can be executed in parallel</description>
    </property>
</configuration>
Create the hive database on the target RDS instance, and remember to set its character set to latin1.
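For example, something along these lines from any machine with a MySQL client (the host and user come from hive-site.xml above; this is a sketch, not necessarily how it was actually done):

mysql -h rm-bp11eg6wdm3s1621c.mysql.rds.aliyuncs.com -u crm_crm -p \
    -e "CREATE DATABASE hive DEFAULT CHARACTER SET latin1;"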
Test Hive.
hive
HIVE> create table trade_detail(id bigint, account string, income double, expenses double, time string) row format delimited fields terminated by '\t';
If there's no error, you're pretty much done.
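One extra sanity check, assuming the default warehouse location hasn't been changed (hive.metastore.warehouse.dir defaults to /user/hive/warehouse):

hadoop fs -ls /user/hive/warehouse
#A trade_detail directory should show up here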
As for things like Kettle and Sqoop, that's still a blank for me; I don't know them yet either.