Deploying an Integrated Hadoop / ZooKeeper / Hive Environment on a CentOS Cluster

I started by purchasing six Alibaba Cloud ECS instances, installed CentOS 6.5 64-bit, and finished the basic setup (users, passwords, and so on).
The cluster plan is as follows.

Hostname | Installed software | Running processes (as shown by jps)
Client101 | jdk, hadoop | NameNode, DFSZKFailoverController
Client102 | jdk, hadoop | NameNode, DFSZKFailoverController
Client103 | jdk, hadoop, hive, kettle, sqoop | ResourceManager
Client104 | jdk, hadoop, zookeeper | DataNode, NodeManager, JournalNode, QuorumPeerMain
Client105 | jdk, hadoop, zookeeper | DataNode, NodeManager, JournalNode, QuorumPeerMain
Client106 | jdk, hadoop, zookeeper | DataNode, NodeManager, JournalNode, QuorumPeerMain

Gather the installation packages (JDK, Hadoop, ZooKeeper, Hive, and so on) and put them all under the /download/ directory on every machine.

Next, sort out each machine's IP address and hostname. An Alibaba Cloud ECS instance comes with a default hostname of the form iZxxxxZ; for convenience you can also assign your own hostname as needed, e.g. ClientXXX. Put it all together into an IP table:

IP1 iZ11111Z
IP1 Client101
......
IPn iZnnnnnZ
IPn Client10n

These entries go into /etc/hosts on every machine.
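
A minimal sketch of what the resulting /etc/hosts could look like; the 10.0.0.x addresses and iZ…Z names here are placeholders, so use your instances' real private IPs and default hostnames:

10.0.0.101  iZxxxx1Z  Client101
10.0.0.102  iZxxxx2Z  Client102
10.0.0.103  iZxxxx3Z  Client103
10.0.0.104  iZxxxx4Z  Client104
10.0.0.105  iZxxxx5Z  Client105
10.0.0.106  iZxxxx6Z  Client106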

Alibaba Cloud ECS instances generally don't have the iptables firewall enabled, but to be safe, turn it off anyway.

# Stop the firewall
service iptables status;service iptables stop
# Disable the firewall at boot
chkconfig iptables --list;chkconfig iptables off

After turning it off, reboot Linux:

reboot

Next, install the JDK on every machine. This assumes the package is already in the /download/ directory mentioned above; extract it into /usr/local:

tar -zxvf /download/jdk-7u71-linux-x64.gz -C /usr/local

Add Java to the environment variables:

nano /etc/profile

Append at the end of the file:

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin

Then reload the configuration:

source /etc/profile

Verify that the JDK is configured correctly:

java -version

Now it's time to install Hadoop. In principle this needs to be done on every machine:

tar -zxvf /download/hadoop-2.2.0-64bit.tar.gz -C /usr/local

Add Hadoop to the environment variables:

nano /etc/profile

Add a line before the export PATH line:

export HADOOP_HOME=/usr/local/hadoop-2.2.0

And append to the end of the export PATH line:

:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
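
For reference, the tail of /etc/profile should now look roughly like this, with the paths used above:

export JAVA_HOME=/usr/local/jdk1.7.0_71
export HADOOP_HOME=/usr/local/hadoop-2.2.0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin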

Then reload the environment configuration:

source /etc/profile

Verify that the Hadoop environment variables are set up correctly:

hadoop version

Next, edit the configuration files.
First change into the configuration directory:

cd /usr/local/hadoop-2.2.0/etc/hadoop

Edit the hadoop-env.sh file:

# At line 27, set:
export JAVA_HOME=/usr/local/jdk1.7.0_71

Edit the core-site.xml file:

<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://ns1</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/usr/local/hadoop-2.2.0/tmp</value>
	</property>
	<property>
		<name>ha.zookeeper.quorum</name>
		<value>crmleqee104:2181,crmleqee105:2181,crmleqee106:2181</value>
	</property>
</configuration>

Edit the hdfs-site.xml file:

<configuration>	
	<property>
		<name>dfs.nameservices</name>
		<value>ns1</value>
	</property>
	<property>
		<name>dfs.ha.namenodes.ns1</name>
		<value>nn1,nn2</value>
	</property>
	<property>
		<name>dfs.namenode.rpc-address.ns1.nn1</name>
		<value>crmleqee101:9000</value>
	</property>
	<property>
		<name>dfs.namenode.http-address.ns1.nn1</name>
		<value>crmleqee101:50070</value>
	</property>
	<property>
		<name>dfs.namenode.rpc-address.ns1.nn2</name>
		<value>crmleqee102:9000</value>
	</property>
	<property>
		<name>dfs.namenode.http-address.ns1.nn2</name>
		<value>crmleqee102:50070</value>
	</property>
	<property>
		<name>dfs.namenode.shared.edits.dir</name>
		<value>qjournal://crmleqee104:8485;crmleqee105:8485;crmleqee106:8485/ns1</value>
	</property>
	<property>
		<name>dfs.journalnode.edits.dir</name>
		<value>/usr/local/hadoop-2.2.0/journal</value>
	</property>
	<property>
		<name>dfs.ha.automatic-failover.enabled</name>
		<value>true</value>
	</property>
	<property>
		<name>dfs.client.failover.proxy.provider.ns1</name>		
		<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
	</property>
	<property>
		<name>dfs.ha.fencing.methods</name>
		<value>
			sshfence
			shell(/bin/true)
		</value>
	</property>
	<property>
		<name>dfs.ha.fencing.ssh.private-key-files</name>
		<value>/root/.ssh/id_rsa</value>
	</property>
	<property>
		<name>dfs.ha.fencing.ssh.connect-timeout</name>
		<value>30000</value>
	</property>
</configuration>

Create the mapred-site.xml file:

<configuration>
	<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>

	<property>
	    <name>mapreduce.jobhistory.address</name>
	    <value>crmleqee103:10020</value>
	</property>


	<property>
	    <name>mapreduce.jobhistory.webapp.address</name>
	    <value>crmleqee103:19888</value>
	</property>
</configuration>

Edit the yarn-site.xml file:

<configuration>
	<property>
		<name>yarn.resourcemanager.hostname</name>
		<value>crmleqee103</value>
	</property>
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>

	<property>  
		<name>yarn.resourcemanager.address</name>  
		<value>crmleqee103:8032</value>  
	</property>  
	<property>  
		<name>yarn.resourcemanager.scheduler.address</name>  
		<value>crmleqee103:8030</value>  
	</property>  
	<property>  
		<name>yarn.resourcemanager.resource-tracker.address</name>  
		<value>crmleqee103:8031</value>  
	</property>  
</configuration>

Everything above is identical on every node. There is also a shortcut: configure one machine, then push all the configuration files to the others with scp.

scp -r /usr/local/hadoop-2.2.0/etc/hadoop root@XXXX:/usr/local/hadoop-2.2.0/etc
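
A minimal loop version, assuming the hostnames below resolve via /etc/hosts and root SSH access is available (see the key setup further down):

for host in Client102 Client103 Client104 Client105 Client106; do
    scp -r /usr/local/hadoop-2.2.0/etc/hadoop root@$host:/usr/local/hadoop-2.2.0/etc
done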

Next come a few node-specific settings.

Configure the Hadoop slaves file.
On 【101】, 【102】, and 【103】, set the contents of $HADOOP_HOME/etc/hadoop/slaves to:

Client104
Client105
Client106

Next, root needs passwordless SSH along the following relationships. The key steps are to run ssh-keygen -t rsa on the initiating machine to create a key pair, and ssh-copy-id [TARGET_CLIENT] to push the public key to each machine it needs to log in to (see the sketch after this list).
From 【101】, every other machine must be reachable.
From 【102】, every other machine must be reachable.
From 【103】, all DataNodes must be reachable, i.e. 【104】【105】【106】.
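
A minimal sketch of the key setup on 【101】 (repeat on 【102】 and 【103】 with their respective targets; the exact host list is an assumption based on the table above):

# Generate a key pair for root; accept the defaults (creates /root/.ssh/id_rsa, matching dfs.ha.fencing.ssh.private-key-files)
ssh-keygen -t rsa
# Push the public key to each target host; asks for the root password once per host.
# Including the local host itself is also handy, since the start scripts SSH into it too.
for host in Client101 Client102 Client103 Client104 Client105 Client106; do
    ssh-copy-id root@$host
done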

Next, install ZooKeeper on the DataNodes, i.e. 【104】【105】【106】.
Extract it:

tar -zxvf /download/zookeeper-3.4.5.tar.gz -C /usr/local/

Add ZooKeeper to the environment variables:

nano /etc/profile

Then edit it: add a line before the export PATH line:

export ZOOKEEPER_HOME=/usr/local/zookeeper-3.4.5

And append to the end of the export PATH line:

:$ZOOKEEPER_HOME/bin

Reload the configuration:

source /etc/profile

Edit the ZooKeeper configuration file:

cd /usr/local/zookeeper-3.4.5/conf/
cp zoo_sample.cfg zoo.cfg
nano zoo.cfg
# Change this setting:
dataDir=/usr/local/zookeeper-3.4.5/tmp
# Add at the end:
server.1=Client104:2888:3888
server.2=Client105:2888:3888
server.3=Client106:2888:3888

Create the tmp directory and the myid file:

# Create the tmp directory
mkdir /usr/local/zookeeper-3.4.5/tmp
# Create an empty myid file
touch /usr/local/zookeeper-3.4.5/tmp/myid

On 【104】, 【105】, and 【106】, write each node's id into the myid file.
On 【104】, run:

echo 1 > /usr/local/zookeeper-3.4.5/tmp/myid

On 【105】, run:

echo 2 > /usr/local/zookeeper-3.4.5/tmp/myid

On 【106】, run:

echo 3 > /usr/local/zookeeper-3.4.5/tmp/myid

With ZooKeeper installed, the cluster can be initialized.
First start ZooKeeper on the DataNodes, i.e. on 【104】【105】【106】:

cd /usr/local/zookeeper-3.4.5/bin/
./zkServer.sh start
# Check the status: expect one leader and two followers
./zkServer.sh status

Then, from the control machine 【101】, start the JournalNodes (hadoop-daemons.sh starts them on the hosts listed in slaves):

cd /usr/local/hadoop-2.2.0
sbin/hadoop-daemons.sh start journalnode

Afterwards, jps on each DataNode should show a JournalNode process.
Next, format and start HDFS:

# Format the NameNode (run on 【101】); 'successfully formatted' in the last few lines means it worked
hdfs namenode -format
# Sync the freshly formatted metadata to the standby NameNode 【102】
scp -r /usr/local/hadoop-2.2.0/tmp/ crmleqee102:/usr/local/hadoop-2.2.0/
# Initialize the HA state in ZooKeeper (format ZK)
hdfs zkfc -formatZK
# Start HDFS
$HADOOP_HOME/sbin/start-dfs.sh

On the ResourceManager node 【103】, start YARN and the JobHistory server:

$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

Then run jps on each machine and check that the processes match the plan at the top (a small loop for doing this over SSH is sketched below).
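
A quick way to check every node from 【101】, assuming the passwordless SSH set up earlier; /etc/profile is sourced so that jps is on the PATH in a non-interactive shell:

for host in Client101 Client102 Client103 Client104 Client105 Client106; do
    echo "== $host =="
    ssh $host 'source /etc/profile; jps'
done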

Next, verify the cluster.
Verify HDFS HA.
On 【101】, run:

# Write a file into HDFS
hadoop fs -put /etc/profile /profile
# Check that it arrived
hadoop fs -ls /
# Find the NameNode process with jps, then kill it
jps
kill -9 <NameNode_PID>   # use the PID that jps reports for NameNode

On 【102】, check whether the standby takes over in time so the file system stays available:

hadoop fs -ls /

If the /profile file shows up, the failover worked.
On 【101】, start the NameNode process again:

sbin/hadoop-daemon.sh start namenode

【103】: this is the scary step that cost me two days. You need to get this Hadoop example job to run successfully:

cd $HADOOP_HOME
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /profile /out
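
If the job completes, you can sanity-check the result; the reducer output for this example normally lands in a part-r-00000 file under the /out path used above:

hadoop fs -ls /out
hadoop fs -cat /out/part-r-00000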

Hive only needs to be installed on the ResourceManager node 【103】.
Extract it:

tar -zxvf /download/apache-hive-1.2.1-bin.tar.gz -C /usr/local/

Add Hive to the environment variables:

nano /etc/profile
# Add a line before the export PATH line:
export HIVE_HOME=/usr/local/apache-hive-1.2.1-bin
# Append to the end of the export PATH line:
:$HIVE_HOME/bin
# Reload the configuration
source /etc/profile

Configure Hive.
Copy the MySQL Connector/J 5.1.39 jar into $HIVE_HOME/lib/.
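
A sketch, assuming the connector jar was dropped into /download/ like the other packages; adjust the filename to whatever was actually downloaded:

cp /download/mysql-connector-java-5.1.39-bin.jar $HIVE_HOME/lib/
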
Then go into the Hive configuration directory and create hive-site.xml:

nano $HIVE_HOME/conf/hive-site.xml 

The file contents are roughly as follows.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->

<configuration>
	<property>
	  <name>javax.jdo.option.ConnectionURL</name>
	  <value>jdbc:mysql://rm-bp11eg6wdm3s1621c.mysql.rds.aliyuncs.com:3306/hive?useUnicode=true&amp;characterEncoding=UTF-8</value>
	  <description>JDBC connect string for a JDBC metastore</description>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionDriverName</name>
	  <value>com.mysql.jdbc.Driver</value>
	  <description>Driver class name for a JDBC metastore</description>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionUserName</name>
	  <value>crm_crm</value>
	  <description>username to use against metastore database</description>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionPassword</name>
	  <value>XXX</value>
	  <description>password to use against metastore database</description>
	</property>

	<!--hive.server2 setting-->
	<property>
		<name>hive.server2.thrift.port</name>
		<value>10000</value>
	</property>
	<property>
		<name>hive.server2.thrift.bind.host</name>
		<value>10.47.56.170</value> <!-- address of the host where Hive is installed -->
	</property>
	<property>
		<name>hive.server2.authentication</name>
		<value>NONE</value>
	</property>
	
	<!-- optimize -->
	<property>
	  <name>mapred.reduce.tasks</name>
	  <value>-1</value>
		<description>The default number of reduce tasks per job.  Typically set
	  to a prime close to the number of available hosts.  Ignored when
	  mapred.job.tracker is "local". Hadoop set this to 1 by default, whereas Hive uses -1 as its default value.
	  By setting this property to -1, Hive will automatically figure out what should be the number of reducers.
	  </description>
	</property>

	<property>
	  <name>hive.exec.reducers.bytes.per.reducer</name>
	  <value>1000000000</value>
	  <description>size per reducer.The default is 1G, i.e if the input size is 10G, it will use 10 reducers.</description>
	</property>

	<property>
	  <name>hive.exec.reducers.max</name>
	  <value>999</value>
	  <description>max number of reducers will be used. If the one
		specified in the configuration parameter mapred.reduce.tasks is
		negative, Hive will use this one as the max number of reducers when
		automatically determine number of reducers.</description>
	</property>

	<property>
	  <name>hive.cli.print.header</name>
	  <value>true</value>
	  <description>Whether to print the names of the columns in query output.</description>
	</property>

	<property>
	  <name>hive.cli.print.current.db</name>
	  <value>true</value>
	  <description>Whether to include the current database in the Hive prompt.</description>
	</property>
	
	<property>
	  <name>hive.enforce.bucketing</name>
	  <value>true</value>
	  <description>Whether bucketing is enforced. If true, while inserting into the table, bucketing is enforced. </description>
	</property>
	
	<property>
	  <name>hive.enforce.sorting</name>
	  <value>true</value>
	  <description>Whether sorting is enforced. If true, while inserting into the table, sorting is enforced. </description>
	</property>

	<property>
	  <name>hive.optimize.bucketingsorting</name>
	  <value>true</value>
	  <description>If hive.enforce.bucketing or hive.enforce.sorting is true, don't create a reducer for enforcing
		bucketing/sorting for queries of the form: 
		insert overwrite table T2 select * from T1;
		where T1 and T2 are bucketed/sorted by the same keys into the same number of buckets.
	  </description>
	</property>
	
	<property>
	  <name>hive.exec.dynamic.partition</name>
	  <value>true</value>
	  <description>Whether or not to allow dynamic partitions in DML/DDL.</description>
	</property>

	<property>
	  <name>hive.exec.dynamic.partition.mode</name>
	  <value>nonstrict</value>
	  <description>In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions.</description>
	</property>

	<property>
	  <name>hive.exec.max.dynamic.partitions</name>
	  <value>100000</value>
	  <description>Maximum number of dynamic partitions allowed to be created in total.</description>
	</property>
	
	<property>
	  <name>hive.exec.parallel</name>
	  <value>true</value>
	  <description>Whether to execute jobs in parallel</description>
	</property>

	<property>
	  <name>hive.exec.parallel.thread.number</name>
	  <value>8</value>
	  <description>How many jobs at most can be executed in parallel</description>
	</property>
	
	
</configuration>

Create the hive metastore database on the target RDS instance; remember to set its character set to latin1.
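
A minimal sketch using the mysql command-line client, assuming one is installed somewhere reachable; the endpoint and user are the ones from hive-site.xml above:

mysql -h rm-bp11eg6wdm3s1621c.mysql.rds.aliyuncs.com -u crm_crm -p \
    -e "CREATE DATABASE hive DEFAULT CHARACTER SET latin1;"
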
Test Hive:

hive
hive> create table trade_detail(id bigint, account string, income double, expenses double, time string) row format delimited fields terminated by '\t';

If that doesn't throw errors, you're more or less done.

As for things like Kettle and Sqoop: that's still a blank for me, I honestly don't know yet.
