Zookeeper multi-node cluster setup

In the previous post, I gave an overview of Zookeeper and explained how it works (post). Zookeeper is a core system: it provides an API to manage the status of application nodes in a distributed environment. Because Zookeeper is such an important component, what happens when a Zookeeper server dies? The whole system could stop working, and that is serious. To solve this problem, Zookeeper can run as a cluster: when the leader node dies, all of the nodes run a leader-election algorithm and a new node becomes the leader. A Zookeeper cluster is therefore highly available, and below I will present in detail how to set up Zookeeper on multiple nodes.

Prerequisites : you need some servers to set up; in my case I have 4 servers for the Zookeeper cluster:

Node1 : 10.3.0.100
Node2 : 10.3.0.101
Node3 : 10.3.0.102
Node4 : 10.3.0.103

Step 1 : Add the servers to the hosts file on every node

vi /etc/hosts

10.3.0.100 cloud1
10.3.0.101 cloud2
10.3.0.102 cloud3
10.3.0.103 cloud4

Step 2 : Download the stable Zookeeper release from the Zookeeper homepage and extract it on node1

wget http://mirror.downloadvn.com/apache/zookeeper/stable/zookeeper-3.4.10.tar.gz
tar -xvf zookeeper-3.4.10.tar.gz

Step 3 : Create the folder where Zookeeper will store its data (on all nodes).

mkdir -p /data/hdfs/zookeeper

Step 4 : Set the memory used by the Zookeeper instance by creating java.env with the configuration below

vi $ZOOKEEPER_HOME/conf/java.env

Add the following content to the file:

export JVMFLAGS="-Xms4096m -Xmx4096m"

(here I set the memory for the Zookeeper instance to 4 GB)

Step 5 : Configure Zookeeper

cp zoo_sample.cfg zoo.cfg
vi zoo.cfg

Add the following configuration to the file:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/hdfs/zookeeper
clientPort=2181
maxClientCnxns=2000
server.1=cloud1:2888:3888
server.2=cloud2:2888:3888
server.3=cloud3:2888:3888
server.4=cloud4:2888:3888

Explanation of some of the parameters used:

maxClientCnxns : maximum number of client connections to a node (per client IP).
clientPort : the current node listens for clients on port 2181.
dataDir : path where Zookeeper stores its data.
3888 : port used for leader election between the nodes in the cluster.
2888 : port that followers use to connect to the leader.

Step 6 : Copy the Zookeeper folder to all of the other nodes

rsync -avz zookeeper-3.4.10/ cloud2:/data/zookeeper-3.4.10/
rsync -avz zookeeper-3.4.10/ cloud3:/data/zookeeper-3.4.10/
rsync -avz zookeeper-3.4.10/ cloud4:/data/zookeeper-3.4.10/

Step 7 : Configure the ID of each node in the cluster

on cloud1

echo "1" >> /data/hdfs/zookeeper/myid

on cloud2

echo "2" >> /data/hdfs/zookeeper/myid

on cloud3

echo "3" >> /data/hdfs/zookeeper/myid

on cloud4

echo "4" >> /data/hdfs/zookeeper/myid

Step 8 : Start all instances with the command

./bin/zkServer.sh start

Run the jps command to check that the QuorumPeerMain process is running.
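
Besides jps, you can ask each instance whether it is currently the leader or a follower:

./bin/zkServer.sh status

The output contains a line such as "Mode: leader" or "Mode: follower"; exactly one node in the cluster should report leader.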

Done. If you have any questions about setting up a Zookeeper cluster, please contact me (Facebook or LinkedIn) and we can discuss it. Thanks for reading.

 

 

Zookeeper, a coordination service for distributed systems :)

Zookeeper is a very powerful open-source project for distributed applications. It is very popular in the Hadoop ecosystem, and I first learned about it when I built the cloud message service for AZStack-SDK: we used HBase + Hadoop + Zookeeper to build a cluster that stores messages and queries over a hundred GB of data. Hadoop and HBase are distributed applications, with Hadoop running multiple namenodes and datanodes and HBase running regionservers, and Zookeeper is a really excellent coordinator for them. After using Zookeeper for cloud messaging, I started to learn more about it and found it really cool. In fact, the way information in Zookeeper is organized is quite similar to a file system: at the top there is a root, simply referred to as /, and below the root are znodes.

[Figure: the Zookeeper namespace, organized like a file system]
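
As a small illustration (the znode paths and data here are made up, not from a real system), the zkCli.sh shell that ships with Zookeeper lets you browse and create znodes much like paths in a file system:

./bin/zkCli.sh -server localhost:2181
ls /
create /app1 app1-config
create /app1/node1 10.0.0.1
get /app1/node1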

Unlike an ordinary distributed file system, Zookeeper supports the concepts of ephemeral znodes and sequential znodes. An ephemeral znode is a node that disappears when the session of its owner ends. In a distributed application, every server can publish its public IP in an ephemeral node; when a server loses connectivity with Zookeeper and fails to reconnect within the session timeout, all information about that server is deleted. The figure below illustrates how ephemeral nodes can be used to manage distributed services.

 

[Figure: using ephemeral znodes to manage distributed services]
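
A hedged sketch of this pattern with zkCli.sh (the paths and addresses are only examples): a server registers itself as an ephemeral znode, and Zookeeper removes that znode automatically when the server's session ends.

create /services servicelist
create -e /services/server1 10.3.0.100
ls /services

If server1 disconnects and does not reconnect within the session timeout, /services/server1 disappears and anyone watching /services can react to it.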

 

Sequential znodes are nodes whose names are automatically given a sequence-number suffix. The suffix auto-increments and is assigned by Zookeeper when the znode is created. An easy way of doing leader election with Zookeeper is to let every server publish its information in a znode that is both sequential and ephemeral; then whichever server has the lowest sequential znode is the leader. If the leader or any other server goes offline, its session dies and its ephemeral node is removed, and all other servers can observe who the new leader is.
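
For example, a minimal leader-election sketch in zkCli.sh terms (the paths and data are illustrative, not from the post):

create /election election
create -s -e /election/candidate_ 10.3.0.100
ls /election

Zookeeper appends an auto-incrementing counter, so the created node is named something like candidate_0000000000; the server that owns the candidate znode with the lowest number is the leader, and the others watch it.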

How does a Zookeeper cluster work with clients?

A Zookeeper cluster has two types of nodes (leader and followers), and client connections are load-balanced across the cluster: every Zookeeper instance can serve many clients. If a client sends a read request to a follower instance, the follower responds directly from its local data; however, when a client sends a write request, the request is forwarded to the leader, and the leader broadcasts the update to all nodes so the data is also updated on the follower instances.
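
In practice a client can connect to any instance and does not need to know which one is the leader. A small illustration with zkCli.sh (the hostnames reuse the example cluster above; the znode is hypothetical):

./bin/zkCli.sh -server cloud3:2181
get /app1
set /app1 new-data

The get is answered by cloud3 from its local copy, while the set is forwarded to the leader and then replicated back to all followers.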

 

[Figure: how read and write requests flow through a Zookeeper cluster]

The main objective of this post was to give an overview of Zookeeper. In the next post I will give instructions on how to install, configure, and use Zookeeper in a distributed application. Thank you for reading.

 

Build Hadoop 2.7 from source on CentOS step by step

Hadoop is one of the best open-source projects for storing and processing big data. It has a lot of support from the community, and many big companies have used it in their products. In my company, the Hadoop ecosystem is used to store chat messages and log information; it is very effective, but it requires a lot of server resources such as RAM, CPU, and disk. If your product is a small system, you should consider carefully whether to use it.

OK, let's find the answer to the question "How to build Hadoop from source?"

Step 1 : First, disable SELinux
sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
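
Note that the sed command above only disables SELinux, and only after a reboot; setenforce 0 turns it off for the current session. If you also want to stop the local firewall on CentOS 7, that would typically be:

setenforce 0
systemctl stop firewalld
systemctl disable firewalld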

Step 2 : Download the JDK and set up the environment
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u45-b14/jdk-8u45-linux-x64.tar.gz
tar -xzf jdk-8u45-linux-x64.tar.gz -C /opt/

Step 3 : Create the user and group that will run the Hadoop services
groupadd hadgroup
useradd haduser -G hadgroup
passwd haduser

Step 4 : Create an SSH key for authentication between the servers
ssh-keygen
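
To actually get passwordless login between the servers, the public key still has to be copied to each node as the haduser account, for example (the hostname is a placeholder):

ssh-copy-id haduser@node2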

Step 5 : Install development tools and libraries
yum groupinstall "Development Tools" "Development Libraries"
yum install openssl-devel cmake

Step 6 : Install Maven to build Hadoop (source)
tar -zxf apache-maven-3.3.9-bin.tar.gz -C /opt/

Step 7 : Set up the Maven environment
export JAVA_HOME=/opt/jdk1.8.0_45
export M3_HOME=/opt/apache-maven-3.3.9
export PATH=/opt/apache-maven-3.3.9/bin:$PATH
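
A quick sanity check that the toolchain is in place (not part of the original steps):

$JAVA_HOME/bin/java -version
mvn -version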

Step 8 : Build Protobuf (source)
tar -xzf protobuf-2.5.0.tar.gz -C /root
cd /root/protobuf-2.5.0
./configure
make
make install
sudo ldconfig
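
Hadoop 2.7 requires Protobuf 2.5.0 exactly, so it is worth verifying the installed version before building:

protoc --version

This should print libprotoc 2.5.0.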

Step 9 : Download the Hadoop source and build it (source)
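(If the source tarball is not already on the machine, it can be fetched from the Apache archive, for example:)

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1-src.tar.gz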
tar -xvf hadoop-2.7.1-src.tar.gz
cd hadoop-2.7.1-src
mvn package -Pdist,native -DskipTests -Dtar -Dmaven.javadoc.skip=true -Dmaven.javadoc.failOnError=false

Step 10 : Move the build to a new folder
mv hadoop-2.7.1-src/hadoop-dist/target/hadoop-2.7.1 /opt/
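
To confirm the build, you can run the hadoop binary from its new location (with JAVA_HOME still exported as in step 7):

/opt/hadoop-2.7.1/bin/hadoop version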

Done, and now you have Hadoop built at the path /opt/hadoop-2.7.1.
In the next post, I will write about how to set up Hadoop as a cluster. Thank you!