Blog

batch processing, Big Data, continuous flow operator, Flink, Hadoop, micro-batch model, Real-time processing, Spark

Big Data Computational Models

Most distributed processing engines support one or more of three computational models: batch, micro-batch, and continuous flow operator.
  • Batch model: The batch model processes data at rest; it takes in a large amount of data at once, processes it, and then writes the output to a file system or data store.
  • Micro-batch model: Micro-batching combines aspects of both the batch and the continuous flow operator models. Data is collected over a short, fixed time interval and then processed as a small batch. Micro-batching is essentially a “collect and then process” kind of computational model.
  • Continuous flow operator model: This model processes each record as it arrives, without waiting to collect it into a batch.
To better understand the difference between the micro-batch and continuous flow models, assume you want to supply water to a tank on the roof of a building. There are two ways to do this: you can first collect the water in a basement tank and pump it up to the rooftop tank in batches, or you can pipe the water directly to the rooftop tank as it arrives. That is essentially the difference between the micro-batch and continuous flow operator models.
batch processing, Difference between Spark and Flink, Flink, Flink CEP, Real-time processing, Record based windowing, Spark, Spark vs Flink

Apache Spark vs Apache Flink

In the last few years, real-time processing has become one of the hot topics in the market and has gained a lot of value. Emerging tools that support real-time processing include Apache Storm, Apache Spark, and Apache Flink. In this blog I am going to compare different features of Spark and Flink.
Feature-by-feature comparison of Apache Spark and Flink:
Exactly-once semantics:
Both Spark Streaming and Flink provide an exactly-once guarantee, which means that every record is processed exactly once, eliminating duplicate processing of data.
High Throughput and Fault Tolerance overhead:
Both Spark Streaming and Flink provide very high throughput compared to other processing systems such as Storm. The overhead of fault tolerance is also low in both engines.
Computational Model:
Where Spark Streaming and Flink differ most is in their computational model: Spark has adopted the micro-batch model, whereas Flink has adopted a continuous flow, operator-based streaming model.
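The difference shows up even in a minimal word-count sketch. The snippets below are illustrative only, assuming the Spark 1.x DStream API and the Flink 1.x Scala DataStream API with a hypothetical socket source on localhost:9999: Spark Streaming collects records into fixed micro-batches before processing them, while Flink pushes each record through the operator pipeline as soon as it arrives.

 // Spark Streaming: records are grouped into 2-second micro-batches before processing
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 val conf = new SparkConf().setAppName("MicroBatchWordCount").setMaster("local[2]")
 val ssc  = new StreamingContext(conf, Seconds(2))   // batch interval = micro-batch size
 ssc.socketTextStream("localhost", 9999)
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .print()
 ssc.start()
 ssc.awaitTermination()

 // Flink: each record flows through the operators as soon as it arrives, with no batching step
 import org.apache.flink.streaming.api.scala._

 val env = StreamExecutionEnvironment.getExecutionEnvironment
 env.socketTextStream("localhost", 9999)
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .keyBy(0)
    .sum(1)
    .print()
 env.execute("ContinuousWordCount")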
Data Windowing/Batching: 

Spark supports only time-based window criteria, whereas Flink supports windows over time, record counts, sessions, and data-driven or custom user-defined window criteria.
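As a rough illustration (assuming the Flink 1.x Scala DataStream API and a hypothetical socket source), the same keyed stream can be windowed by time, which is the only criterion Spark Streaming offers, or by record count:

 import org.apache.flink.streaming.api.scala._
 import org.apache.flink.streaming.api.windowing.time.Time

 val env   = StreamExecutionEnvironment.getExecutionEnvironment
 val pairs = env.socketTextStream("localhost", 9999).map(word => (word, 1))

 // time-based window: aggregate each key over 30-second windows
 val byTime  = pairs.keyBy(0).timeWindow(Time.seconds(30)).sum(1)
 // count-based window: aggregate every 100 records per key, independent of time
 val byCount = pairs.keyBy(0).countWindow(100).sum(1)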

Stream Splitting: 
Flink has a direct API for splitting an input DataStream into multiple streams, whereas Spark Streaming does not provide one.
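For example, Flink's split/select API (a minimal sketch assuming the Flink 1.x Scala DataStream API) tags each record with one or more names and then pulls out the named sub-streams:

 import org.apache.flink.streaming.api.scala._

 val env     = StreamExecutionEnvironment.getExecutionEnvironment
 val numbers = env.fromElements(1, 2, 3, 4, 5, 6)

 // split() assigns names to each record; select() extracts the named sub-streams
 val splitStream = numbers.split(n => if (n % 2 == 0) Seq("even") else Seq("odd"))
 val evens = splitStream.select("even")
 val odds  = splitStream.select("odd")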
 
Complex Event Processing:
Flink ships with a complex event processing (CEP) API and supports event time and out-of-order events, while Spark Streaming does not have these capabilities.
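To illustrate the event-time side of this, the sketch below (assuming the Flink 1.x DataStream API and a hypothetical Event case class that carries its own timestamp) tells Flink to order processing by the timestamps inside the records and to tolerate events arriving up to five seconds late:

 import org.apache.flink.streaming.api.scala._
 import org.apache.flink.streaming.api.TimeCharacteristic
 import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
 import org.apache.flink.streaming.api.windowing.time.Time

 case class Event(id: String, timestamp: Long)   // hypothetical event type

 val env = StreamExecutionEnvironment.getExecutionEnvironment
 env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

 // "b" arrives after "a" but carries an earlier timestamp, i.e. it is out of order
 val events = env.fromElements(Event("a", 1000L), Event("b", 500L))

 // use the timestamp embedded in each event and accept events up to 5 seconds late
 val ordered = events.assignTimestampsAndWatermarks(
   new BoundedOutOfOrdernessTimestampExtractor[Event](Time.seconds(5)) {
     override def extractTimestamp(e: Event): Long = e.timestamp
   })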
Memory Management: 
Spark provides configurable memory management, while Flink provides automatic memory management out of the box. Spark has also moved towards automated memory management with the unified memory manager introduced in Spark 1.6.0.
install solr, run solr, setup solr, solr cluster, solr cluster setup, solr installation steps, SolrCloud

Setup SolrCloud

This post explains how to install a Solr cluster (SolrCloud). The steps below install and run the Solr cluster; after the cluster is installed, you can create indexes, aliases, and fields by providing the solrconfig.xml and schema.xml files.
Perform the below steps on all nodes where you want to install Solr.
 1. Download the Solr bundle from http://archive.apache.org/dist/lucene/solr/4.7.2/solr-4.7.2.zip using the below command:
 wget http://archive.apache.org/dist/lucene/solr/4.7.2/solr-4.7.2.zip
2. Unzip the downloaded bundle:
 unzip solr-4.7.2.zip
3. Create a solr home directory to store the solr data:
 mkdir -p solr/home/
4. Move to the solr home directory and create a file named “solr.xml”:

 cd solr/home/
Create the solr.xml file and add the following content to it (this matches the stock solr.xml shipped with the Solr 4.7.2 example, with the property placeholders shown below):

 <?xml version="1.0" encoding="UTF-8" ?>
 <solr>
   <solrcloud>
     <str name="host">${host:}</str>
     <int name="hostPort">${jetty.port:}</int>
     <str name="hostContext">${hostContext:solr}</str>
     <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
     <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
   </solrcloud>
   <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
     <int name="socketTimeout">${socketTimeout:0}</int>
     <int name="connTimeout">${connTimeout:0}</int>
   </shardHandlerFactory>
 </solr>
5. Now move to the example directory available inside the extracted solr folder:
 cd solr-4.7.2/example/
6. Before starting Solr, you should have a running ZooKeeper cluster; Solr needs the ZooKeeper IP addresses and ports to start. Start Solr with the below command:
 nohup java -Xms128M -Xmx1024M -Dsolr.solr.home=solr/home/ -Djetty.port=8983 -DzkHost=zkHost1:2181,zkHost2:2181,zkHost3:2181 -DzkClientTimeout=10000 -DsocketTimeout=1000 -DconnTimeout=1000 -DhostContext=/solr -jar start.jar &
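Once Solr is running on all nodes, you can create a collection, as mentioned in the introduction, by uploading your solrconfig.xml and schema.xml to ZooKeeper and then calling the Collections API. The commands below are only a sketch: the config directory path, config name, collection name, and shard/replica counts are placeholders, and zkcli.sh is assumed to sit under example/scripts/cloud-scripts/ as in the stock 4.7.2 bundle.
 ./scripts/cloud-scripts/zkcli.sh -zkhost zkHost1:2181 -cmd upconfig -confdir /path/to/conf -confname myconf
 curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&collection.configName=myconf"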
compression, Hadoop, hadoop-lzo compression, Hbase, Lzo compression, lzo compression in hadoop, lzo compression in hbase

LZO Compression in Hadoop and HBase

LZO’s licence (GPL) is incompatible with Hadoop’s (Apache), so LZO has to be installed separately on the cluster to enable LZO compression in Hadoop and HBase. LZO is a splittable compression format and provides high compression and decompression speed.
Perform the below steps to enable the LZO compression in Hadoop and HBase:
1. Install the LZO development packages:
 sudo yum install lzo lzo-devel
2. Download the latest hadoop-lzo release using the below command:
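The exact download location may vary; the command below assumes the hadoop-lzo release archive hosted on GitHub, which matches the release-0.4.17.zip file unzipped in the next step.
 # URL assumed: GitHub archive of the hadoop-lzo release-0.4.17 tag
 wget https://github.com/twitter/hadoop-lzo/archive/release-0.4.17.zip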
3. Unzip the downloaded bundle:
 unzip release-0.4.17.zip
4. Change the current directory to the extracted folder:
cd hadoop-lzo-release-0.4.17
5. Run the below command to generate the native libraries:

 ant compile-native
6. Copy the generated jar and native libraries to the Hadoop and HBase lib directories:
 cp build/hadoop-lzo-0.4.17.jar $HADOOP_HOME/lib/
 cp build/hadoop-lzo-0.4.17.jar $HBASE_HOME/lib/
 cp build/hadoop-lzo-0.4.17/lib/native/Linux-amd64-64/* $HADOOP_HOME/lib/native/
 cp build/hadoop-lzo-0.4.17/lib/native/Linux-amd64-64/* $HBASE_HOME/lib/native/
7. Add the following properties to Hadoop's core-site.xml file:

 
 <property>
     <name>io.compression.codecs</name>
     <value>
         org.apache.hadoop.io.compress.DefaultCodec,
         org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.BZip2Codec,
         org.apache.hadoop.io.compress.DeflateCodec,
         org.apache.hadoop.io.compress.SnappyCodec,
         org.apache.hadoop.io.compress.Lz4Codec,
         com.hadoop.compression.lzo.LzoCodec,
         com.hadoop.compression.lzo.LzopCodec
     </value>
 </property>
 <property>
     <name>io.compression.codec.lzo.class</name>
     <value>com.hadoop.compression.lzo.LzoCodec</value>
 </property>

8. Sync the Hadoop and HBase home directories to all nodes of the Hadoop and HBase clusters (one rsync invocation per destination node):
 rsync -avz $HADOOP_HOME/ node1:$HADOOP_HOME/
 rsync -avz $HADOOP_HOME/ node2:$HADOOP_HOME/
 rsync -avz $HBASE_HOME/ node1:$HBASE_HOME/
 rsync -avz $HBASE_HOME/ node2:$HBASE_HOME/
9. Add the HADOOP_OPTS variable to the .bashrc file on all Hadoop nodes:
 export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native:$HADOOP_HOME/lib/"
10. Add the HBASE_OPTS variable to the .bashrc file on all HBase nodes:
 export HBASE_OPTS="-Djava.library.path=$HBASE_HOME/lib/native/:$HBASE_HOME/lib/"
11. Verify the LZO compression in Hadoop:
a. Create an LZO-compressed file using the lzop utility. The below command creates a compressed file from the LICENSE.txt file available inside the HADOOP_HOME directory:
lzop LICENSE.txt
b. Copy the generated LICENSE.txt.lzo file to the / (root) HDFS path using the below command:
bin/hadoop fs -copyFromLocal LICENSE.txt.lzo /
c. Index the LICENSE.txt.lzo file in HDFS using the below command:
bin/hadoop jar lib/hadoop-lzo-0.4.17.jar com.hadoop.compression.lzo.LzoIndexer /LICENSE.txt.lzo
Once you execute the above command, you will see the below output on the console. You can also verify the index file creation in the HDFS browser of the Hadoop UI.
14/12/20 14:04:05 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
14/12/20 14:04:05 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev c461d77a0feec38a2bba31c4380ad60084c09205]
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/repo/hadoop-2.4.1/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It’s highly recommended that you fix the library with ‘execstack -c ‘, or link it with ‘-z noexecstack’.
14/12/20 14:04:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
14/12/20 14:04:08 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /LICENSE.txt.lzo, size 0.00 GB…
14/12/20 14:04:08 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
14/12/20 14:04:09 INFO lzo.LzoIndexer: Completed LZO Indexing in 0.61 seconds (0.01 MB/s).  Index size is 0.01 KB.
12. Verify the LZO compression in HBase:
You can verify LZO compression in HBase by creating a table with LZO compression from the HBase shell.
a. Create a table with LZO compression using the below command:
create 't1', { NAME => 'f1', COMPRESSION => 'lzo' }
b. Verify the compression type of the table using the describe command:
describe 't1'
Once you execute the above command, you will see the below console output. The LZO compression for the table can also be verified on the HBase UI.
 DESCRIPTION                                                                    ENABLED
 't1', {NAME => 'f1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW',      true
 REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'LZO',
 MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'false',
 BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.8250 seconds
ElasticSearch, elasticsearch installation, elasticsearch.yaml, ES installation, indexing

ElasticSearch Installation

This post explains the installation steps for an ElasticSearch (1.3.4) cluster on Linux machines. ElasticSearch requires Java to be installed on the machines; if Java is not available, install it first.
After installing Java on the machines, follow the below steps to install ElasticSearch.

1. Download ElasticSearch: Download the ElasticSearch bundle from the website and extract the bundle using the below commands. 

wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.3.4.zip 
unzip elasticsearch-1.3.4.zip


2. Configure ElasticSearch: To configure ElasticSearch, add the following properties to the elasticsearch.yml file, which is available inside the config directory of the extracted directory.

a. cluster.name: The name of the ElasticSearch cluster; it must be the same across all nodes. This property is responsible for joining all the nodes into one cluster.
b. node.name: The name of the node; it should be unique across all the nodes.
c. path.data: The path where ElasticSearch stores the index data.
d. action.auto_create_index: Restricts the automatic creation of indexes; it accepts a true/false value.
e. index.number_of_shards: Sets the number of shards for an index.
f. index.number_of_replicas: Sets the number of replicas for an index.
g. bootstrap.mlockall: Locks the process memory to prevent swapping, for better ElasticSearch performance.


3. Start ElasticSearch: To start ElasticSearch, run the below command from the bin directory of the extracted directory. The -d flag runs ElasticSearch in daemon mode.

./elasticsearch -d

ElasticSearch Cluster Setup on two nodes:

To install ElasticSearch as a cluster, perform the above three steps on every node where you want the cluster to run. As an example, let's configure a two-node ElasticSearch cluster.

ElasticSearch configuration on node1 in the elasticsearch.yml file:
cluster.name: elasticsearch_cluster
node.name: node1
path.data: /home/$username/elasticsearch-1.3.4/data_dir
action.auto_create_index: false
index.number_of_shards: 2
index.number_of_replicas: 1
bootstrap.mlockall: false

ElasticSearch configuration on node2 in the elasticsearch.yml file:

cluster.name: elasticsearch_cluster
node.name: node2
path.data: /home/$username/elasticsearch-1.3.4/data_dir
action.auto_create_index: false
index.number_of_shards: 2
index.number_of_replicas: 1
bootstrap.mlockall: false

Now start ElasticSearch on both nodes using the command mentioned in the 3rd step.
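To confirm that both nodes have joined the cluster, you can query the cluster health API (assuming the default HTTP port 9200 and the cluster configured above); the number_of_nodes field should report 2:

curl -XGET 'http://localhost:9200/_cluster/health?pretty'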

configure the embedded tomcat, embedded, embedded tomcat, embedded tomcat7, embedding tomcat, tomcat maven plugin

Leveraging Embedded Tomcat in Maven Application

In Java, when we develop a web application, we usually install Tomcat on the target system and then deploy the application on Tomcat to run it and access it from the browser.
How about developing an application that can run directly on the target system without installing Tomcat? Is that really achievable?
Yes, this can easily be achieved with a Maven plugin. Let us see how!
You might be puzzled as to how it is possible to run a web application without installing a web container. Let us quickly look at the procedure, which is pretty simple.
To accomplish this, you just need to add a Maven plugin to the application's pom.xml file. The plugin embeds Tomcat inside the application, and on build it generates an executable jar file that can be run with the java -jar command, after which you can access the application in the browser.
Please follow the below detailed step-by-step procedure:
1. Add the following plugin to your pom.xml file:
<plugin>
    <groupId>org.apache.tomcat.maven</groupId>
    <artifactId>tomcat7-maven-plugin</artifactId>
    <version>2.1</version>
    <executions>
        <execution>
            <id>tomcat-run</id>
            <goals>
                <goal>exec-war-only</goal>
            </goals>
            <phase>package</phase>
            <configuration>
                <path>/test</path>
                <attachArtifactClassifier>exec-war</attachArtifactClassifier>
                <attachArtifactClassifierType>jar</attachArtifactClassifierType>
            </configuration>
        </execution>
    </executions>
</plugin>
2. Build your application: Build your application using the install or package goal. This generates three extra files in the application's target folder.
    a. yourappname-version-exec-war.jar: An executable jar file with embedded Tomcat.
    b. war-exec.manifest: A manifest file containing the main class name.
    c. war-exec.properties: A properties file containing the Tomcat information and some configuration options.
3. Run your application: Once the above three files are generated, execute the below command to run the application.
 java -jar yourappname-version-exec-war.jar

Hadoop, Hadoop and Ganglia integration, Hadoop Metrics, Hadoop Monitoring, hadoop-metrics.properties

Hadoop Monitoring Using Ganglia

This blog is about monitoring Hadoop metrics such as DFS, MAPRED, JVM, RPC, and UGI using the Ganglia monitoring tool. I assume that readers of this blog have prior knowledge of Ganglia and Hadoop technology. To integrate Ganglia with Hadoop, you need to configure the Hadoop metrics properties file located inside the Hadoop conf folder. In this configuration file you configure the address of the Ganglia gmetad server, the period for sending metrics data, and the Ganglia context class name. The format and name of the Hadoop metrics properties file differ across Hadoop versions.
  • For Hadoop 0.20.x, 0.21.0 and 0.22.0 versions, the file name is hadoop-metrics.properties.
  • For Hadoop 1.x.x and 2.x.x versions, the file name is hadoop-metrics2.properties
The Ganglia context class name also differs across Ganglia versions; for detailed information about the Ganglia context class, you can refer to GangliaContext.
Procedure for configuring the Hadoop metrics properties file:
1. Configuration for 2.x.x versions: In these Hadoop versions the metrics properties file is located inside the $HADOOP_HOME/etc/hadoop/ folder. Configure the hadoop-metrics2.properties file as shown below:
namenode.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
namenode.sink.ganglia.period=10
namenode.sink.ganglia.servers=gmetad_server_ip:8649

datanode.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
datanode.sink.ganglia.period=10
datanode.sink.ganglia.servers=gmetad_server_ip:8649

resourcemanager.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
resourcemanager.sink.ganglia.period=10
resourcemanager.sink.ganglia.servers=gmetad_server_ip:8649

nodemanager.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
nodemanager.sink.ganglia.period=10
nodemanager.sink.ganglia.servers=gmetad_server_ip:8649

2. Configuration for 1.x.x versions: In these Hadoop versions the metrics properties file is located inside the $HADOOP_HOME/conf/ folder. Configure the hadoop-metrics2.properties file as shown below:
namenode.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
namenode.sink.ganglia.period=10
namenode.sink.ganglia.servers=gmetad_server_ip:8649

datanode.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
datanode.sink.ganglia.period=10
datanode.sink.ganglia.servers=gmetad_server_ip:8649

jobtracker.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
jobtracker.sink.ganglia.period=10
jobtracker.sink.ganglia.servers=gmetad_server_ip:8649

tasktracker.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
tasktracker.sink.ganglia.period=10
tasktracker.sink.ganglia.servers=gmetad_server_ip:8649

maptask.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
maptask.sink.ganglia.period=10
maptask.sink.ganglia.servers=gmetad_server_ip:8649
reducetask.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
reducetask.sink.ganglia.period=10
reducetask.sink.ganglia.servers=gmetad_server_ip:8649
3. Configuration for 0.20.x, 0.21.0 and 0.22.0 versions: In these Hadoop versions the metrics properties file is located inside the $HADOOP_HOME/conf/ folder. Configure the hadoop-metrics.properties file as shown below:
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=gmetad_server_ip:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=gmetad_server_ip:8649

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=gmetad_server_ip:8649

rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=gmetad_server_ip:8649

ugi.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
ugi.period=10
ugi.servers=gmetad_server_ip:8649
The above configuration is for the unicast mode of Ganglia. If you are running Ganglia in multicast mode, use the multicast address in place of gmetad_server_ip in the configuration file. Once you have applied the above changes, restart the gmetad and gmond services of Ganglia on the nodes, and also restart the Hadoop services if they are running. After the services are restarted, the Ganglia UI displays the Hadoop graphs. Initially the Ganglia UI does not show graphs for jobs; they appear only after a job is submitted to Hadoop.

Please feel free to post your comments or log any queries as required.