Tuesday, July 30, 2013

Building Hadoop from source

The Hadoop build process from the source tree

STEP 1:  Install all required packages for your OS; I am using Ubuntu 13.04
                 * Unix System
                 * JDK 1.6
                 * Maven 3.0
                 * Findbugs 1.3.9 (if running findbugs)
                 * ProtocolBuffer 2.4.1+ (for MapReduce and HDFS)
                 * CMake 2.6 or newer (if compiling native code) 
                 * Internet connection for first build (to fetch all Maven and Hadoop dependencies)

STEP 2: Download the source from the Git repo, as it is the most suitable for nightly builds
              Git Repo: git://git.apache.org/hadoop-common.git

STEP 3: Build hadoop-maven-plugin
               cd hadoop-maven-plugin
               mvn install
               cd ..
STEP 4:  Optionally build the native libs
               mvn compile -Pnative
STEP 5:  Optionally run Findbugs (needed if you want to create the site)
                mvn findbugs:findbugs
STEP 6:  Building distributions:

               Create binary distribution without native code and without documentation:
               $ mvn package -Pdist -DskipTests -Dtar

               Create binary distribution with native code and with documentation:
               $ mvn package -Pdist,native,docs -DskipTests -Dtar

               Create source distribution:
               $ mvn package -Psrc -DskipTests

               Create source and binary distributions with native code and documentation:
               $ mvn package -Pdist,native,docs,src -DskipTests -Dtar

               Create a local staging version of the website (in /tmp/hadoop-site)
               $ mvn clean site; mvn site:stage -DstagingDirectory=/tmp/hadoop-site

    

Reference: GitHub

Monday, July 29, 2013

Hadoop Compression

      As Hadoop is a distributed framework for processing large amounts of data, possibly terabytes or petabytes, it stores that data in HDFS (its equivalent of Google's GFS) and lets us process it through various means like MapReduce, HBase, Hive, and Pig.
     But this processing incurs a lot of network I/O, so developers need to minimise data transfer. Hadoop provides several compression codecs, namely gzip, bzip2, LZO, LZ4, Snappy, and DEFLATE. Except for bzip2, Hadoop has native implementations of these, which can account for a performance gain. So if you are running a 64-bit machine it is recommended to build the native libraries from source for your platform, since as of now Hadoop does not ship with 64-bit native libraries. If you do not have the native libraries, Hadoop provides Java implementations of bzip2, DEFLATE, and gzip.
    But compression has a trade-off: it costs CPU time to compress and decompress at the source and the destination. So we have to choose an algorithm by weighing the disk-space and CPU-time requirements, and we also need to consider how compression affects processing, i.e. whether the format supports splitting.
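For example, here is a minimal sketch (Hadoop 2.x MapReduce API, assuming the Snappy native codec is available; the class and job names are just placeholders) of turning compression on for both the intermediate map output and the final job output:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressedJobFactory {
        public static Job newCompressedJob(Configuration conf) throws Exception {
            // Compress the intermediate (map -> reduce) data to cut shuffle network I/O.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compressed-output");
            // Compress the final job output to save HDFS space.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
            return job;
        }
    }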

How do we choose Compression?

  1. We can use a high-performance codec like Snappy, LZ4, or LZO together with a container format like SequenceFile, RCFile, or Avro (see the sketch after this list). These containers logically separate the stored data into blocks and therefore support both compression and splitting. HBase recommends Snappy, as it is the fastest of all and HBase serves near-real-time access patterns.
  2. Use a codec that supports splitting (bzip2), or one that can be indexed, like LZO. This allows each split to be processed separately.
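
Here is the sketch mentioned in point 1: writing a block-compressed SequenceFile with Snappy, so the data stays compressed yet splittable. It assumes the Snappy native library is installed; the path and key/value types are just placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class SnappySequenceFileWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("/tmp/events.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class),
                    // BLOCK compression groups many records per compressed block,
                    // so the file remains splittable even with a non-splittable codec.
                    SequenceFile.Writer.compression(
                            SequenceFile.CompressionType.BLOCK, new SnappyCodec()));
            try {
                writer.append(new Text("clicks"), new IntWritable(42));
            } finally {
                writer.close();
            }
        }
    }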


 
Note:
  •  Fast storage like SSDs is more expensive than traditional HDDs, and HDDs become more cost-effective as capacity grows. So it is usually better to choose the algorithm with the fastest compression and decompression rather than the one that merely saves the most space.
  •  SequenceFile format: we can use the "mapreduce.output.fileoutputformat.compress.type" property to control the compression type; by default it is per record (RECORD). It also supports block compression (BLOCK), which is preferred over record compression, and the block size can be set in bytes with "io.seqfile.compress.blocksize" (see the sketch below).
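
A minimal sketch of the SequenceFile note above, showing the same settings through the SequenceFileOutputFormat API and the raw property name (the class name and block size are placeholders):

    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class SeqFileOutputConfig {
        public static void configure(Job job) {
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            // Equivalent to mapreduce.output.fileoutputformat.compress.type=BLOCK.
            SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
            // Bytes buffered before each block is compressed and flushed.
            job.getConfiguration().setInt("io.seqfile.compress.blocksize", 1000000);
        }
    }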





Monday, July 15, 2013

Hadoop Testing Strategies

    The importance of an effective testing strategy can be understood from the fact that developers normally spend more time testing and debugging than actually coding. The same holds for Hadoop, but Hadoop, like any distributed system (R, MPI) or even a multi-threaded application, comes with its own problems, because we cannot capture the state of a distributed environment at run time. This forces us to develop even more powerful testing, but here we will talk only about Hadoop.

    ClusterMapReduceTestCase:  The Hadoop community has put a lot of effort into this very problem by developing MiniMRCluster and MiniDFSCluster for testing purposes. These allow developers to test all aspects of Hadoop applications, such as MapReduce, YARN applications, HBase, etc., just like in a real environment. But there is one big disadvantage: it takes about 10-odd seconds to set up the test cluster, which may discourage many developers from testing all functionality.
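For illustration, a minimal JUnit sketch (assuming the hadoop-hdfs test artifact is on the classpath) that starts a one-DataNode MiniDFSCluster; the start-up in @Before is exactly the slow part mentioned above:

    import static org.junit.Assert.assertTrue;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.MiniDFSCluster;
    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    public class MiniDfsClusterTest {
        private MiniDFSCluster cluster;

        @Before
        public void startCluster() throws Exception {
            Configuration conf = new Configuration();
            // Starts an in-process NameNode plus one DataNode; this is the slow part.
            cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
            cluster.waitActive();
        }

        @Test
        public void canCreateDirectoryOnMiniCluster() throws Exception {
            FileSystem fs = cluster.getFileSystem();
            assertTrue(fs.mkdirs(new Path("/test")));
        }

        @After
        public void stopCluster() {
            if (cluster != null) {
                cluster.shutdown();
            }
        }
    }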

    MRUnit:  To address this, Cloudera developed the MRUnit framework, which is now an Apache TLP (Top-Level Project) at version 1.0.0. MRUnit bridges the gap between MapReduce and JUnit by providing interfaces for testing MapReduce jobs. We can use the functionality provided by MRUnit inside JUnit tests to test a Mapper, Reducer, Driver, or Combiner, and with PipelineMapReduceDriver we can test a workflow consisting of a series of MapReduce jobs. Currently it does not allow testing a partitioner, and it is not well suited to testing the whole infrastructure of a MapReduce job, since a job may contain custom input/output formats, custom record readers/writers, data serialization, etc.
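A minimal MRUnit 1.0.x sketch using the new-API MapDriver; WordCountMapper here is a hypothetical Mapper<LongWritable, Text, Text, IntWritable> that emits (token, 1) for every token in the line:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Before;
    import org.junit.Test;

    public class WordCountMapperTest {
        private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

        @Before
        public void setUp() {
            // WordCountMapper is a hypothetical mapper in the same package.
            mapDriver = MapDriver.newMapDriver(new WordCountMapper());
        }

        @Test
        public void emitsOneCountPerToken() throws Exception {
            mapDriver.withInput(new LongWritable(0), new Text("hadoop hadoop"))
                     .withOutput(new Text("hadoop"), new IntWritable(1))
                     .withOutput(new Text("hadoop"), new IntWritable(1))
                     .runTest();
        }
    }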

    LocalJobRunner:   Hadoop comes with the LocalJobRunner, which runs in a single JVM and allows testing of the complete Hadoop stack.
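A minimal sketch of pointing a job at the LocalJobRunner and the local filesystem, using the Hadoop 2.x property names (older releases use mapred.job.tracker and fs.default.name instead); the class name is a placeholder:

    import org.apache.hadoop.conf.Configuration;

    public class LocalRunnerConfig {
        public static Configuration localConf() {
            Configuration conf = new Configuration();
            conf.set("mapreduce.framework.name", "local"); // run in-process, single JVM
            conf.set("fs.defaultFS", "file:///");          // read/write the local filesystem
            return conf;
        }
    }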

    Mockito:   All the frameworks above test the different components by creating a test environment, which makes testing slow; most such tests are functional or end-to-end tests. For efficiency we can use a mocking framework like Mockito to mock out the collaborators of a MapReduce job and unit-test its logic directly.
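A minimal Mockito sketch that drives the same hypothetical WordCountMapper directly against a mocked Context and verifies what it wrote, with no cluster or LocalJobRunner involved (it assumes the mapper overrides map() with public visibility):

    import static org.mockito.Mockito.mock;
    import static org.mockito.Mockito.times;
    import static org.mockito.Mockito.verify;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.junit.Test;

    public class WordCountMapperMockTest {
        @Test
        public void writesOneCountPerToken() throws Exception {
            WordCountMapper mapper = new WordCountMapper();   // hypothetical mapper
            // Mock the Context instead of standing up any cluster.
            WordCountMapper.Context context = mock(WordCountMapper.Context.class);

            mapper.map(new LongWritable(0), new Text("hadoop hadoop"), context);

            // The mapper should have written ("hadoop", 1) twice.
            verify(context, times(2)).write(new Text("hadoop"), new IntWritable(1));
        }
    }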

    Note: Hadoop itself now also uses the Mockito framework for testing: Hadoop JIRA.

    JUnit tests: Other non-MapReduce functionality, like parsing inputs, can be tested separately with plain JUnit (see the sketch below).
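A minimal plain-JUnit sketch; RecordParser, its parse() method, and the Record accessors are hypothetical stand-ins for whatever input-parsing logic the job uses:

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class RecordParserTest {
        @Test
        public void parsesTabSeparatedLine() {
            RecordParser parser = new RecordParser();          // hypothetical parser class
            RecordParser.Record record = parser.parse("hadoop\t42");
            assertEquals("hadoop", record.getWord());
            assertEquals(42, record.getCount());
        }
    }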






   

Wednesday, July 10, 2013

Switching Maven settings between multiple environments

A practical use case: if we are working with multiple software vendors, we might need to
switch our Maven settings to a customer's configuration and back again.
This can be done by:
1. The Maven CLI with -s, --settings <file>
    mvn -s customerXXXMvnSetting.xml
2. With Git:
    Initialize a Git repo in the Maven settings directory (typically ~/.m2) and create a branch for each customer, for example
    git checkout -b customerXXXMvnSetting

Link