Monday, July 29, 2013

Hadoop Compression

      Hadoop is a distributed framework for processing large amounts of data, often terabytes or petabytes. To store all that data Hadoop uses HDFS, its equivalent of Google's GFS, and lets us process it through various means like MapReduce, HBase, Hive and Pig.
     But this processing incurs a lot of network and disk I/O, so the developer needs to minimise data transfer. Hadoop provides several compression codecs for this, namely Gzip, BZip2, LZO, LZ4, Snappy and DEFLATE. Except for BZip2, Hadoop has native implementations of these, which account for a performance gain. So if you are running a 64-bit machine it is recommended to build Hadoop from source for your platform, since as of now Hadoop does not come with 64-bit native libraries. If you do not have the native libraries, Hadoop still provides Java implementations of BZip2, DEFLATE and Gzip.
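     To check whether the native codecs are actually being picked up on your platform, you can query Hadoop's NativeCodeLoader (the "hadoop checknative" command gives a similar report from the shell). A minimal sketch against the Hadoop 2.x Java API; the class name is just for illustration:

    import org.apache.hadoop.util.NativeCodeLoader;

    public class NativeCheck {
        public static void main(String[] args) {
            // True only when the platform-specific libhadoop library was found
            // and loaded; false means Hadoop falls back to its pure-Java codecs.
            boolean loaded = NativeCodeLoader.isNativeCodeLoaded();
            System.out.println("Native Hadoop library loaded: " + loaded);
        }
    }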
     But compression has a trade-off: it costs CPU time to compress at the source and decompress at the destination. So we have to choose an algorithm by weighing the disk space and transfer savings against the CPU time required. We also need to consider how compression affects processing, in particular whether the format supports splitting.
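     As an example of that trade-off, intermediate map output is usually compressed with a fast codec like Snappy to cut shuffle traffic, while the final job output can use a stronger codec like Gzip to save storage. A rough sketch using the Hadoop 2.x MapReduce API (job name and the rest of the job setup are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressionJobSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Compress intermediate map output with a fast codec (Snappy)
            // to reduce the data shuffled over the network.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compression-example");

            // Compress the final output with Gzip, trading CPU for disk space.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

            // ... set mapper, reducer, input/output paths as usual, then submit.
        }
    }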

How do we choose compression?

  1. We can use a high-performance codec like Snappy, LZ4 or LZO with a container format like SequenceFile, RCFile or Avro files, which separate the stored data into logical blocks and hence support both compression and splitting (see the SequenceFile sketch after this list). HBase recommends Snappy, since it is the fastest of them all and HBase targets near-real-time access patterns.
  2. Use a codec that supports splitting (BZip2), or one that can be indexed, like LZO. This allows all splits to be processed independently.
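 For example, a block-compressed SequenceFile with Snappy can be written directly through the SequenceFile.Writer API. A minimal sketch assuming Hadoop 2.x and that the Snappy native library is available; the output path and record contents are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class SnappySequenceFileWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path out = new Path("/tmp/records.seq");   // illustrative path

            // BLOCK compression buffers many key/value pairs and compresses them
            // together, which compresses better than per-record compression.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    SequenceFile.Writer.compression(CompressionType.BLOCK,
                            new SnappyCodec()))) {
                for (int i = 0; i < 100; i++) {
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            }
        }
    }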


 
Note:
  •  Fast storage like SSDs is more expensive than traditional HDDs, and HDDs get more cost-effective as capacity grows. Since capacity is cheap, it is usually better to choose an algorithm that compresses and decompresses quickly than one that merely saves more space.
  •  SequenceFileFormat: we can use the "mapreduce.output.fileoutputformat.compress.type" property to control how a SequenceFile is compressed. By default it compresses per record; it also supports block compression, which is preferred over record compression. The block size can be set in bytes with "io.seqfile.compress.blocksize" (see the configuration sketch after these notes).
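A small sketch of setting those two properties on a job configuration (Hadoop 2.x property names assumed; the values shown are only examples):

    import org.apache.hadoop.conf.Configuration;

    public class SequenceFileCompressionConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // RECORD (the default) compresses each value on its own;
            // BLOCK compresses batches of records together and is preferred.
            conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");

            // Approximate number of uncompressed bytes gathered into one block
            // before it is compressed (example value: ~1 MB).
            conf.setInt("io.seqfile.compress.blocksize", 1000000);

            System.out.println("compress.type = "
                    + conf.get("mapreduce.output.fileoutputformat.compress.type"));
        }
    }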




