MapReduce >> mail # user >> HDFS data and non-aligned splits


RE: HDFS data and non-aligned splits
Related to this, I see in the elephant book ("Hadoop: The Definitive Guide"), under "Which compression format should I use":
"Use a container file format such as Sequence File..."
Does SequenceFile attempt to align compressed data on block boundaries?

From: John Lilley [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 23, 2013 11:53 AM
To: [EMAIL PROTECTED]
Subject: HDFS data and non-aligned splits

What happens when MR produces data splits, and those splits don't align on block boundaries?  I've read that MR will attempt to make data splits near block boundaries to improve data locality, but isn't there always some slop where records straddle the block boundaries, resulting in an extra HDFS connection just to get the half-record in the other block?  Does this impact performance?  Are there file formats that attempt to enforce data alignment?
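
To make the question concrete, here is a minimal sketch of one common convention for reading newline-delimited records from byte-range splits that do not align with record boundaries. It is modelled on how line-oriented text input is usually described for Hadoop, but it is an illustration under assumptions, not Hadoop's actual code: `read_split`, the sample data, and the split size are all hypothetical. Every reader except the first discards the partial line at the start of its range, and every reader finishes the last line it begins, even if that means reading a little past the end of its range (the "slop" that may live in the next block).

```python
def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the newline-delimited records owned by the split [start, start + length)."""
    end = min(start + length, len(data))
    pos = start
    if start != 0:
        # Not the first split: discard everything up to and including the next
        # newline. That line is emitted by the reader of the previous split.
        nl = data.find(b"\n", pos)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    # Emit every line that begins at or before `end`. The final line may extend
    # past `end`, which is the read that can cross into the next block.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            records.append(data[pos:])  # last line has no trailing newline
            break
        records.append(data[pos:nl])
        pos = nl + 1
    return records


if __name__ == "__main__":
    data = b"alpha\nbravo\ncharlie\ndelta\n"
    split_size = 10  # hypothetical; deliberately not aligned with record boundaries
    for start in range(0, len(data), split_size):
        print(start, read_split(data, start, split_size))
```

Under this convention each record is read exactly once, and the cost of a straddling record is one extra read of at most one record's worth of bytes beyond the split.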
