RE: HDFS data and non-aligned splits
Related to this, I see in the elephant book (Hadoop: The Definitive Guide), under "Which compression format should I use":
"Use a container file format such as Sequence File..."
Does SequenceFile attempt to align compressed data on block boundaries?

From: John Lilley [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 23, 2013 11:53 AM
Subject: HDFS data and non-aligned splits

What happens when MapReduce produces input splits, and those splits don't align on HDFS block boundaries? I've read that MapReduce will attempt to place splits near block boundaries to improve data locality, but isn't there always some slop where a record straddles a block boundary, so that the task needs an extra HDFS connection just to fetch the half-record in the other block? Does this impact performance? Are there file formats that attempt to enforce data alignment?
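
To make the "slop" in the question concrete, here is a small Python simulation of the convention that line-oriented readers (such as Hadoop's LineRecordReader, as far as I understand it) use to handle records that straddle a split boundary. It is only a sketch under stated assumptions, not Hadoop's actual code: records are newline-terminated, each split is assumed to cover exactly one block, and the names read_split, DATA and SPLIT_SIZE are hypothetical. A reader whose split does not start at offset 0 discards everything up to the first newline, and every reader keeps going until it has finished the last record that begins at or before the end of its split.

```python
def read_split(data: bytes, start: int, length: int):
    """Return (records, overrun) for the split [start, start + length).

    overrun is the number of bytes read past the end of the split, i.e. the
    part that would come from the next block in this simplified model.
    """
    end = start + length
    pos = start
    if start != 0:
        # The previous split's reader owns the line we landed in: skip it.
        newline = data.find(b"\n", pos)
        if newline == -1:
            return [], 0
        pos = newline + 1
    records = []
    # Read every record that *starts* at or before the end of the split,
    # even if finishing it means reading beyond the split.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline + 1
        records.append(data[pos:stop])
        pos = stop
    return records, max(0, pos - end)


# Hypothetical data: four records of 6, 6, 8 and 6 bytes, cut into 10-byte splits.
DATA = b"alpha\nbravo\ncharlie\ndelta\n"
SPLIT_SIZE = 10

results = [read_split(DATA, s, SPLIT_SIZE) for s in range(0, len(DATA), SPLIT_SIZE)]
for i, (records, overrun) in enumerate(results):
    print(f"split {i}: records={records} overrun={overrun}")
```

In this model every record is read exactly once, and each split reads past its own end by at most one record. That is the extra remote read the question asks about: its size is bounded by the record length, not by the block size.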