RE: HDFS data and non-aligned splits
John Lilley 2013-05-23, 17:59
Related to this, I see in the elephant book (Hadoop: The Definitive Guide), under "Which compression format should I use":
"Use a container file format such as Sequence File..."
Does SequenceFile attempt to align compressed data on block boundaries?
From: John Lilley [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 23, 2013 11:53 AM
To: [EMAIL PROTECTED]
Subject: HDFS data and non-aligned splits
What happens when MapReduce produces input splits that don't align with HDFS block boundaries? I've read that MapReduce tries to place splits near block boundaries to improve data locality, but isn't there always some slop where a record straddles a block boundary, so that the task needs an extra HDFS connection just to fetch the half-record in the other block? Does this affect performance? Are there file formats that try to enforce data alignment?
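To make the straddling case concrete, here is a minimal sketch of the convention that line-oriented readers commonly use for records that cross a split boundary. It is similar in spirit to how Hadoop's text input is usually described, but it is not Hadoop code. The names read_split and split_ranges, the sample data, and the 10-byte split size are all hypothetical, chosen only for illustration. Each reader discards the first (possibly partial) line unless it starts at offset 0, and reads past its own end to finish any record that begins at or before that end.

```python
from typing import List, Tuple


def split_ranges(total: int, size: int) -> List[Tuple[int, int]]:
    """Cut [0, total) into consecutive (start, end) byte ranges of at most `size`."""
    return [(s, min(s + size, total)) for s in range(0, total, size)]


def read_split(data: bytes, start: int, end: int) -> List[bytes]:
    """Return the newline-delimited records owned by the split [start, end).

    Assumed convention (illustrative, not taken from Hadoop source):
      * a split that does not start at offset 0 discards everything up to and
        including the first newline, because the previous split's reader
        owns that line;
      * a split keeps reading while the next record begins at or before
        `end`, so it may read past `end` to finish a straddling record.
    """
    pos = start
    if start != 0:
        nl = data.find(b"\n", pos)
        if nl == -1:
            return []          # no line begins inside this split
        pos = nl + 1           # skip the line owned by the previous split

    records = []
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            nl = len(data)     # last record has no trailing newline
        records.append(data[pos:nl])
        pos = nl + 1
    return records


# Hypothetical sample: four records, 26 bytes, cut into 10-byte splits.
DATA = b"alpha\nbravo\ncharlie\ndelta\n"

for start, end in split_ranges(len(DATA), 10):
    print((start, end), read_split(DATA, start, end))
```

In this sketch the record "bravo" begins inside the first split and ends inside the second, so the first reader runs past its own end to finish it. That extra read is the toy equivalent of the remote fetch asked about above. Every record is still returned exactly once, whatever split size is chosen.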