MapReduce >> mail # user >> HDFS data and non-aligned splits


John Lilley 2013-05-23, 17:53
John Lilley 2013-05-23, 17:59
Re: HDFS data and non-aligned splits
> What happens when MR produces data splits, and those splits don’t align
on block boundaries?

The answer depends on the file format used. With any of the formats we
ship, nothing special happens: the record readers handle it.

> but isn’t there always some slop where records straddle the block
boundaries, resulting in an extra HDFS connection just to get the
half-record in the other block?

Yes, but how large is half a record (or, in the worst case, a whole
record) going to be?

> Does this impact performance?

It's just one extra, minor DN connection. The perf impact is almost zero,
and the format-free loading is a major win in operations. Compare Disco's
DDFS as one alternative: HDFS is much easier here. With Disco you have to
manage your chunking at load time, while with HDFS the MR libraries carry
the logic described at http://wiki.apache.org/hadoop/HadoopMapReduce to
process those records. Depending on how large the records are, you would
at most spend time reading anywhere from a few bytes to a few megabytes
over the network. If you use large records, it's also a good idea to raise
the file's block size.
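
To make that record-reader logic concrete, here is a rough sketch in Python.
It is not Hadoop's code: it assumes newline-delimited records held in a byte
string, and the names read_split and read_all_splits are made up for this
illustration. Every split except the first throws away its leading partial
line, and every split reads past its end to finish the last record it started.

```python
def read_split(data: bytes, start: int, length: int) -> list:
    """Return the newline-delimited records owned by split [start, start+length)."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: discard everything up to and including the
        # first newline. The previous split reads that record in full.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    # A record belongs to this split if it starts at or before `end`.
    # Reading it may run past `end`, i.e. into the next block.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        records.append(data[pos:stop])
        pos = stop + 1
    return records


def read_all_splits(data: bytes, split_size: int) -> list:
    """Cut data into fixed-size splits and concatenate what each split reads."""
    out = []
    for start in range(0, len(data), split_size):
        out.extend(read_split(data, start, min(split_size, len(data) - start)))
    return out
```

Whatever split size you pick, each record comes out exactly once, and the
only extra read is the tail of the one record that straddles the boundary.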

> Are there file formats that attempt to enforce data alignment?

I don't think there are any, and there shouldn't be, because reading
beyond split boundaries is pretty transparent to application writers. Your
HDFS reader API doesn't require you to be aware of the split.
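
As a toy model of that transparency (again Python, and not the HDFS client
API; BlockedFile and its fields are invented for this sketch), a reader can
store the file as fixed-size blocks and still serve any byte range, so the
caller never sees where one block ends and the next begins:

```python
class BlockedFile:
    """A file stored as fixed-size blocks but read as one contiguous stream."""

    def __init__(self, data: bytes, block_size: int):
        self.block_size = block_size
        self.length = len(data)
        self.blocks = [data[i:i + block_size]
                       for i in range(0, len(data), block_size)]
        self.blocks_touched = []  # which blocks each read had to visit

    def read(self, offset: int, n: int) -> bytes:
        """Read up to n bytes starting at offset, crossing blocks as needed."""
        out = bytearray()
        end = min(offset + n, self.length)
        while offset < end:
            idx, within = divmod(offset, self.block_size)
            chunk = self.blocks[idx][within:within + (end - offset)]
            self.blocks_touched.append(idx)
            out += chunk
            offset += len(chunk)
        return bytes(out)
```

A read that spans two blocks simply visits both; in the real system the
second visit is the extra DN connection discussed above.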

On Thu, May 23, 2013 at 11:23 PM, John Lilley <[EMAIL PROTECTED]> wrote:

>  What happens when MR produces data splits, and those splits don’t align
> on block boundaries?  I’ve read that MR will attempt to make data splits
> near block boundaries to improve data locality, but isn’t there always some
> slop where records straddle the block boundaries, resulting in an extra
> HDFS connection just to get the half-record in the other block?  Does this
> impact performance?  Are there file formats that attempt to enforce data
> alignment?

--
Harsh J