Hadoop, mail # dev - record-aware file splitting

Re: record-aware file splitting
Jeff Hammerbacher 2010-06-03, 08:38
Hey Daren,

Your idea has some pedigree in the Hadoop universe: it was proposed in early
2006 at https://issues.apache.org/jira/browse/HADOOP-106 and closed as
"won't fix". The suggestion there is to pad out the rest of the block for
very large records, as the complexity added to the file system for splitting
blocks on record boundaries is high.
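The padding approach suggested on HADOOP-106 can be sketched in a few lines of plain Java. Everything below is illustrative, not Hadoop API: the class, method, and a fixed 128 MB block size are assumptions made for the sketch. The idea is simply that, before writing a record that would straddle a block boundary, the writer emits filler bytes so the record starts at the next block instead.

```java
// Hypothetical sketch of the "pad to block boundary" idea from HADOOP-106.
// Not Hadoop API; names and the 128 MB block size are assumptions.
public class BlockPadding {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // assume 128 MB HDFS blocks

    /**
     * Bytes of filler to write before a record of length recordLen starting at
     * offset, so the record does not cross a block boundary. Returns 0 when the
     * record already fits in the current block, or when it is larger than a
     * whole block (padding cannot help in that case).
     */
    static long paddingNeeded(long offset, long recordLen) {
        if (recordLen > BLOCK_SIZE) return 0;          // record spans blocks regardless
        long remaining = BLOCK_SIZE - (offset % BLOCK_SIZE);
        return recordLen <= remaining ? 0 : remaining; // pad out the rest of the block
    }

    public static void main(String[] args) {
        long offset = 0;
        long[] records = {100L << 20, 50L << 20, 90L << 20}; // 100 MB, 50 MB, 90 MB
        for (long len : records) {
            long pad = paddingNeeded(offset, len);
            // first record: pad 0; second: pad 28 MB (only 28 MB left in block);
            // third: pad 78 MB so it starts at the 256 MB boundary
            System.out.println("offset=" + (offset >> 20) + "MB pad=" + (pad >> 20)
                    + "MB len=" + (len >> 20) + "MB");
            offset += pad + len;
        }
    }
}
```

The cost of this scheme is wasted space inside blocks (up to one record's worth per block), which is the trade the JIRA discussion weighed against file-system complexity.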

That said, if you feel strongly about your direction, feel free to open a
new JIRA issue, link it to the old one, and have at your argument. You may
also be interested in authoring a HEP describing your intent.

One last recommendation: I note that you have not made large changes to the
Hadoop code base yet. You may want to start with a slightly smaller project
to get your feet wet. Lots of folks would be happy to guide you to an
appropriate project.


On Tue, Jun 1, 2010 at 1:56 PM, Daren Hasenkamp <[EMAIL PROTECTED]> wrote:

> Hi,
> I am interested in implementing record-aware file splitting for Hadoop. I
> am looking for someone who knows the Hadoop internals well and is willing
> to discuss some details of how to accomplish this.
> By "record-aware file splitting", I mean that I want to be able to put
> files into Hadoop with a custom InputFormat implementation, and Hadoop
> will split the files into blocks such that no record is split between
> blocks.
> I believe that record-aware file splitting could offer considerable
> speedup when dealing with large records--say, tens or hundreds of
> megabytes per record--since it eliminates the need to stream part of a
> record from one datanode to another when a record straddles a block
> boundary.
> (The motivation here is that large records occur commonly when dealing
> with scientific datasets. Imagine, for example, a set of climate
> simulation data, where each "record" consists of climate data over the
> entire globe at a given time step. This is a huge amount of data per
> record. Essentially, I want to modify Hadoop to work faster with large
> scientific datasets.)
> If you are interested in discussing this with me, I would love to talk
> more with you.
> Thanks!
> Daren Hasenkamp
> Computer Science/Applied Mathematics, UC Berkeley
> Student Assistant, Lawrence Berkeley National Lab
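For contrast with the proposal above, the way Hadoop handles boundary-crossing records today (for newline-delimited text) can be imitated in a self-contained sketch. This is not Hadoop API; it mimics the logic of TextInputFormat's line reader under the simplifying assumption of in-memory byte arrays: every reader except the first skips the partial record at the front of its split, and every reader finishes the last record it starts even when that record extends past the end of its split. That read past the split end is precisely the cross-block (and potentially cross-datanode) transfer that becomes expensive for the very large records described in the email.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch, not Hadoop API: imitates how a line-oriented
// RecordReader assigns newline-delimited records to splits.
public class SplitDemo {
    /**
     * Records produced by the reader for split [start, end): skip the partial
     * record at the front (unless start == 0), then read whole records,
     * including one that may run past 'end' into the next split's bytes.
     */
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // skip the (possibly partial) record the previous reader owns
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the delimiter
        }
        // read a record as long as it *starts* at or before 'end'
        while (pos < data.length && pos <= end) {
            int recStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            records.add(new String(data, recStart, pos - recStart));
            pos++; // skip '\n'
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aaaa\nbbbb\ncccc\ndddd\n".getBytes();
        // three splits of 8/8/4 bytes; the middle reader must read past
        // byte 16 to finish "dddd" -- the cross-split read in question
        System.out.println(readSplit(data, 0, 8));   // prints [aaaa, bbbb]
        System.out.println(readSplit(data, 8, 16));  // prints [cccc, dddd]
        System.out.println(readSplit(data, 16, 20)); // prints []
    }
}
```

Every record is consumed exactly once, with no record-aware placement needed; the trade-off is that the tail read crosses into the next block, which is cheap for small lines but costly when a single record is tens of megabytes on a remote datanode.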