record-aware file splitting (Hadoop dev mailing list)


Daren Hasenkamp 2010-06-01, 20:56
Re: record-aware file splitting
Hey Daren,

Your idea has some pedigree in the Hadoop universe: it was proposed in early
2006 at https://issues.apache.org/jira/browse/HADOOP-106 and closed as
"won't fix". The suggestion there is to pad out the rest of the block for
very large records, as the complexity added to the file system for splitting
blocks on record boundaries is high.
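
For concreteness, here is a minimal sketch (not taken from HADOOP-106 or from
this thread) of what "pad out the rest of the block" could look like on the
write path. It assumes a fixed 64 MB block size and a reader that knows to
skip filler bytes; the class name and constant are hypothetical.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Writes serialized records so that no record smaller than a block ever
// straddles a block boundary: if the next record would not fit in the
// current block, the remainder of the block is filled with zero bytes.
public class BlockPaddingRecordWriter {
  private static final long BLOCK_SIZE = 64L * 1024 * 1024; // assumed HDFS block size
  private final FSDataOutputStream out;

  public BlockPaddingRecordWriter(Configuration conf, Path path) throws IOException {
    this.out = FileSystem.get(conf).create(path);
  }

  public void writeRecord(byte[] record) throws IOException {
    long remaining = BLOCK_SIZE - (out.getPos() % BLOCK_SIZE);
    if (record.length > remaining && record.length <= BLOCK_SIZE) {
      // Filler bytes waste some space but keep the record inside one block;
      // the corresponding reader must recognize and skip them.
      out.write(new byte[(int) remaining]);
    }
    out.write(record);
  }

  public void close() throws IOException {
    out.close();
  }
}

The trade-off is wasted space at the end of each padded block, which is why
the JIRA discussion preferred this approach over adding record awareness to
the file system itself.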

That said, if you feel strongly about your direction, feel free to open a
new JIRA issue, link it to the old one, and make your case there. You may
also be interested in authoring a HEP (see
http://www.cloudera.com/blog/2010/06/the-second-apache-hadoop-hdfs-and-mapreduce-contributors-meeting)
describing your intent.

One last recommendation: I note that you have not made large changes to the
Hadoop code base yet. You may want to start with a slightly smaller project
to get your feet wet. Lots of folks would be happy to guide you to an
appropriate project.

Regards,
Jeff

On Tue, Jun 1, 2010 at 1:56 PM, Daren Hasenkamp <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am interested in implementing record-aware file splitting for Hadoop. I
> am looking for someone who knows the hadoop internals well and is willing
> to discuss some details of how to accomplish this.
>
> By "record-aware file splitting", I mean that I want to be able to put
> files into Hadoop with a custom InputFormat implementation, and Hadoop
> will split the files into blocks such that no record is split between
> blocks.
>
> I believe that record-aware file splitting could offer considerable
> speedup when dealing with large records--say, tens or hundreds of
> megabytes per record--since it eliminates the need to stream part of a
> record from one datanode to another when the record straddles a block
> boundary.
>
> (The motivation here is that large records occur commonly when dealing
> with scientific datasets. Imagine, for example, a set of climate
> simulation data, where each "record" consists of climate data over the
> entire globe at a given time step. This is a huge amount of data per
> record. Essentially, I want to modify Hadoop to work faster with large
> scientific datasets.)
>
> If you are interested in discussing this with me, I would love to talk
> more with you.
>
> Thanks!
> Daren Hasenkamp
> Computer Science/Applied Mathematics, UC Berkeley
> Student Assistant, Lawrence Berkeley National Lab
>
>
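
The "stream part of a record from one datanode to another" cost that Daren
describes comes from how split-oblivious readers work today: each reader
simply keeps reading past the end of its split to finish its last record,
and HDFS fetches those trailing bytes from whichever datanode holds the next
block. A minimal sketch of that pattern follows; the length-prefixed record
format and the assumption that the split starts on a record boundary are
illustrative, not anything specified in this thread.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CrossBlockSplitReader {
  // Reads every record whose first byte falls inside [splitStart, splitEnd);
  // the last record may extend past splitEnd, forcing a read into the next
  // HDFS block, possibly streamed from another datanode.
  public static void readSplit(Configuration conf, Path file,
                               long splitStart, long splitEnd) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(file);
    try {
      // Assumes splitStart lies on a record boundary (e.g. found via an index
      // or sync markers); locating a boundary from an arbitrary byte offset
      // is exactly the difficulty this thread is about.
      in.seek(splitStart);
      long pos = splitStart;
      while (pos < splitEnd) {
        int recordLength = in.readInt();      // 4-byte length prefix (assumed format)
        byte[] record = new byte[recordLength];
        in.readFully(record);
        // process(record) would go here
        pos = in.getPos();
      }
    } finally {
      in.close();
    }
  }
}

Record-aware splitting (or block padding, as sketched earlier) would make the
split end coincide with a record boundary, so that final cross-block read
never happens.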