Hadoop, mail # user - Map/Reduce and sequence file metadata...


RE: Map/Reduce and sequence file metadata...
Andy Sautins 2009-10-02, 20:48

  Thanks for the response Tom.  I'll probably try the approach of extending SequenceFileOutputFormat to write sequence file metadata.

  What I am getting from your response is that it doesn't seem like using the sequence file metadata is that common, especially for sequence files generated as map/reduce output.  Sounds like using MultipleInputs and having files in different locations is a more common way of addressing having different file types fed into the same job.  Does that sound right?

   Thanks again for the insight.
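For the reading side, one possible sketch (not from the thread, and untested against any particular Hadoop release): since SequenceFileInputFormat doesn't surface the metadata itself, the mapper could open a SequenceFile.Reader on the split's file and call getMetadata() in setup(). The "file.type" key and the TypeAwareMapper class name are invented for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical sketch: recover file-level metadata in a follow-up job's
// mapper by opening a Reader on the split's underlying file.
public class TypeAwareMapper extends Mapper<Text, Text, Text, Text> {

  private Text fileType;  // e.g. distinguishes the two file types per split

  @Override
  protected void setup(Context context) throws IOException {
    Path path = ((FileSplit) context.getInputSplit()).getPath();
    Configuration conf = context.getConfiguration();
    SequenceFile.Reader reader =
        new SequenceFile.Reader(path.getFileSystem(conf), path, conf);
    try {
      // "file.type" is a made-up metadata key for illustration.
      fileType = reader.getMetadata().get(new Text("file.type"));
    } finally {
      reader.close();
    }
  }
}
```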

-----Original Message-----
From: Tom White [mailto:[EMAIL PROTECTED]]
Sent: Friday, October 02, 2009 3:26 AM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: Map/Reduce and sequence file metadata...

On Thu, Oct 1, 2009 at 5:10 PM, Andy Sautins
<[EMAIL PROTECTED]> wrote:
>
>   Hi all. I'm struggling a bit to figure this out and wondering if anyone has any pointers.
>
>   I'm using SequenceFiles as output from a MapReduce job ( using SequenceFileOutputFormat ) and then in a followup MapReduce job reading in the results using SequenceFileInputFormat.  All seems to work fine.  What I haven't figured out is how to write the SequenceFile.Metadata in the SequenceFileOutputFormat and then read the metadata in SequenceFileInputFormat.  Is that possible to do using the new mapreduce.* API?

By default no SequenceFile metadata is written by
SequenceFileOutputFormat. SequenceFile metadata is written at the
beginning of the file, so it needs to be passed in when the
SequenceFile is opened. One way of doing this would be to extend
SequenceFileOutputFormat and override the getSequenceWriter() method
to call the SequenceFile.createWriter() factory method that takes
metadata.
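
A rough sketch of that suggestion (untested, and assuming a Hadoop version whose new-API SequenceFileOutputFormat exposes the protected getSequenceWriter() hook; the "file.type" metadata key and value are invented for illustration):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch: write file-level SequenceFile metadata from a MapReduce job by
// overriding the writer-creation hook and calling the createWriter()
// overload that takes a Metadata argument.
public class MetadataSequenceFileOutputFormat<K, V>
    extends SequenceFileOutputFormat<K, V> {

  @Override
  protected SequenceFile.Writer getSequenceWriter(TaskAttemptContext context,
      Class<?> keyClass, Class<?> valueClass) throws IOException {
    Configuration conf = context.getConfiguration();

    // Work out compression settings the same way the base class does.
    CompressionCodec codec = null;
    CompressionType compressionType = CompressionType.NONE;
    if (getCompressOutput(context)) {
      compressionType = getOutputCompressionType(context);
      codec = ReflectionUtils.newInstance(
          getOutputCompressorClass(context, DefaultCodec.class), conf);
    }

    // The metadata key/value here are invented placeholders; in practice
    // they might be pulled from the job configuration.
    SequenceFile.Metadata metadata = new SequenceFile.Metadata();
    metadata.set(new Text("file.type"), new Text("typeA"));

    Path file = getDefaultWorkFile(context, "");
    FileSystem fs = file.getFileSystem(conf);
    return SequenceFile.createWriter(fs, conf, file, keyClass, valueClass,
        compressionType, codec, context, metadata);
  }
}
```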

>
>   I have two types of files I want to process in the Mapper.  Currently I'm using the  context.getInputSplit() and parsing the resulting fileSplit.getPath() to determine what file I'm processing.  It seems cleaner to use the SequenceFile.Metadata if I can.  Does that make sense or am I off in the weeds?

Another approach would be to use MultipleInputs which allows you to
use different mappers for different input paths. Could this help?
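
As a sketch of the MultipleInputs idea (using the old mapred API, where the class lives in org.apache.hadoop.mapred.lib; the paths, driver, and mapper classes are invented placeholders):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

// Hypothetical job setup: one job consumes two kinds of sequence files,
// kept in separate directories, each routed to its own mapper class.
JobConf conf = new JobConf(MyDriver.class);
MultipleInputs.addInputPath(conf, new Path("/data/typeA"),
    SequenceFileInputFormat.class, TypeAMapper.class);
MultipleInputs.addInputPath(conf, new Path("/data/typeB"),
    SequenceFileInputFormat.class, TypeBMapper.class);
```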

>
>   Thanks
>
>   Andy
>