Re: M/R file gather/scatter issue
Seems like CombineFileInputFormat.createPool() might help here, but I'm
a little unclear on the usage. That method is protected, so I guess
it's only accessible to subclasses?

Can anyone advise on usage here?

Thanks,

DR
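
[For reference: yes, the intended usage is to subclass it - createPool()
is protected so that a subclass can call it, typically from its
constructor, one pool per group of files that may share a split. A
minimal sketch, assuming the new-API classes in
org.apache.hadoop.mapreduce.lib.input (the older
org.apache.hadoop.mapred.lib flavor takes a JobConf as createPool()'s
first argument); Text stands in for the actual key class, the "typeA"
file-naming convention is invented, and PerFileSequenceReader is a
hypothetical per-file reader sketched further down the thread:]

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class PooledCombineInputFormat
    extends CombineFileInputFormat<Text, BytesWritable> {

  public PooledCombineInputFormat() {
    // Files matching the same pool's filter may be combined into one
    // split; files from different pools never are.
    createPool(new PathFilter() {
      public boolean accept(Path path) {
        return path.getName().startsWith("typeA");  // invented convention
      }
    });
    // ...one createPool(...) call per remaining file type...
  }

  @Override
  public RecordReader<Text, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    return new CombineFileRecordReader<Text, BytesWritable>(
        (CombineFileSplit) split, context, PerFileSequenceReader.class);
  }
}

[Pooling per file type also guarantees a combined split never mixes
types, which matters for the scatter problem discussed below.]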

On 12/08/2010 11:25 AM, David Rosenstrauch wrote:
> Bit of a snag here:
>
> Since I'm thinking this app needs to use CombineFileInputFormat (lots
> of small files), this throws a wrench into the plan a bit.
> CombineFileInputFormat creates CombineFileSplits, not FileSplits, and
> a CombineFileSplit only contains a list of all the file paths whose
> data is included in the split - with no way to identify which file
> path a particular record came from.
>
> Any workaround here?
>
> Thanks,
>
> DR
>
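
[One workaround, sketched here for reference: although the
CombineFileSplit itself only carries the list of paths,
CombineFileRecordReader constructs one child RecordReader per file in
the split and passes it the split plus an Integer index, so each child
reader can recover exactly which file it is reading via
split.getPath(index). A rough sketch of such a reader, assuming the
new-API classes, with Text standing in for the real key class:]

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class PerFileSequenceReader extends RecordReader<Text, BytesWritable> {

  private final Path path;          // the single file this child reader owns
  private final SequenceFile.Reader in;
  private final Text rawKey = new Text();   // stand-in for the real key class
  private final Text taggedKey = new Text();
  private final BytesWritable value = new BytesWritable();
  private boolean more = true;

  // CombineFileRecordReader instantiates child readers reflectively with
  // exactly this signature; the Integer is this file's index within the
  // CombineFileSplit - which is how per-file identity survives combining.
  public PerFileSequenceReader(CombineFileSplit split,
      TaskAttemptContext context, Integer index) throws IOException {
    path = split.getPath(index);
    in = new SequenceFile.Reader(
        path.getFileSystem(context.getConfiguration()), path,
        context.getConfiguration());
    // Assumes each small file appears whole in the split; a real reader
    // would also honor split.getOffset(index) / split.getLength(index).
  }

  @Override public void initialize(InputSplit s, TaskAttemptContext c) { }

  @Override public boolean nextKeyValue() throws IOException {
    more = in.next(rawKey, value);
    if (more) {
      // Fold the source-file name into the key so it reaches the mapper
      // and, from there, the reducer. A real job would likely use a
      // composite Writable rather than string concatenation.
      taggedKey.set(path.getName() + "\t" + rawKey.toString());
    }
    return more;
  }

  @Override public Text getCurrentKey() { return taggedKey; }
  @Override public BytesWritable getCurrentValue() { return value; }
  @Override public float getProgress() { return more ? 0.0f : 1.0f; }
  @Override public void close() throws IOException { in.close(); }
}

[Folding the file name into the key this way is what makes the tagging
idea suggested below work even for combined splits.]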
> On 12/07/2010 11:08 PM, David Rosenstrauch wrote:
>> Thanks for the suggestion Shrijeet.
>>
>> The same thought occurred to me on the way home from work after I
>> sent this mail. Not sure why, but my brain was kinda locked onto the
>> concept of the mapper being a no-op in this situation. Obviously it
>> doesn't have to be.
>>
>> Let me try hacking this together and see how it goes. Thanks again
>> for helping clarify my thinking.
>>
>> DR
>>
>> On 12/07/2010 07:02 PM, Shrijeet Paliwal wrote:
>>> Hmm, how about modifying the key that you collect in the mapper to
>>> include some *additional* information (like the filename) to hint
>>> the reducer about the record's origin?
>>>
>>> -Shrijeet
>>>
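
[That suggestion in sketch form, for the ordinary one-file-per-split
case; TaggingMapper and the tab-separated tag format are invented for
illustration, and Text again stands in for the real key class. With a
CombineFileSplit the mapper can't see a single path - the snag noted
higher up - so there the tagging has to happen in the per-file reader
sketched above:]

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TaggingMapper
    extends Mapper<Text, BytesWritable, Text, BytesWritable> {

  private String fileTag;
  private final Text taggedKey = new Text();

  @Override
  protected void setup(Context context) {
    // Only works when the split is a plain FileSplit (one file per split).
    fileTag = ((FileSplit) context.getInputSplit()).getPath().getName();
  }

  @Override
  protected void map(Text key, BytesWritable value, Context context)
      throws IOException, InterruptedException {
    taggedKey.set(fileTag + "\t" + key.toString());
    // Note: a tagged key sorts/partitions by tag first; to keep the
    // re-sort by the original key, pair this with a custom Partitioner
    // (and grouping comparator) that looks only at the key part.
    context.write(taggedKey, value);
  }
}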
>>> On Tue, Dec 7, 2010 at 3:43 PM, David Rosenstrauch <[EMAIL PROTECTED]>
>>> wrote:
>>>> Having an issue with some SequenceFiles that I generated, and I'm
>>>> trying to write an M/R job to fix them.
>>>>
>>>> The situation is roughly this:
>>>>
>>>> I have a bunch of directories in HDFS, each of which contains a set
>>>> of 7 sequence files. Each sequence file is of a different "type",
>>>> but the key type is the same across all of the sequence files. The
>>>> value types - which are compressed - are also the same when in
>>>> compressed form (i.e., BytesWritable), though the different record
>>>> types are obviously different when uncompressed.
>>>>
>>>> I want to write a job to fix some problems in the files. My thinking
>>>> is that I can feed all the data from all the files into an M/R job
>>>> (i.e., gather), re-sort/partition the data properly, perform some
>>>> additional cleanup/fixup in the reducer, and then write the data
>>>> back out to a new set of files (i.e., scatter).
>>>>
>>>> I've been digging through the APIs, and it looks like
>>>> CombineFileInputFormat / CombineFileRecordReader might be the way to
>>>> go here. It'd let me merge the whole load of data from each of the
>>>> (small) files into one M/R job in an efficient way.
>>>>
>>>> Sorting would then occur by key, as would partitioning, so I'm still
>>>> good so far.
>>>>
>>>> The problem, however, is when I get to the reducer. The reducer
>>>> needs to know which type of file data (i.e., which type of source
>>>> file) a record came from so that it can a) uncompress/deserialize
>>>> the data correctly, and b) scatter it out to the correct type of
>>>> output file.
>>>>
>>>> I'm not entirely clear how to make this happen. It seems like the
>>>> source file information (which looks like it might exist on the
>>>> CombineFileSplit) is no longer available by the time it gets to the
>>>> reducer. And if the reducer doesn't know which file a given record
>>>> came from, it won't know how to process it properly.
>>>>
>>>> Can anyone lend some suggestions on how to code this solution? Am I
>>>> on the right track with the CombineFileInputFormat /
>>>> CombineFileRecordReader approach? If so, then how might I make the
>>>> reducer code aware of the source of the record(s) it's currently
>>>> processing?
>>>>
>>>> TIA!
>>>>
>>>> DR
>>>>
>>
>
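
[For the scatter half of the original question: once the tag reaches
the reducer, MultipleOutputs can route each record type to its own
output file. A rough sketch under the same tagged-Text-key assumption,
assuming a Hadoop version that has the new-API MultipleOutputs in
org.apache.hadoop.mapreduce.lib.output. Note that named-output names
must be alphanumeric, so each file-type tag would need to be a simple
name registered in the driver, e.g. MultipleOutputs.addNamedOutput(job,
"typeA", SequenceFileOutputFormat.class, Text.class,
BytesWritable.class):]

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ScatterReducer
    extends Reducer<Text, BytesWritable, Text, BytesWritable> {

  private MultipleOutputs<Text, BytesWritable> out;

  @Override
  protected void setup(Context context) {
    out = new MultipleOutputs<Text, BytesWritable>(context);
  }

  @Override
  protected void reduce(Text taggedKey, Iterable<BytesWritable> values,
      Context context) throws IOException, InterruptedException {
    // Peel the source-file tag back off the key (mirrors the tagging
    // sketches above).
    String[] parts = taggedKey.toString().split("\t", 2);
    String tag = parts[0];
    Text key = new Text(parts[1]);
    for (BytesWritable value : values) {
      // ...per-type uncompress/deserialize and fix-up would go here...
      out.write(tag, key, value);  // one named output per source-file type
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    out.close();  // required, or the named outputs won't be flushed
  }
}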