Re: MapReduce processing with extra (possibly non-serializable) configuration
I have considered the DistributedCache and will probably be using it, but
in order to have a file to cache I need to serialize the configuration
object first. :-)
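
If I do manage to serialize it, the driver side would look roughly like
this (just a sketch; the path, the SplitConfig type, and the splitConfig
variable are placeholders for my own code):

    import java.io.ObjectOutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Serialize the configuration object to a file on HDFS, then
    // register that file with the DistributedCache for the job.
    Configuration conf = new Configuration();
    Path cacheFile = new Path("/tmp/split-config.ser"); // placeholder path
    FileSystem fs = FileSystem.get(conf);
    ObjectOutputStream out = new ObjectOutputStream(fs.create(cacheFile));
    out.writeObject(splitConfig); // assumes SplitConfig implements Serializable
    out.close();
    DistributedCache.addCacheFile(cacheFile.toUri(), conf);
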
On Thu, Feb 21, 2013 at 5:55 PM, feng lu <[EMAIL PROTECTED]> wrote:

> Hi
>
> Maybe you can look at the usage of DistributedCache [0]. It's a facility
> provided by the MR framework to cache files (text, archives, jars, etc.)
> needed by applications.
>
> [0]
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
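>
> For example, the mapper side can read the local copies back roughly like
> this (just a sketch; none of the names here are required):
>
>     import java.io.IOException;
>     import org.apache.hadoop.filecache.DistributedCache;
>     import org.apache.hadoop.fs.Path;
>
>     // In the Mapper: locate the node-local copy of the cached file.
>     @Override
>     protected void setup(Context context)
>         throws IOException, InterruptedException {
>       Path[] cached = DistributedCache.getLocalCacheFiles(
>           context.getConfiguration());
>       // cached[0] is the local path of the file the driver added
>     }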
>
>
> On Fri, Feb 22, 2013 at 5:10 AM, Public Network Services <
> [EMAIL PROTECTED]> wrote:
>
>> Hi...
>>
>> I am trying to put an existing file processing application into Hadoop
>> and need to find the best way of propagating some extra configuration per
>> split, in the form of complex and proprietary custom Java objects.
>>
>> The general idea is:
>>
>>    1. A custom InputFormat splits the input data
>>    2. The same InputFormat prepares the appropriate configuration for
>>    each split (see the sketch after this list)
>>    3. Hadoop processes each split in MapReduce, using the split itself
>>    and the corresponding configuration
>>
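>> To make step 2 concrete, I imagine something like this (the class and
>> field names are made up; it is just the shape of the idea):
>>
>>     import java.io.DataInput;
>>     import java.io.DataOutput;
>>     import java.io.IOException;
>>     import org.apache.hadoop.mapreduce.lib.input.FileSplit;
>>
>>     // A split that carries its own serialized configuration blob,
>>     // written and read along with the standard split fields.
>>     public class ConfiguredSplit extends FileSplit {
>>       private byte[] configBytes = new byte[0];
>>
>>       public ConfiguredSplit() {} // needed for deserialization
>>
>>       @Override
>>       public void write(DataOutput out) throws IOException {
>>         super.write(out);
>>         out.writeInt(configBytes.length);
>>         out.write(configBytes);
>>       }
>>
>>       @Override
>>       public void readFields(DataInput in) throws IOException {
>>         super.readFields(in);
>>         configBytes = new byte[in.readInt()];
>>         in.readFully(configBytes);
>>       }
>>     }
>>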
>> The problem is that these configuration objects contain a lot of
>> properties and references to other complex objects, and so on, so it
>> will take a lot of work to cover all the possible combinations and make
>> the whole thing serializable (if it can be done in the first place).
>>
>> Most probably this is the only way forward, but if anyone has ever dealt
>> with this problem, please suggest the best approach to follow.
>>
>> Thanks!
>>
>>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>