MapReduce, mail # user - MapReduce processing with extra (possibly non-serializable) configuration


Thread:
  Public Network Services 2013-02-21, 21:10
  Azuryy Yu 2013-02-22, 01:57
  Public Network Services 2013-02-22, 04:11
  feng lu 2013-02-22, 01:55
  Public Network Services 2013-02-22, 04:09
  Harsh J 2013-02-22, 06:15

Re: MapReduce processing with extra (possibly non-serializable) configuration
Public Network Services 2013-02-22, 06:26
I am familiar with serialization solutions and have done quite some work in
this area, but wanted to confirm that I need to follow that path.

Thanks for the advice! :-)
On Thu, Feb 21, 2013 at 10:24 PM, feng lu <[EMAIL PROTECTED]> wrote:

> Yes, you are right. First upload the serialized configuration file to
> HDFS, then retrieve that file in the Mapper#configure method of each
> Mapper and deserialize it back into a configuration object.
>
> It seems that serializing the configuration is required. You can find
> many data serialization systems, such as Avro, Protobuf, etc.
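
A minimal sketch of this approach (assuming plain Java serialization
rather than Avro or Protobuf; MyJobConfig, the configuration key, and the
storeConfig helper are hypothetical names, and Mapper#setup is the
new-API counterpart of the old Mapper#configure hook):

    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ConfigAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Hypothetical key under which the driver records the HDFS location
        // of the serialized configuration object.
        public static final String CONFIG_PATH_KEY = "myjob.config.path";

        // Hypothetical application class; must implement java.io.Serializable.
        private MyJobConfig jobConfig;

        // Driver side: serialize the configuration object to HDFS and record
        // its path in the job Configuration before submitting the job.
        public static void storeConfig(Configuration conf, MyJobConfig config,
                                       Path path) throws IOException {
            FileSystem fs = FileSystem.get(conf);
            try (ObjectOutputStream out =
                     new ObjectOutputStream(fs.create(path, true))) {
                out.writeObject(config);
            }
            conf.set(CONFIG_PATH_KEY, path.toString());
        }

        // Task side: read the object back once per mapper.
        @Override
        protected void setup(Context context) throws IOException {
            Configuration conf = context.getConfiguration();
            Path path = new Path(conf.get(CONFIG_PATH_KEY));
            FileSystem fs = FileSystem.get(conf);
            try (ObjectInputStream in = new ObjectInputStream(fs.open(path))) {
                jobConfig = (MyJobConfig) in.readObject();
            } catch (ClassNotFoundException e) {
                throw new IOException("Cannot deserialize job configuration", e);
            }
        }
    }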
>
> On Fri, Feb 22, 2013 at 12:11 PM, Public Network Services <
> [EMAIL PROTECTED]> wrote:
>
>> You mean save the serialized configuration object in the custom split
>> file, retrieve that in the Mapper, reconstruct the configuration and use
>> the rest of the split file (i.e., the actual data) as input to the map
>> function?
>>
>>
>> On Thu, Feb 21, 2013 at 5:57 PM, Azuryy Yu <[EMAIL PROTECTED]> wrote:
>>
>>> I just have one simple suggestion for you: write a custom split to
>>> replace FileSplit and include all your special configurations in it,
>>> then write a custom InputFormat.
>>>
>>> During the map phase, you can get this split and, from it, all your
>>> special configurations.
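
A minimal sketch of such a split (assuming the Hadoop 2.x new API, where
FileSplit has a public no-arg constructor, and assuming the special
configuration can be reduced to a byte array; the class name is made up):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // A FileSplit that also carries per-split configuration bytes. The
    // framework ships splits to tasks via Writable, so the extra payload
    // must be handled in write()/readFields().
    public class ConfiguredFileSplit extends FileSplit {

        private byte[] configBytes = new byte[0];

        public ConfiguredFileSplit() {
            // No-arg constructor required for Writable deserialization.
        }

        public ConfiguredFileSplit(Path file, long start, long length,
                                   String[] hosts, byte[] configBytes) {
            super(file, start, length, hosts);
            this.configBytes = configBytes;
        }

        public byte[] getConfigBytes() {
            return configBytes;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            super.write(out);
            out.writeInt(configBytes.length);
            out.write(configBytes);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            super.readFields(in);
            configBytes = new byte[in.readInt()];
            in.readFully(configBytes);
        }
    }

In the map task, context.getInputSplit() can then be cast to
ConfiguredFileSplit and the bytes deserialized back into the
configuration object.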
>>>
>>>
>>>
>>> On Fri, Feb 22, 2013 at 5:10 AM, Public Network Services <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> Hi...
>>>>
>>>> I am trying to port an existing file-processing application to Hadoop
>>>> and need to find the best way of propagating some extra configuration per
>>>> split, in the form of complex and proprietary custom Java objects.
>>>>
>>>> The general idea is:
>>>>
>>>>    1. A custom InputFormat splits the input data
>>>>    2. The same InputFormat prepares the appropriate configuration for
>>>>    each split
>>>>    3. Hadoop processes each split in MapReduce, using the split itself
>>>>    and the corresponding configuration (see the sketch just after
>>>>    this list)
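
A rough sketch of steps 1 and 2 (reusing TextInputFormat's splitting
logic and a config-carrying split like the ConfiguredFileSplit sketched
earlier in the thread; buildConfigFor() is a hypothetical
application-specific helper):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Wraps each FileSplit produced by the stock splitter in a
    // ConfiguredFileSplit carrying that split's own configuration payload.
    public class ConfiguredInputFormat extends TextInputFormat {

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            List<InputSplit> wrapped = new ArrayList<InputSplit>();
            for (InputSplit split : super.getSplits(job)) {
                FileSplit fileSplit = (FileSplit) split;
                wrapped.add(new ConfiguredFileSplit(
                        fileSplit.getPath(), fileSplit.getStart(),
                        fileSplit.getLength(), fileSplit.getLocations(),
                        buildConfigFor(fileSplit)));
            }
            return wrapped;
        }

        // Hypothetical: derive and serialize the per-split configuration.
        private byte[] buildConfigFor(FileSplit split) {
            return new byte[0];
        }
    }

The job would then just call job.setInputFormatClass(ConfiguredInputFormat.class);
the inherited record reader keeps working because the custom split is
still a FileSplit.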
>>>>
>>>> The problem is that these configuration objects contain a lot of
>>>> properties and references to other complex objects, and so on, so it
>>>> will take a lot of work to cover all the possible combinations and
>>>> make the whole thing serializable (if it can be done in the first
>>>> place).
>>>>
>>>> Most probably this is the only way forward, but if anyone has ever
>>>> dealt with this problem, please suggest the best approach to follow.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>
>>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>