MapReduce processing with extra (possibly non-serializable) configuration
Public Network Services 2013-02-21, 21:10
Hi...
I am trying to put an existing file processing application into Hadoop and need to find the best way of propagating some extra configuration per split, in the form of complex and proprietary custom Java objects.
The general idea is:

1. A custom InputFormat splits the input data
2. The same InputFormat prepares the appropriate configuration for each split
3. Hadoop processes each split in MapReduce, using the split itself and the corresponding configuration
The problem is that these configuration objects contain a lot of properties and references to other complex objects, and so on, so it would take a lot of work to cover all the possible combinations and make the whole thing serializable (if that can be done in the first place).
Most probably this is the only way forward, but if anyone has ever dealt with this problem, please suggest the best approach to follow.
Thanks!
Re: MapReduce processing with extra (possibly non-serializable) configuration
Azuryy Yu 2013-02-22, 01:57
I just have one simple suggestion for you: write a custom split to replace FileSplit and include all your special configuration in this split. Then write a custom InputFormat that produces it.
During the map phase, you can get this split and, from it, all the special configuration.
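To make the suggestion concrete, here is a minimal sketch of what such a split might look like. It assumes the per-split configuration can be reduced to string key/value pairs, which may not hold for the objects described in the original question. The class name `ConfiguredSplit` and its fields are hypothetical. In a real job the class would extend `org.apache.hadoop.mapreduce.InputSplit` and implement `org.apache.hadoop.io.Writable`; the sketch only mirrors the `write`/`readFields` contract so that it runs with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical split that carries its own settings next to the file range. */
final class ConfiguredSplit {
    private String path = "";
    private long start;
    private long length;
    private final Map<String, String> settings = new TreeMap<>();

    /** No-argument constructor: the receiving side creates an empty instance first. */
    ConfiguredSplit() { }

    ConfiguredSplit(String path, long start, long length, Map<String, String> settings) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.settings.putAll(settings);
    }

    /** Same shape as Writable#write: the file range first, then the settings. */
    void write(DataOutput out) throws IOException {
        out.writeUTF(path);
        out.writeLong(start);
        out.writeLong(length);
        out.writeInt(settings.size());
        for (Map.Entry<String, String> entry : settings.entrySet()) {
            out.writeUTF(entry.getKey());
            out.writeUTF(entry.getValue());
        }
    }

    /** Same shape as Writable#readFields: reads fields in the order they were written. */
    void readFields(DataInput in) throws IOException {
        path = in.readUTF();
        start = in.readLong();
        length = in.readLong();
        settings.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String key = in.readUTF();
            String value = in.readUTF();
            settings.put(key, value);
        }
    }

    String getPath() { return path; }
    long getStart() { return start; }
    long getLength() { return length; }
    Map<String, String> getSettings() { return settings; }

    /** Simulates shipping the split to a task: encode to bytes, decode into a fresh object. */
    static ConfiguredSplit roundTrip(ConfiguredSplit split) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            split.write(new DataOutputStream(buffer));
            ConfiguredSplit copy = new ConfiguredSplit();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the new MapReduce API, a mapper would typically obtain the split through `context.getInputSplit()` and cast it to the custom type. Note that the configuration still has to be written field by field, so this approach relocates the serialization work rather than avoiding it.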
Re: MapReduce processing with extra (possibly non-serializable) configuration
Public Network Services 2013-02-22, 04:11
You mean save the serialized configuration object in the custom split file, retrieve that in the Mapper, reconstruct the configuration and use the rest of the split file (i.e., the actual data) as input to the map function?
Re: MapReduce processing with extra (possibly non-serializable) configuration
feng lu 2013-02-22, 01:55
Hi,

Maybe you can look at the usage of DistributedCache [0]. It's a facility provided by the MR framework to cache files (text, archives, jars, etc.) needed by applications.

[0] http://hadoop.apache.org/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html

-- Don't Grow Old, Grow Up... :-)
Re: MapReduce processing with extra (possibly non-serializable) configuration
Public Network Services 2013-02-22, 04:09
I have considered the DistributedCache and will probably be using it, but in order to have a file to cache I need to serialize the configuration object first. :-)
Re: MapReduce processing with extra (possibly non-serializable) configuration
Harsh J 2013-02-22, 06:15
How do you imagine sending "data" of any kind (be it in object form, etc.) over the network to other nodes without implementing or relying on a serialization for it? Are you looking for "easy" Java ways, such as the distributed cache from Hazelcast, where this may be taken care of for you automatically in some way? :)
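The point about serialization can be illustrated with plain Java serialization. The sketch below is only an illustration under stated assumptions: `LegacyParser` is a hypothetical stand-in for a proprietary class that is not `Serializable`, and the workaround shown (ship a small description, mark the heavy object `transient`, rebuild it on the receiving side) only applies when the object can be reconstructed from such a description.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SerializationCheck {
    /** Hypothetical proprietary class that is not Serializable and cannot be changed. */
    static final class LegacyParser {
        final String dialect;
        LegacyParser(String dialect) { this.dialect = dialect; }
    }

    /** Holds the legacy object directly, so Java serialization rejects the whole graph. */
    static final class NaiveConfig implements Serializable {
        private static final long serialVersionUID = 1L;
        final LegacyParser parser = new LegacyParser("csv");
    }

    /** Ships only the description and rebuilds the legacy object on first use. */
    static final class PortableConfig implements Serializable {
        private static final long serialVersionUID = 1L;
        final String dialect;
        private transient LegacyParser parser; // skipped by serialization

        PortableConfig(String dialect) { this.dialect = dialect; }

        LegacyParser parser() {
            if (parser == null) {
                parser = new LegacyParser(dialect); // rebuilt after deserialization
            }
            return parser;
        }
    }

    static byte[] toBytes(Object value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }

    /** Returns false when the object graph contains something non-serializable. */
    static boolean canSerialize(Object value) {
        try {
            toBytes(value);
            return true;
        } catch (IOException e) { // NotSerializableException is an IOException
            return false;
        }
    }

    /** Serializes and deserializes a PortableConfig, as a transfer between nodes would. */
    static PortableConfig roundTrip(PortableConfig config) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(toBytes(config)))) {
            return (PortableConfig) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here `canSerialize(new NaiveConfig())` fails because of the embedded `LegacyParser`, while a `PortableConfig` survives the round trip and recreates its parser lazily. Whether the real configuration objects can be described this compactly is exactly the open question in the thread.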
-- Harsh J
Re: MapReduce processing with extra (possibly non-serializable) configuration
Public Network Services 2013-02-22, 06:26
I am familiar with serialization solutions and have done quite some work in this area, but wanted to confirm that I need to follow that path.
Thanks for the advice! :-)

On Thu, Feb 21, 2013 at 10:24 PM, feng lu <[EMAIL PROTECTED]> wrote:
> Yes, you are right. First upload the serialized configuration file to HDFS, then retrieve that file in the Mapper#configure method of each Mapper and deserialize it into the configuration object.
>
> It seems that serializing the configuration is required. You can find many data serialization systems, such as Avro, protobuf, etc.
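The flow quoted above (write the serialized configuration once, read it back when each task starts) can be sketched roughly as follows. This is a JDK-only illustration with assumptions: a local temporary directory stands in for HDFS, `java.util.Properties` stands in for whatever serialization format is chosen (Avro, protobuf, or something else), and the class, method, and file names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

final class SideFileConfig {
    /** Driver side: write the settings for one split to a side file named after the split. */
    static Path publish(Path dir, String splitId, Properties settings) throws IOException {
        Path file = dir.resolve(splitId + ".properties");
        try (OutputStream out = Files.newOutputStream(file)) {
            settings.store(out, "per-split settings");
        }
        return file;
    }

    /** Task side: roughly what a mapper's configure/setup method would do before the first record. */
    static Properties load(Path dir, String splitId) throws IOException {
        Properties settings = new Properties();
        try (InputStream in = Files.newInputStream(dir.resolve(splitId + ".properties"))) {
            settings.load(in);
        }
        return settings;
    }

    /** Publishes one file and loads it back, using a temporary directory in place of HDFS. */
    static Properties demo() {
        try {
            Path dir = Files.createTempDirectory("split-config");
            Properties settings = new Properties();
            settings.setProperty("mode", "strict");
            publish(dir, "split-0", settings);
            return load(dir, "split-0");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In an actual job the file would be opened through Hadoop's `FileSystem` API or shipped with the DistributedCache, and the loading would happen in `Mapper#configure` (old API) or `Mapper#setup` (new API).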