Re: Job config before read fields
Adrian CAPDEFIER 2013-09-09, 18:00
Hi Shahab,

Sorry about the late reply; a personal matter came up and took most of my
time. Thank you for your replies.

The solution I chose was to temporarily transfer the metadata along with
the data and then restore it on the reduce nodes. This works from a
functional perspective as long as there are no performance requirements,
and it will have to do for now.
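
For illustration, a minimal sketch of what "transferring the metadata along
with the data" could look like; RecordWithMetadata is a made-up name and
Text stands in for the real custom types:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Hypothetical sketch: ship the metadata with every record so it can be
    // restored on the reduce side. Text stands in for the real custom types.
    public class RecordWithMetadata implements Writable {

        private Text metadata = new Text();
        private Text data = new Text();

        @Override
        public void write(DataOutput out) throws IOException {
            metadata.write(out);   // metadata travels first...
            data.write(out);       // ...so readFields() can restore it before the data
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            metadata.readFields(in);
            data.readFields(in);
        }
    }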

The permanent solution will likely involve tweaking Hadoop, but that is a
different kettle of fish.
On Sun, Sep 1, 2013 at 12:48 AM, Shahab Yunus <[EMAIL PROTECTED]> wrote:

> Personally, I don't know a way to access job configuration parameters in
> a custom implementation of Writables (at least not an elegant and
> appropriate one; of course, hacks of various kinds can be done). Maybe
> experts can chime in?
>
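
One hack along those lines, sketched under the assumption that Hadoop's
default WritableSerialization creates key/value instances through
ReflectionUtils.newInstance (which calls setConf() on anything implementing
Configurable), would be to make the Writable itself Configurable;
ConfAwareWritable is a made-up name:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Writable;

    // Sketch of one such hack: a Writable that also implements Configurable.
    // It relies on the framework instantiating the class via
    // ReflectionUtils.newInstance(clazz, conf), which calls setConf() before
    // readFields() is ever invoked. Caveat: this only applies when a fresh
    // instance is created by the framework, not when an object is reused.
    public class ConfAwareWritable implements Writable, Configurable {

        private Configuration conf;

        @Override
        public void setConf(Configuration conf) {
            this.conf = conf;   // job configuration becomes visible here
        }

        @Override
        public Configuration getConf() {
            return conf;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            // serialize the record itself; the configuration is not written out
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // conf (and anything stored in it, e.g. serialized metadata) can
            // be consulted here to decide how to deserialize the record
        }
    }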
> One idea that I thought about was to use MapWritable (if you have not
> explored it already). You can encode the 'custom metadata' for your 'data'
> as one-byte symbols and move your data through the M/R flow as a map. Then,
> during deserialization, you will have the type (or your 'custom metadata')
> in the key part of the map and the value would be your actual data. This
> aligns with the efficient approach that is used natively in Hadoop for
> Strings/Text, i.e. compact metadata (though I agree that you are not taking
> advantage of the other aspect, non-dependence between the metadata and the
> data it defines).
>
> Take a look at this:
> Page 96 of the Definitive Guide:
>
> http://books.google.com/books?id=Nff49D7vnJcC&pg=PA96&lpg=PA96&dq=mapwritable+in+hadoop&source=bl&ots=IiixYu7vXu&sig=4V6H7cY-MrNT7Rzs3WlODsDOoP4&hl=en&sa=X&ei=aX4iUp2YGoaosASs_YCACQ&sqi=2&ved=0CFUQ6AEwBA#v=onepage&q=mapwritable%20in%20hadoop&f=false
>
> and then this:
>
> http://www.chrisstucchio.com/blog/2011/mapwritable_sometimes_a_performance_hog.html
>
> and add your own custom types here (note that you are restricted by the
> size of a byte):
>
> http://hadoop.sourcearchive.com/documentation/0.20.2plus-pdfsg1-1/AbstractMapWritable_8java-source.html
>
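
For illustration, a minimal sketch of the MapWritable idea above; the
one-byte symbol and the MetadataTaggedRecord class are made up:

    import org.apache.hadoop.io.ByteWritable;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.MapWritable;

    // Hypothetical sketch: tag the payload with a compact one-byte metadata
    // symbol. AbstractMapWritable keeps its class-to-id mapping in a byte,
    // hence the "restricted by the size of a byte" remark above.
    public class MetadataTaggedRecord {

        private static final byte TYPE_CUSTOMER = 0x01;  // made-up symbol

        public static MapWritable tag(byte[] payload) {
            MapWritable map = new MapWritable();
            // key = one-byte metadata symbol, value = the actual data
            map.put(new ByteWritable(TYPE_CUSTOMER), new BytesWritable(payload));
            return map;
        }
    }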
> Regards,
> Shahab
>
>
> On Sat, Aug 31, 2013 at 5:38 AM, Adrian CAPDEFIER <[EMAIL PROTECTED]> wrote:
>
>> Thank you for your help Shahab.
>>
>> I guess I wasn't being too clear. My logic is that I use a custom type as
>> key and in order to deserialize it on the compute nodes, I need an extra
>> piece of information (also a custom type).
>>
>> To use an analogy, a Text is serialized by writing the length of the
>> string as a number and then the bytes that compose the actual string. When
>> it is deserialized, the number tells the reader when to stop reading the
>> string. This number varies from string to string and it is compact, so it
>> makes sense to serialize it with the string.
>>
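
For illustration, the length-prefix scheme could be sketched like this (a
simplified stand-in, not Text's actual implementation; LengthPrefixedString
and its helpers are hypothetical):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableUtils;

    // Write the byte length first, then the bytes, so the reader knows
    // exactly where the string ends.
    public class LengthPrefixedString {

        public static void writeString(DataOutput out, String s) throws IOException {
            byte[] bytes = s.getBytes("UTF-8");
            WritableUtils.writeVInt(out, bytes.length);  // compact, varies per string
            out.write(bytes);
        }

        public static String readString(DataInput in) throws IOException {
            int length = WritableUtils.readVInt(in);     // tells the reader when to stop
            byte[] bytes = new byte[length];
            in.readFully(bytes);
            return new String(bytes, "UTF-8");
        }
    }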
>> My use case is similar to it. I have a complex type (let's call this
>> data), and in order to deserialize it, I need another complex type (let's
>> call this second type metadata). The metadata is not closely tied to the
>> data (i.e. if the data value changes, the metadata does not) and the
>> metadata size is quite large.
>>
>> I ruled out a couple of options, but please let me know if you think I
>> did so for the wrong reasons:
>> 1. I could serialize each data value with its own metadata value, but
>> since the data value count is in the tens of millions or more and the
>> distinct metadata value count can be up to one hundred, it would waste
>> resources in the system.
>> 2. I could serialize the metadata and then the data as a collection
>> property of the metadata. This would be an elegant solution code-wise, but
>> then all the data would have to be read and kept in memory as a massive
>> object before any reduce operations could happen. I wasn't able to find any
>> info on this online, so this is just a guess from peeking at the Hadoop code.
>>
>> My "solution" was to serialize the data with a hash of the metadata and
>> separately serialize the metadata and its hash in the job configuration (as
>> key/value pairs). For this to work, I would need to be able to deserialize