Shahab Yunus 2013-08-31, 02:42
-Re: Job config before read fields
Adrian CAPDEFIER 2013-09-09, 18:00
Sorry about the late reply; a personal matter came up and took most of
my time. Thank you for your replies.
The solution I chose was to temporarily transfer the metadata along with
the data and then restore it on the reduce nodes. This works from a
functional perspective as long as there are no performance requirements and
it will have to do for now.
The permanent solution will likely involve tweaking Hadoop, but that is a
different kettle of fish.
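
A minimal sketch of the interim approach described above (shipping the metadata with every record and restoring it before the data is read), using only java.io rather than the Hadoop Writable API. The class name SelfDescribingRecord and the choice of "field widths" as the metadata are assumptions made purely for illustration; the actual types are not shown in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record whose data can only be decoded with its metadata.
final class SelfDescribingRecord {

    // Writes the metadata (the byte width of each field) ahead of the data.
    static byte[] serialize(int[] fieldWidths, byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(fieldWidths.length);
            for (int width : fieldWidths) {
                out.writeInt(width);
            }
            out.write(data);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }

    // Restores the metadata first, then uses it to split the data into fields.
    static byte[][] deserialize(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int[] widths = new int[in.readInt()];
            for (int i = 0; i < widths.length; i++) {
                widths[i] = in.readInt();
            }
            byte[][] fields = new byte[widths.length][];
            for (int i = 0; i < widths.length; i++) {
                fields[i] = new byte[widths[i]];
                in.readFully(fields[i]);
            }
            return fields;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        byte[] wire = serialize(new int[] {1, 2}, new byte[] {10, 20, 30});
        byte[][] fields = deserialize(wire);
        System.out.println(wire.length + " bytes on the wire, " + fields.length + " fields");
    }
}
```

In this toy case the metadata takes 12 bytes and the data only 3, which is the kind of overhead that makes this acceptable only while there are no performance requirements.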
On Sun, Sep 1, 2013 at 12:48 AM, Shahab Yunus <[EMAIL PROTECTED]> wrote:
> Personally, I don't know a way to access job configuration parameters in
> custom implementations of Writables (at least not an elegant and
> appropriate one; of course, hacks of various kinds can be done). Maybe the
> experts can chime in?
> One idea that I thought about was to use MapWritable (if you have not
> explored it already). You can encode the 'custom metadata' for your 'data'
> as one-byte symbols and move your data through the M/R flow as a map. Then,
> during deserialization, you will have the type (or your 'custom metadata')
> in the key part of the map, and the value would be your actual data. This
> aligns with the efficient approach that is used natively in Hadoop for
> Strings/Text, i.e. compact metadata (though I agree that you are not taking
> advantage of the other aspect of non-dependence between metadata and the
> data it defines).
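
The one-byte-tag idea in the paragraph above can be sketched roughly as follows. This is not MapWritable itself, only a plain java.io illustration of a map whose keys are single-byte type tags and whose values are the raw data; the class name TaggedRecord and the two tag constants are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical map of one-byte metadata tags to serialized data payloads.
final class TaggedRecord {
    static final byte TAG_SCHEMA_A = 1; // assumed tag values, for illustration only
    static final byte TAG_SCHEMA_B = 2;

    // Layout: entry count, then (tag byte, payload length, payload bytes) per entry.
    static byte[] serialize(Map<Byte, byte[]> entries) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(entries.size());
            for (Map.Entry<Byte, byte[]> entry : entries.entrySet()) {
                out.writeByte(entry.getKey());
                out.writeInt(entry.getValue().length);
                out.write(entry.getValue());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }

    // The tag is read before the payload, so the reader knows which
    // metadata applies before it touches the data.
    static Map<Byte, byte[]> deserialize(byte[] bytes) {
        Map<Byte, byte[]> entries = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                byte tag = in.readByte();
                byte[] payload = new byte[in.readInt()];
                in.readFully(payload);
                entries.put(tag, payload);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return entries;
    }

    public static void main(String[] args) {
        Map<Byte, byte[]> record = new LinkedHashMap<>();
        record.put(TAG_SCHEMA_A, new byte[] {1, 2, 3});
        Map<Byte, byte[]> restored = deserialize(serialize(record));
        System.out.println("payload length: " + restored.get(TAG_SCHEMA_A).length);
    }
}
```

A single-byte tag keeps the per-record overhead small, at the cost of limiting the number of distinct metadata values to what fits in a byte.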
> Take a look at that:
> Page 96 of the Definitive Guide:
> and then this:
> and add your own custom types here (note that you are restricted by size
> of byte):
> On Sat, Aug 31, 2013 at 5:38 AM, Adrian CAPDEFIER <[EMAIL PROTECTED]> wrote:
>> Thank you for your help Shahab.
>> I guess I wasn't being too clear. My logic is that I use a custom type as
>> key and in order to deserialize it on the compute nodes, I need an extra
>> piece of information (also a custom type).
>> To use an analogy, a Text is serialized by writing the length of the
>> string as a number and then the bytes that compose the actual string. When
>> it is deserialized, the number informs the reader when to stop reading the
>> string. This number varies from string to string and it is compact, so it
>> makes sense to serialize it with the string.
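
The length-prefix analogy above can be illustrated with a small round trip. This is a simplified sketch, not Hadoop's Text implementation: it assumes a fixed four-byte length written with java.io.DataOutputStream, and the class name LengthPrefixedString is made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing per-value metadata (the length) stored with the value.
final class LengthPrefixedString {

    // Writes the byte length first, then the UTF-8 bytes of the string.
    static byte[] serialize(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(utf8.length);
            out.write(utf8);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }

    // Reads the length, which tells the reader exactly how many bytes to consume.
    static String deserialize(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            byte[] utf8 = new byte[in.readInt()];
            in.readFully(utf8);
            return new String(utf8, StandardCharsets.UTF_8);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        byte[] wire = serialize("hadoop");
        System.out.println(wire.length + " bytes -> " + deserialize(wire));
    }
}
```

Here the metadata is small and different for every value, so storing it next to the value costs little.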
>> My use case is similar to it. I have a complex type (let's call this
>> data), and in order to deserialize it, I need another complex type (let's
>> call this second type metadata). The metadata is not closely tied to the
>> data (i.e. if the data value changes, the metadata does not) and the
>> metadata size is quite large.
>> I ruled out a couple of options, but please let me know if you think I
>> did so for the wrong reasons:
>> 1. I could serialize each data value with its own metadata value, but
>> since the data value count is in the tens of millions and the metadata
>> value distinct count can be up to one hundred, it would waste resources in
>> the system.
>> 2. I could serialize the metadata and then the data as a collection
>> property of the metadata. This would be an elegant solution code-wise, but
>> then all the data would have to be read and kept in memory as a massive
>> object before any reduce operations can happen. I wasn't able to find any
>> info on this online so this is just a guess from peeking at the hadoop code.
>> My "solution" was to serialize the data with a hash of the metadata and
>> separately serialize the metadata and its hash in the job configuration (as
>> key/value pairs). For this to work, I would need to be able to deserialize