Re: How to best decide mapper output/reducer input for a huge string?
Pavan Sudheendra 2013-09-21, 08:17
No, I don't have a combiner in place. Is it necessary? How do I make my map
output compressed? Yes, the tables in HBase are compressed.

Although there's no real bottleneck, the time it takes to process the
entire table is huge. I have to constantly check whether I can optimize it
somehow.

Oh okay, I'll implement a custom Writable. Apart from that, do you see
anything wrong with my design? Does it require any kind of rework? Thank
you so much for helping.
On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:

> One thing that comes to mind is that your keys are Strings, which is
> highly inefficient. You might get a lot better performance if you write a
> custom Writable for your key object using the appropriate data types. For
> example, use a long (LongWritable) for timestamps. This should make
> (de)serialization a lot faster. If HouseHoldId is an integer, comparisons
> during sorting will also be faster.
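>
> A minimal sketch of what such a key could look like (the class name
> HouseholdKey, the field names, and the int/long types are just assumptions
> about your actual schema):
>
> import java.io.DataInput;
> import java.io.DataOutput;
> import java.io.IOException;
> import org.apache.hadoop.io.WritableComparable;
>
> // Composite key: sorts by HouseHoldId first, then by timestamp.
> public class HouseholdKey implements WritableComparable<HouseholdKey> {
>     private int houseHoldId;
>     private long timestamp;
>
>     public HouseholdKey() {}  // no-arg constructor required by Hadoop
>
>     public void set(int houseHoldId, long timestamp) {
>         this.houseHoldId = houseHoldId;
>         this.timestamp = timestamp;
>     }
>
>     public int getHouseHoldId() {
>         return houseHoldId;
>     }
>
>     @Override
>     public void write(DataOutput out) throws IOException {
>         out.writeInt(houseHoldId);
>         out.writeLong(timestamp);
>     }
>
>     @Override
>     public void readFields(DataInput in) throws IOException {
>         houseHoldId = in.readInt();
>         timestamp = in.readLong();
>     }
>
>     @Override
>     public int compareTo(HouseholdKey other) {
>         if (houseHoldId != other.houseHoldId) {
>             return houseHoldId < other.houseHoldId ? -1 : 1;
>         }
>         if (timestamp != other.timestamp) {
>             return timestamp < other.timestamp ? -1 : 1;
>         }
>         return 0;
>     }
> }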
>
> Ensure that your map outputs are being compressed. Are your tables in
> HBase compressed? Do you have a combiner?
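>
> Enabling map output compression in the job driver looks roughly like this
> (these are the old-style property names; Snappy assumes the native codec
> is installed on your cluster):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.io.compress.CompressionCodec;
> import org.apache.hadoop.io.compress.SnappyCodec;
>
> Configuration conf = HBaseConfiguration.create();
> // Compress intermediate (map) output to cut shuffle I/O.
> conf.setBoolean("mapred.compress.map.output", true);
> conf.setClass("mapred.map.output.compression.codec",
>               SnappyCodec.class, CompressionCodec.class);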
>
> Have you been able to profile your code to see where the bottlenecks are?
>
>
> On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <[EMAIL PROTECTED]> wrote:
>
>> Hi Pradeep,
>> Yes. Basically I'm only writing the key part as the map output; the V
>> of <K,V> is not of much use to me. But I'm hoping to change that if it
>> leads to faster execution. I'm kind of a newbie, so I'm looking to make the
>> map/reduce job run a lot faster.
>>
>> Also, yes, it gets sorted by the HouseHoldID, which is what I needed. But
>> it seems that if I write a map output for each and every row of a 19 million
>> row HBase table, it takes nearly a day to complete (21 mappers and 21 reducers).
>>
>> I have looked at both Pig and Hive to do the job, but I'm supposed to do
>> this via an MR job, so I cannot use either of those. Do you recommend I try
>> something else given that I have the data in that format?
>>
>>
>> On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
>>
>>> I'm sorry but I don't understand your question. Is the output of the
>>> mapper you're describing the key portion? If it is the key, then your data
>>> should already be sorted by HouseHoldId since it occurs first in your key.
>>>
>>> The SortComparator will tell Hadoop how to sort your data, so you use
>>> this if you need a non-lexical sort order. The
>>> GroupingComparator will tell Hadoop how to group your data for the reducer:
>>> all KV pairs in the same group will be passed to a single reduce() call.
>>>
>>> If your reduce computation needs all the KV-pairs for the same
>>> HouseHoldId, then you will need to write a GroupingComparator.
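>>>
>>> A rough sketch of such a GroupingComparator, assuming a hypothetical
>>> composite key class HouseholdKey that exposes a getHouseHoldId() accessor:
>>>
>>> import org.apache.hadoop.io.WritableComparable;
>>> import org.apache.hadoop.io.WritableComparator;
>>>
>>> // Groups records by HouseHoldId only, so a single reduce() call sees all
>>> // KV pairs for one household regardless of the rest of the key.
>>> public class HouseholdGroupingComparator extends WritableComparator {
>>>     protected HouseholdGroupingComparator() {
>>>         super(HouseholdKey.class, true);  // true => instantiate keys for comparison
>>>     }
>>>
>>>     @Override
>>>     public int compare(WritableComparable a, WritableComparable b) {
>>>         int id1 = ((HouseholdKey) a).getHouseHoldId();
>>>         int id2 = ((HouseholdKey) b).getHouseHoldId();
>>>         return id1 == id2 ? 0 : (id1 < id2 ? -1 : 1);
>>>     }
>>> }
>>>
>>> // Registered in the driver with:
>>> // job.setGroupingComparatorClass(HouseholdGroupingComparator.class);
>>> // A Partitioner that partitions on HouseHoldId only is also needed so that
>>> // all records for one household end up on the same reducer.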
>>>
>>> Also, have you considered using a higher-level abstraction on Hadoop,
>>> such as Pig, Hive, Cascading, etc.? Sorting/grouping tasks like these are
>>> a LOT easier to write with those tools.
>>>
>>> Hope this helps!
>>> - Pradeep
>>>
>>>
>>> On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <[EMAIL PROTECTED]> wrote:
>>>
>>>> I need to improve my MR job, which uses HBase as both source and
>>>> sink.
>>>>
>>>> Basically, I'm reading data from 3 HBase tables in the mapper, writing
>>>> them out as one huge string for the reducer to do some computation on and
>>>> dump into an HBase table.
>>>>
>>>> Table1 ~ 19 million rows
>>>> Table2 ~ 2 million rows
>>>> Table3 ~ 900,000 rows
>>>>
>>>> The output of the mapper is something like this:
>>>>
>>>> HouseHoldId contentID name duration genre type channelId personId televisionID timestamp
>>>>
>>>> I'm interested in sorting it on the basis of the HouseHoldID value, so
>>>> I'm using this technique. I'm not interested in the V part of the pair, so
>>>> I'm kind of ignoring it. My mapper class is defined as follows:
>>>>
>>>> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }
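>>>>
>>>> A stripped-down illustration of that mapper (the column family and
>>>> qualifier names here are placeholders, not the real schema):
>>>>
>>>> import java.io.IOException;
>>>> import org.apache.hadoop.hbase.client.Result;
>>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>>>> import org.apache.hadoop.hbase.mapreduce.TableMapper;
>>>> import org.apache.hadoop.hbase.util.Bytes;
>>>> import org.apache.hadoop.io.IntWritable;
>>>> import org.apache.hadoop.io.Text;
>>>>
>>>> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> {
>>>>     private static final IntWritable ONE = new IntWritable(1);
>>>>     private final Text outKey = new Text();
>>>>
>>>>     @Override
>>>>     protected void map(ImmutableBytesWritable row, Result value, Context context)
>>>>             throws IOException, InterruptedException {
>>>>         // Build the huge string key, HouseHoldId first.
>>>>         String houseHoldId = Bytes.toString(
>>>>                 value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("HouseHoldId")));
>>>>         String contentId = Bytes.toString(
>>>>                 value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("contentID")));
>>>>         // ... remaining fields (name, duration, genre, ...) are read the same way ...
>>>>         outKey.set(houseHoldId + "\t" + contentId /* + "\t" + other fields */);
>>>>         context.write(outKey, ONE);
>>>>     }
>>>> }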
>>>>
>>>> My MR job takes 22 hours to complete, which is
>>>> not desirable at all. I'm supposed to optimize this somehow to run a lot faster.
Regards-
Pavan