Re: How to best decide mapper output/reducer input for a huge string?
One mapper is spawned per HBase table region. You can use the HBase admin UI
to find the number of regions per table. It might happen that all the
data is sitting in a single region, so a single mapper is spawned and you
are not getting enough parallel work done.

If that is the case, you can recreate the table with predefined splits
to create more regions.
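
For example, a minimal sketch using the HBase Java admin API (the table
name, column family, and split points are illustrative, not from your
setup):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTableDescriptor desc = new HTableDescriptor("mytable"); // illustrative
            desc.addFamily(new HColumnDescriptor("cf"));
            // Three split points -> four regions -> up to four concurrent mappers.
            byte[][] splits = new byte[][] {
                Bytes.toBytes("1000"), Bytes.toBytes("2000"), Bytes.toBytes("3000")
            };
            admin.createTable(desc, splits);
            admin.close();
        }
    }

Pick split points that match the distribution of your row keys; otherwise
most of the data will still land in one region.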

Thanks,
Rahul
On Sun, Sep 22, 2013 at 4:38 AM, John Lilley <[EMAIL PROTECTED]> wrote:

> Pavan,
>
> How large are the rows in HBase?  22 million rows is not very much but you
> mentioned “huge strings”.  Can you tell which part of the processing is the
> limiting factor (read from HBase, mapper output, reducers)?
>
> John
>
>
> From: Pavan Sudheendra [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, September 21, 2013 2:17 AM
> To: [EMAIL PROTECTED]
> Subject: Re: How to best decide mapper output/reducer input for a huge
> string?
>
>
> No, I don't have a combiner in place. Is it necessary? How do I make my
> map output compressed? Yes, the tables in HBase are compressed.
>
> Although there's no single obvious bottleneck, the time it takes to process
> the entire table is huge. I have to constantly check whether I can optimize
> it somehow.
>
> Oh okay, I'll implement a custom Writable. Apart from that, do you see
> anything wrong with my design? Does it require any kind of rework? Thank
> you so much for helping.
>
>
> On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <[EMAIL PROTECTED]>
> wrote:
>
> One thing that comes to mind is that your keys are Strings, which are
> highly inefficient. You might get a lot better performance if you write a
> custom Writable for your key object using the appropriate data types. For
> example, use a long (LongWritable) for timestamps. This should make
> (de)serialization a lot faster. If HouseHoldId is an integer, the speed of
> comparisons for sorting will also go up.
>
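Something like this minimal sketch, assuming the key is (HouseHoldId,
timestamp); the class and field names are illustrative, not from your code:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class HouseHoldKey implements WritableComparable<HouseHoldKey> {
        private int houseHoldId;
        private long timestamp;

        public HouseHoldKey() {} // no-arg constructor required by Hadoop

        public HouseHoldKey(int houseHoldId, long timestamp) {
            this.houseHoldId = houseHoldId;
            this.timestamp = timestamp;
        }

        public int getHouseHoldId() { return houseHoldId; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(houseHoldId); // four fixed bytes, not variable-length text
            out.writeLong(timestamp);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            houseHoldId = in.readInt();
            timestamp = in.readLong();
        }

        @Override
        public int compareTo(HouseHoldKey o) {
            // Plain integer comparisons for sorting, no string parsing.
            int cmp = Integer.compare(houseHoldId, o.houseHoldId);
            return cmp != 0 ? cmp : Long.compare(timestamp, o.timestamp);
        }

        @Override
        public int hashCode() {
            // Partition on HouseHoldId so all records for one id reach one reducer.
            return houseHoldId;
        }

        @Override
        public boolean equals(Object obj) {
            if (!(obj instanceof HouseHoldKey)) return false;
            HouseHoldKey o = (HouseHoldKey) obj;
            return houseHoldId == o.houseHoldId && timestamp == o.timestamp;
        }
    }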
>
> Ensure that your map outputs are being compressed. Are your tables in
> HBase compressed? Do you have a combiner?
>
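If they are not compressed yet, a minimal sketch of the driver settings
(Hadoop 2.x property names; the Snappy codec and the job name are
assumptions, and any codec installed on the cluster works):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class CompressedDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Compress intermediate map output; on Hadoop 1.x the keys are
            // "mapred.compress.map.output" and "mapred.map.output.compression.codec".
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);
            Job job = Job.getInstance(conf, "household-aggregation"); // illustrative
            // ... set mapper, reducer, input/output formats as usual, then:
            // System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }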
>
> Have you been able to profile your code to see where the bottlenecks are?
>
>
> On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <[EMAIL PROTECTED]>
> wrote:
>
> Hi Pradeep,
>
> Yes, basically I'm only writing the key part as the map output; the V of
> <K,V> is not of much use to me. But I'm hoping to change that if it leads
> to faster execution. I'm kind of a newbie, so I'm looking to make the
> map/reduce job run a lot faster.
>
> Also, yes, it gets sorted by the HouseHoldId, which is what I needed. But it
> seems that if I write a map output for each and every row of a 19 million
> row HBase table, it takes nearly a day to complete (21 mappers and 21
> reducers).
>
>
> I have looked at both Pig and Hive to do the job, but I'm supposed to do
> this via an MR job, so I cannot use either of them. Do you recommend that I
> try something if I have the data in that format?
>
>
> On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <[EMAIL PROTECTED]>
> wrote:
>
> I'm sorry, but I don't understand your question. Is the output of the
> mapper you're describing the key portion? If it is the key, then your data
> should already be sorted by HouseHoldId, since it occurs first in your key.
>
>
> The SortComparator tells Hadoop how to sort your data, so you use it when
> you need a non-lexical sort order. The GroupingComparator tells Hadoop how
> to group your data for the reducer; all KV pairs in the same group will be
> given to the same reduce call.
>
> If your reduce computation needs all the KV pairs for the same
> HouseHoldId, then you will need to write a GroupingComparator.
>
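A minimal sketch, reusing the illustrative HouseHoldKey from earlier so that
the timestamp still controls sort order while grouping ignores it:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class HouseHoldGroupingComparator extends WritableComparator {
        protected HouseHoldGroupingComparator() {
            super(HouseHoldKey.class, true); // true: instantiate keys for comparing
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            // Group on HouseHoldId only; the full key still decides sort order.
            return Integer.compare(((HouseHoldKey) a).getHouseHoldId(),
                                   ((HouseHoldKey) b).getHouseHoldId());
        }
    }

Register it in the driver with
job.setGroupingComparatorClass(HouseHoldGroupingComparator.class).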
>
> Also, have you considered using a higher-level abstraction on Hadoop such
> as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT