Re: How to best decide mapper output/reducer input for a huge string?
@Rahul, yes, you are right: 21 mappers are spawned, and all 21 are running
at the same time. @Pradeep, I should enable compression like you say; I'll
give it a shot. As far as I can see, I think I'll also need to implement
Writable and write out the mapper key using specific data types instead of
writing it out as a string, which might be slowing the operation down.
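To make the Writable idea concrete, here is a minimal sketch of a typed key. It uses only the JDK's DataInput/DataOutput so it runs without Hadoop on the classpath; in a real job the class would additionally implement org.apache.hadoop.io.WritableComparable, whose write/readFields methods have the same shape. The class name TypedKey and the fields userId and categoryId are made up for illustration and are not taken from the actual job.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical composite key. In a real job this would also declare
// "implements org.apache.hadoop.io.WritableComparable<TypedKey>".
public class TypedKey implements Comparable<TypedKey> {
    private long userId;     // hypothetical field
    private int categoryId;  // hypothetical field

    public TypedKey() {}     // Hadoop instantiates keys through a no-arg constructor

    public TypedKey(long userId, int categoryId) {
        this.userId = userId;
        this.categoryId = categoryId;
    }

    // Same shape as Writable.write: fixed-width binary fields, no string formatting.
    public void write(DataOutput out) throws IOException {
        out.writeLong(userId);
        out.writeInt(categoryId);
    }

    // Same shape as Writable.readFields: read fields back in the order written.
    public void readFields(DataInput in) throws IOException {
        userId = in.readLong();
        categoryId = in.readInt();
    }

    @Override
    public int compareTo(TypedKey other) {
        int byUser = Long.compare(userId, other.userId);
        return byUser != 0 ? byUser : Integer.compare(categoryId, other.categoryId);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TypedKey)) return false;
        TypedKey k = (TypedKey) o;
        return userId == k.userId && categoryId == k.categoryId;
    }

    // A stable hashCode matters because the default partitioner hashes the key.
    @Override
    public int hashCode() {
        return 31 * Long.hashCode(userId) + categoryId;
    }

    // Helper for this sketch: serialize a key to a byte array.
    public static byte[] toBytes(TypedKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for this sketch: rebuild a key from its serialized bytes.
    public static TypedKey fromBytes(byte[] data) {
        try {
            TypedKey key = new TypedKey();
            key.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return key;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TypedKey key = new TypedKey(1234567890123L, 42);
        byte[] binary = toBytes(key);
        byte[] text = "1234567890123,42".getBytes(StandardCharsets.UTF_8);
        System.out.println("binary key bytes: " + binary.length);
        System.out.println("string key bytes: " + text.length);
        System.out.println("round trip ok: " + key.equals(fromBytes(binary)));
    }
}
```

The saving per key is small here, but the typed form also avoids parsing the string back apart in the reducer, and the gap grows if the string version carries long textual fields that could be stored as numbers.
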
On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:

> Pavan,
> It's hard to tell whether there's anything wrong with your design or not
> since you haven't given us specific enough details. The best thing you can
> do is instrument your code and see what is taking a long time. Rahul
> mentioned a problem that I myself have seen before, with only one region
> (or a couple) having any data. So even if you have 21 regions, only one
> mapper might be doing the heavy lifting.
> A combiner is hugely helpful in terms of reducing the data output of
> mappers. Writing a combiner is a best practice and you should almost always
> have one. Compression can be turned on by setting the following properties
> in your job config.
>  <property>
>      <name>mapreduce.map.output.compress</name>
>      <value>true</value>
>  </property>
>  <property>
>      <name>mapreduce.map.output.compress.codec</name>
>      <value>org.apache.hadoop.io.compress.GzipCodec</value>
>  </property>
> You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc.,
> depending on your use case. Gzip is slow but gets good compression
> ratios. Snappy/Lzo are a lot faster but don't compress as well.
> If your computations are CPU bound, then you'd probably
> want to use Snappy/Lzo. If your computations are I/O bound, and your CPUs
> are idle, you can use Gzip. You'll have to experiment and find the best
> settings for you. There are a lot of other tweaks that you can try to get
> the best performance out of your cluster.
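One rough way to get a feel for the speed versus ratio trade-off before touching the cluster is to compress a sample of map output locally. The sketch below does not use the Hadoop codecs at all: it runs the JDK's Deflater at its fastest and its strongest level as a stand-in for "fast codec" versus "Gzip-like codec". The sample records are invented; a dump of real mapper output would be the thing to feed in.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class CompressionProbe {
    // Compress the input at the given Deflater level and return the compressed size in bytes.
    public static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer); // output is discarded, only its length is counted
        }
        deflater.end();
        return total;
    }

    // Invented stand-in for map output: tab-separated key/value lines with repetition.
    public static byte[] sampleMapOutput(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("row-").append(i % 1000).append("\tsome repeated value text\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleMapOutput(100_000);
        int[] levels = {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION};
        for (int level : levels) {
            long start = System.nanoTime();
            int size = compressedSize(data, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level %d: %d -> %d bytes in %d us%n",
                    level, data.length, size, micros);
        }
    }
}
```

The timings from a single run like this are noisy and say nothing about Snappy or Lzo specifically; they only indicate whether the data is compressible enough for map output compression to be worth the CPU.
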
> One of the best things you can do is to install Ganglia (or some other
> similar tool) on your cluster and monitor usage of resources while your job
> is running. This will tell you if your job is I/O bound or CPU bound.
> Take a look at this paper by Intel about optimizing your Hadoop cluster
> and see if that fits your deployment.
> http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf
> If your cluster is already optimized and your job is not I/O bound, then
> there might be a problem with your algorithm, which might warrant a redesign.
> Hope this helps!
> - Pradeep
> On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee <
>> One mapper is spawned per HBase table region. You can use the HBase admin
>> UI to find the number of regions per table. It might happen that all the
>> data is sitting in a single region, so a single mapper is spawned and you
>> are not getting enough work done in parallel.
>> If that is the case then you can recreate the tables with predefined
>> splits to create more regions.
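As a sketch of what predefined splits could look like: the helper below computes evenly spaced one-byte split points, assuming (and this is only an assumption) that row keys are spread roughly uniformly over their first byte, for example because they are hashed or salted. The result is a byte[][], which is the form HBase's admin createTable overload with split keys accepts; the class and method names here are hypothetical.

```java
public class SplitKeys {
    // Returns (regions - 1) split points evenly spaced over first-byte values 0x00..0xFF.
    public static byte[][] evenSplits(int regions) {
        if (regions < 2 || regions > 256) {
            throw new IllegalArgumentException("regions must be between 2 and 256");
        }
        byte[][] splits = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
            // Boundary between region i-1 and region i, as a single-byte key prefix.
            splits[i - 1] = new byte[] {(byte) (i * 256 / regions)};
        }
        return splits;
    }

    public static void main(String[] args) {
        for (byte[] split : evenSplits(4)) {
            System.out.println(String.format("%02x", split[0] & 0xff));
        }
    }
}
```

If the real keys are sequential or share a common prefix, evenly spaced byte boundaries will not balance the regions, and the split points would have to be chosen from the actual key distribution instead.
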
>> Thanks,
>> Rahul
>> On Sun, Sep 22, 2013 at 4:38 AM, John Lilley <[EMAIL PROTECTED]> wrote:
>>> Pavan,
>>> How large are the rows in HBase?  22 million rows is not very much but
>>> you mentioned “huge strings”.  Can you tell which part of the processing is
>>> the limiting factor (read from HBase, mapper output, reducers)?
>>> John
>>>
>>> *From:* Pavan Sudheendra [mailto:[EMAIL PROTECTED]]
>>> *Sent:* Saturday, September 21, 2013 2:17 AM
>>> *Subject:* Re: How to best decide mapper output/reducer input for a
>>> huge string?
>>>
>>> No, I don't have a combiner in place. Is it necessary? How do I make my
>>> map output compressed? Yes, the tables in HBase are compressed.
>>> Although there's no real bottleneck, the time it takes to process the
>>> entire table is huge. I have to constantly check if I can optimize it