Re: Reading in parallel from table's regions in MapReduce
Hi Ioakim:

Here is a list of links I would suggest you read (I know it is a lot to
read) - there is a short code sketch after the list:
HBase Related:
- http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html
- http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
- make sure to read the examples: http://hbase.apache.org/book/mapreduce.example.html

Hadoop Related:
- http://wiki.apache.org/hadoop/JobTracker
- http://wiki.apache.org/hadoop/TaskTracker
- http://hadoop.apache.org/common/docs/r1.0.3/mapred_tutorial.html
- Some configuration: http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html
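
To make the TableMapReduceUtil part concrete, here is a minimal map-only
scan job, roughly following the examples above (the table name "mytable",
the caching value, and the empty mapper body are all placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanTableJob {

  // Each mapper gets one region's split and receives its rows one by one.
  static class MyMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context) {
      // process one row here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan-mytable");
    job.setJarByClass(ScanTableJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // rows per RPC; 500 is just an example
    scan.setCacheBlocks(false);  // recommended for MR scans over a table

    // TableInputFormat will create one split (one mapper) per region.
    TableMapReduceUtil.initTableMapperJob("mytable", scan, MyMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}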

HTH,

Jerry
On Tue, Sep 4, 2012 at 12:41 PM, Michael Segel <[EMAIL PROTECTED]> wrote:

> I think the issue is that you are misinterpreting what you are seeing and
> what Doug was trying to tell you...
>
> The short, simple answer is that you're getting one split per region. Each
> split is assigned to a specific mapper task, and that task will sequentially
> walk through its region, finding the rows that match your scan request.
>
> There is no lock or blocking.
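>
> If you want to sanity-check that, count the regions. TableInputFormat
> hands out one split per region, so the region count is the upper bound on
> concurrent mappers. A rough sketch (the table name is a placeholder):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.util.Pair;
>
> public class RegionCount {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     HTable table = new HTable(conf, "mytable"); // placeholder table name
>     // One start/end key pair per region; splits (and mappers) match this.
>     Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
>     System.out.println("regions (= map splits): " + keys.getFirst().length);
>     table.close();
>   }
> }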
>
> I think you really should read Lars George's book on HBase to get a better
> understanding.
>
> HTH
>
> -Mike
>
> On Sep 4, 2012, at 11:29 AM, Ioakim Perros <[EMAIL PROTECTED]> wrote:
>
> > Thank you very much for your response and for the excellent reference.
> >
> > The thing is that I am running jobs in a distributed environment, and
> > beyond the TableMapReduceUtil settings, I have just set the scan's
> > caching to the number of rows I expect to retrieve at each map task, and
> > the scan's cache-blocks feature to false (just as indicated in the
> > MapReduce examples on HBase's homepage).
> >
> > I am not aware of such a job configuration (requesting the jobtracker to
> > execute more than 1 map task concurrently). Any other ideas?
> >
> > Thank you again and regards,
> > ioakim
> >
> >
> > On 09/04/2012 06:59 PM, Jerry Lam wrote:
> >> Hi Ioakim:
> >>
> >> Sorry, your hypothesis doesn't make sense. I would suggest you read the
> >> "Learning HBase Internals" slides by Lars Hofhansl at
> >> http://www.slideshare.net/cloudera/3-learning-h-base-internals-lars-hofhansl-salesforce-final
> >> to understand how HBase locking works.
> >>
> >> Regarding the issue you are facing, are you sure you configured the job
> >> properly (i.e., requested the jobtracker to run more than 1 mapper)? If
> >> you are testing on a single machine, you probably need to configure the
> >> number of map slots per tasktracker as well to see more than 1 mapper
> >> executing on a single machine.
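> >>
> >> In Hadoop 1.x that bound is the per-tasktracker map slot count, set in
> >> mapred-site.xml - something like this (the value 4 is just an example,
> >> and the tasktracker has to be restarted for it to take effect):
> >>
> >> <property>
> >>   <name>mapred.tasktracker.map.tasks.maximum</name>
> >>   <value>4</value>
> >> </property>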
> >>
> >> my $0.02
> >>
> >> Jerry
> >>
> >> On Tue, Sep 4, 2012 at 11:17 AM, Ioakim Perros <[EMAIL PROTECTED]> wrote:
> >>
> >>> Hello,
> >>>
> >>> I would be grateful if someone could shed some light on the following:
> >>>
> >>> Each M/R map task is reading data from a separate region of a table.
> >>> From the jobtracker's GUI, at the map completion graph, I notice that
> >>> although the data read by each mapper is different, the mappers read
> >>> data sequentially - as if the table had a lock that permits only one
> >>> mapper at a time to read data from a region.
> >>>
> >>> Does this "lock" hypothesis make sense? Is there any way I could avoid
> >>> this needless delay?
> >>>
> >>> Thanks in advance and regards,
> >>> Ioakim
> >>>
> >
> >
>
>