Re: Reading in parallel from table's regions in MapReduce
Jerry Lam 2012-09-04, 17:05
Here is a list of links I would suggest you read (I know it is a lot to
- make sure to read the examples:
- Some Configurations:
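
To make the point from the quoted thread below concrete, here is a small plain-Java sketch. It is not HBase or Hadoop code: the class MapSlotSketch and its methods waves and peakConcurrency are hypothetical names, a "split" is simply one simulated task per region, and a fixed thread pool stands in for the map slots the cluster offers (in Hadoop 1.x the per-tasktracker limit is mapred.tasktracker.map.tasks.maximum). Under those assumptions it shows that with a single slot the map tasks finish one after another even though nothing locks the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model only: one split per region, and a thread pool
// whose size plays the role of the available map slots.
final class MapSlotSketch {

    // Number of scheduling "waves" needed to run every split once,
    // i.e. ceil(splits / slots).
    static int waves(int splits, int slots) {
        if (splits < 0 || slots <= 0) {
            throw new IllegalArgumentException("splits >= 0 and slots > 0 required");
        }
        return (splits + slots - 1) / slots;
    }

    // Runs one simulated map task per region and returns the highest
    // number of tasks observed running at the same moment.
    static int peakConcurrency(int regions, int slots, long taskMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> tasks = new ArrayList<>();
        try {
            for (int i = 0; i < regions; i++) {
                tasks.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(taskMillis); // stands in for scanning one region
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            for (Future<?> task : tasks) {
                task.get(); // wait for every simulated mapper to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("simulated map task failed", e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        int regions = 4; // assumed table with four regions, hence four splits
        System.out.println("waves with 1 slot:  " + waves(regions, 1));
        System.out.println("waves with 4 slots: " + waves(regions, 4));
        System.out.println("peak with 1 slot:   " + peakConcurrency(regions, 1, 50));
        System.out.println("peak with 4 slots:  " + peakConcurrency(regions, 4, 50));
    }
}
```

If the map completion graph looks like the one-slot case, the number of concurrently running map tasks is the first thing to check, before suspecting anything on the HBase side.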
On Tue, Sep 4, 2012 at 12:41 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
> I think the issue is that you are misinterpreting what you are seeing and
> what Doug was trying to tell you...
> The short simple answer is that you're getting one split per region. Each
> split is assigned to a specific mapper task and that task will sequentially
> walk through the table finding the rows that match your scan request.
> There is no lock or blocking.
> I think you really should actually read Lars George's book on HBase to get
> a better understanding.
> On Sep 4, 2012, at 11:29 AM, Ioakim Perros <[EMAIL PROTECTED]> wrote:
> > Thank you very much for your response and for the excellent reference.
> > The thing is that I am running jobs in a distributed environment and
> > beyond the TableMapReduceUtil settings,
> > I have just set the scan's caching to the number of rows I expect to
> > retrieve at each map task, and the scan's caching blocks feature to false
> > (just as indicated in the MapReduce examples on HBase's homepage).
> > I am not aware of such a job configuration (requesting the jobtracker to
> > execute more than 1 map task concurrently). Any other ideas?
> > Thank you again and regards,
> > ioakim
> > On 09/04/2012 06:59 PM, Jerry Lam wrote:
> >> Hi Ioakim:
> >> Sorry, your hypothesis doesn't make sense. I would suggest you read
> >> "Learning HBase Internals" by Lars Hofhansl at
> >> to
> >> understand how HBase locking works.
> >> Regarding the issue you are facing, are you sure you configured the job
> >> properly (i.e. requesting the jobtracker to have more than 1 mapper to
> >> execute)? If you are testing on a single machine, you probably need to
> >> configure the number of tasktrackers per node as well to see more than 1
> >> mapper executing on a single machine.
> >> my $0.02
> >> Jerry
> >> On Tue, Sep 4, 2012 at 11:17 AM, Ioakim Perros <[EMAIL PROTECTED]> wrote:
> >>> Hello,
> >>> I would be grateful if someone could shed some light on the following:
> >>> Each M/R map task is reading data from a separate region of a table.
> >>> From the jobtracker's GUI, at the map completion graph, I notice that
> >>> although the data read by the mappers are different, they read data sequentially
> >>> - like the table has a lock that permits only one mapper to read data from
> >>> every region at a time.
> >>> Does this "lock" hypothesis make sense? Is there any way I could avoid
> >>> this useless delay?
> >>> Thanks in advance and regards,
> >>> Ioakim