-
Reading in parallel from table's regions in MapReduce
Ioakim Perros 2012-09-04, 15:17
Hello,
I would be grateful if someone could shed some light on the following:
Each M/R map task is reading data from a separate region of a table. In the jobtracker's GUI, on the map completion graph, I notice that although the mappers read different data, they read it sequentially, as if the table had a lock that permits only one mapper to read at a time, whichever region it reads from.
Does this "lock" hypothesis make sense? Is there any way I could avoid this useless delay?
Thanks in advance and regards, Ioakim
-
Re: Reading in parallel from table's regions in MapReduce
Doug Meil 2012-09-04, 15:32
Hi there-
Yes, there is an input split for each region of the source table of a MR job.
There is a blurb on that in the RefGuide...
http://hbase.apache.org/book.html#splitter
-
Re: Reading in parallel from table's regions in MapReduce
Ioakim Perros 2012-09-04, 15:43
Thank you very much for responding, but this was not exactly what I was looking for.
I have understood the splitting process when M/R jobs read from HBase tables (each M/R task reads from exactly one region).
What I would like to clarify, if possible, is whether there is indeed some "locking" between map tasks that read from different regions of a table (because I noticed a sequential "reading behaviour" across the different map tasks), and if so, how I could avoid it, in order to speed up the procedure and make the map tasks read data in parallel (each from its respective region).
Thank you again very much, hoping there is an answer to that,
Ioakim
-
Re: Reading in parallel from table's regions in MapReduce
Jerry Lam 2012-09-04, 15:59
Hi Ioakim:
Sorry, your hypothesis doesn't make sense. I would suggest you read "Learning HBase Internals" by Lars Hofhansl at http://www.slideshare.net/cloudera/3-learning-h-base-internals-lars-hofhansl-salesforce-final to understand how HBase locking works.
Regarding the issue you are facing, are you sure you configured the job properly (i.e. requesting the jobtracker to have more than 1 mapper executing)? If you are testing on a single machine, you probably need to configure the number of tasktrackers per node as well, to see more than 1 mapper executing on a single machine.
my $0.02
Jerry
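The point about how many mappers can execute at once can be illustrated with a toy model. The sketch below is not Hadoop code: `MapSlotSketch`, `completionWaves`, and the assumption that every map task takes the same time are made up for illustration only. It shows that when only one map slot is available, tasks complete one after another even though nothing locks the table.

```java
import java.util.Arrays;

// Toy model only, NOT Hadoop code. Assumes every map task takes the same
// time and that the cluster offers a fixed number of concurrent map slots.
class MapSlotSketch {

    // Returns, for each map task, the 1-based "wave" in which it completes
    // when tasks are started in order and at most `slots` run at once.
    static int[] completionWaves(int numTasks, int slots) {
        if (numTasks < 0 || slots < 1) {
            throw new IllegalArgumentException("need numTasks >= 0 and slots >= 1");
        }
        int[] waves = new int[numTasks];
        for (int task = 0; task < numTasks; task++) {
            // Tasks 0..slots-1 run in wave 1, the next `slots` tasks in wave 2, ...
            waves[task] = task / slots + 1;
        }
        return waves;
    }

    public static void main(String[] args) {
        // One slot: the four tasks finish strictly one after another.
        System.out.println(Arrays.toString(completionWaves(4, 1))); // [1, 2, 3, 4]
        // Four slots: all four tasks finish in the same wave.
        System.out.println(Arrays.toString(completionWaves(4, 4))); // [1, 1, 1, 1]
    }
}
```

In this model a completion graph that looks sequential is what a single slot produces, and adding slots makes the same tasks overlap.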
-
Re: Reading in parallel from table's regions in MapReduce
Ioakim Perros 2012-09-04, 16:29
Thank you very much for your response and for the excellent reference.
The thing is that I am running jobs in a distributed environment, and beyond the TableMapReduceUtil settings I have just set the scan's caching to the number of rows I expect to retrieve in each map task, and the scan's block caching to false (just as indicated in the MapReduce examples on HBase's homepage).
I am not aware of such a job configuration (requesting the jobtracker to execute more than 1 map task concurrently). Any other ideas?
Thank you again and regards,
Ioakim
-
Re: Reading in parallel from table's regions in MapReduce
Michael Segel 2012-09-04, 16:41
I think the issue is that you are misinterpreting what you are seeing and what Doug was trying to tell you...
The short, simple answer is that you're getting one split per region. Each split is assigned to a specific mapper task, and that task will sequentially walk through the table finding the rows that match your scan request.
There is no lock or blocking.
I think you really should read Lars George's book on HBase to get a better understanding.
HTH
-Mike
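The "one split per region" rule can be sketched roughly as follows. This is a simplified illustration, not the HBase implementation: `RegionSplitSketch` and `splitsFor` are hypothetical names, row keys are plain strings rather than byte arrays, and an empty string stands for an open-ended boundary, as HBase does for the first start key and the last end key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical names, string keys instead of byte arrays.
class RegionSplitSketch {

    // Builds one [startKey, endKey) range per region from the sorted region
    // start keys. Each range would be handed to exactly one map task, which
    // then scans only the rows inside its own range.
    static List<String[]> splitsFor(List<String> regionStartKeys) {
        List<String[]> splits = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            String start = regionStartKeys.get(i);
            // A region ends where the next one starts; the last region is
            // open-ended, represented here by the empty string.
            String end = (i + 1 < regionStartKeys.size()) ? regionStartKeys.get(i + 1) : "";
            splits.add(new String[] {start, end});
        }
        return splits;
    }

    public static void main(String[] args) {
        // Three regions: ["", "g"), ["g", "p"), ["p", "")
        for (String[] split : splitsFor(Arrays.asList("", "g", "p"))) {
            System.out.println("[" + split[0] + ", " + split[1] + ")");
        }
    }
}
```

Because the ranges do not overlap, the map tasks never contend for the same rows; whether they actually run at the same time depends on how many tasks the cluster can execute concurrently.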
-
Re: Reading in parallel from table's regions in MapReduce
Ioakim Perros 2012-09-04, 16:50
I understood that locking is at row level (and that my initial hypothesis is hopefully false), but I was trying to clarify whether there is some job configuration I am missing.
Perhaps you're right and I am misinterpreting the jobtracker's map completion graph.
Thanks for answering.
-
Re: Reading in parallel from table's regions in MapReduce
Jerry Lam 2012-09-04, 17:05
Hi Ioakim:
Here is a list of links I would suggest you read (I know it is a lot to read):
HBase Related:
- http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html
- http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
- make sure to read the examples: http://hbase.apache.org/book/mapreduce.example.html
Hadoop Related:
- http://wiki.apache.org/hadoop/JobTracker
- http://wiki.apache.org/hadoop/TaskTracker
- http://hadoop.apache.org/common/docs/r1.0.3/mapred_tutorial.html
- Some Configurations: http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html
HTH,
Jerry
-
Re: Reading in parallel from table's regions in MapReduce
Ioakim Perros 2012-09-04, 17:15
Jerry, thank you very much for the links.
Regards,
Ioakim
-
Re: Reading in parallel from table's regions in MapReduce
Stack 2012-09-04, 19:13
On Tue, Sep 4, 2012 at 8:17 AM, Ioakim Perros <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I would be grateful if someone could shed a light to the following:
>
> Each M/R map task is reading data from a separate region of a table.
> From the jobtracker 's GUI, at the map completion graph, I notice that
> although data read from mappers are different, they read data sequentially -
> like the table has a lock that permits only one mapper to read data from
> every region at a time.
Is your MapReduce job actually running on the cluster, and not locally in a single thread (as Jerry hints above)?
St.Ack