Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase >> mail # user >> Using Scans in parallel


Copy link to this message
-
Re: Using Scans in parallel
This is just scanning (reads). I'll need to do more testing to find a cause, hopefully it is something with my test.

On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:

> Which version of HBase?
> Are there concurrent inserts? If so, do you see splits in the log files happening while you do the scanning?
>
> I am pretty sure this has nothing to do with concurrent scans.
>
> From: Bryan Keller <[EMAIL PROTECTED]>
> To: Bryan Keller <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]
> Sent: Sunday, October 9, 2011 11:03 AM
> Subject: Re: Using Scans in parallel
>
> On further thought, it seems this might be a serious issue, as two unrelated processes within an application may be scanning the same table at the same time.
>
> On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
>
> > I was not able to get consistent results using multiple scanners in parallel on a table. I implemented a counter test that used 8 scanners in parallel on a table with 2m rows with 2k+ columns each, and the results were not consistent. There were no errors thrown, but the count was off by as much as 2%. Using a single thread gave the same (correct) result every run.
> >
> > I tried various approaches, such as creating an HTable and opening a connection per thread, but I was not able to get stable results. I would do some testing before using parallel scanners as described here.
> >
> >
> > On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
> >
> >> That's part of it, the other part is to get the region demarcations.
> >> You can also just get the smallest and largest key of the table and pick other demarcations for your scans. Then your individual scans will likely cover multiple regions and regionservers.
> >>
> >>
> >> Your threading model depends on your needs. If you interested in lowest latency you want to keep your regionservers busy for each query.
> >> What exactly that means depends on your setup. Maybe you split up the overall scan so that no more than N scans are active at any regionserver.
> >>
> >> If you're more interested in overall predictability, you might not want parallelize each scan too much.
> >>
> >>
> >>
> >> ----- Original Message -----
> >> From: Sam Seigal <[EMAIL PROTECTED]>
> >> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
> >> Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> >> Sent: Wednesday, October 5, 2011 6:18 PM
> >> Subject: Re: Using Scans in parallel
> >>
> >> So the whole point of getting the region locations is to ensure that
> >> there is one thread per region server ?
> >>
> >>
> >> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >>> Hi Sam,
> >>>
> >>>
> >>> There were some attempts to build this in. In the end I think the exact patterns are different based on what one is trying to achieve.
> >>> Currently what you can do is getting all the region locations (HTable.getRegionLocations). From the HRegionInfos you can
> >>> get the regions start and end keys.
> >>> Now you can issue parallel scan for as many regions as you want (by create a Scan object with start and row set to the region's
> >>> start and end key).
> >>> You probably want to group the regions by regionserver and have one thread per region server, or something.
> >>>
> >>>
> >>> -- Lars
> >>> ________________________________
> >>> From: Sam Seigal <[EMAIL PROTECTED]>
> >>> To: [EMAIL PROTECTED]
> >>> Sent: Wednesday, October 5, 2011 4:29 PM
> >>> Subject: Using Scans in parallel
> >>>
> >>> Hi ,
> >>>
> >>> Is there a known way to be able to do Scan's in parallel (in different
> >>> threads even) and then sort/combine the output ?
> >>>
> >>> For a row key like:
> >>>
> >>> prefix-event_type-event_id
> >>> prefix-event_type-event_id
> >>>
> >>> I want to declare two scan objects (for say event_id_type foo)
> >>>
> >>> Scan 1 =>  0-foo
> >>> Scan 2 =>  1-foo
> >>>
> >>> execute the scans in parallel (maybe even in different threads) and
> >>> then merge the results ?
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB