HBase, mail # user - Slow full-table scans


Re: Slow full-table scans
Mohammad Tariq 2012-08-12, 23:00
The getStartKey and getEndKey methods provided by the HRegionInfo class can
be used for that purpose.
Also, please make sure that no HTable instance is left open once you are
done with reads.
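
To make the per-region idea concrete, here is a minimal sketch that runs without a cluster. It is not HBase code: a sorted in-memory map stands in for the table, and the hypothetical KeyRange class stands in for the start/end key pair that would come from HRegionInfo.getStartKey and getEndKey. The names ParallelRegionScan, KeyRange, scanRange and scanAll are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelRegionScan {

    /** Half-open key range [start, end); a stand-in for one region's boundaries. */
    static final class KeyRange {
        final String start; // inclusive
        final String end;   // exclusive
        KeyRange(String start, String end) { this.start = start; this.end = end; }
    }

    /** Reads one range serially; a stand-in for a scan bounded by start and stop row. */
    static List<String> scanRange(NavigableMap<String, String> table, KeyRange r) {
        return new ArrayList<>(table.subMap(r.start, true, r.end, false).values());
    }

    /** Submits one task per range, then concatenates the results in range order. */
    static List<String> scanAll(NavigableMap<String, String> table,
                                List<KeyRange> ranges, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (KeyRange r : ranges) {
                futures.add(pool.submit(() -> scanRange(table, r)));
            }
            List<String> out = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                out.addAll(f.get()); // blocks until that range has been read
            }
            return out;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scanning", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a range scan failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Toy table with keys "a".."i"; the table is only read, never modified.
        NavigableMap<String, String> table = new TreeMap<>();
        for (char c = 'a'; c <= 'i'; c++) {
            table.put(String.valueOf(c), "v-" + c);
        }
        // Three made-up "regions" covering the key space in order.
        List<KeyRange> regions = Arrays.asList(
                new KeyRange("a", "d"), new KeyRange("d", "g"), new KeyRange("g", "j"));
        System.out.println(scanAll(table, regions, 3));
    }
}
```

Because the results are concatenated in the order the ranges were submitted, the output stays sorted as long as the ranges are non-overlapping and listed in key order. Against a real table, each task would presumably open its own HTable (HTable is not safe to share across threads), run a Scan bounded by that region's start and end key, and close the table when done.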
Regards,
    Mohammad Tariq

On Mon, Aug 13, 2012 at 4:22 AM, Gurjeet Singh <[EMAIL PROTECTED]> wrote:

> Hi Mohammad,
>
> This is a great idea. Is there an API call to determine the start/end
> key for each region?
>
> Thanks,
> Gurjeet
>
> On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
> > Hello experts,
> >
> >        Would it be feasible to create a separate thread for each region? I
> > mean we can determine the start and end key of each region and issue a scan
> > for each region in parallel.
> >
> > Regards,
> >     Mohammad Tariq
> >
> >
> >
> > On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >
> >> Do you really have to retrieve all 200,000 each time?
> >> Scan.setBatch(...) makes no difference?! (note that batching is different
> >> and separate from caching).
> >>
> >> Also note that the scanner contract is to return sorted KVs, so a single
> >> scan cannot be parallelized across RegionServers (well, not entirely true:
> >> it could be farmed off in parallel and then be presented to the client in
> >> the right order, but HBase is not doing that). That is why one vs. 12 RSs
> >> makes no difference in this scenario.
> >>
> >> In the 12 node case you'll see low CPU on all but one RS, and each RS will
> >> get its turn.
> >>
> >> In your case this is scanning 20,000,000 KVs serially in 400s, that's
> >> 50,000 KVs/s, which - depending on hardware - is not too bad for HBase (but
> >> not great either).
> >>
> >> If you only ever expect to run a single query like this on top of your
> >> cluster (i.e. your concern is latency, not throughput) you can do multiple
> >> RPCs in parallel, each for a sub-portion of your key range. Together with
> >> batching, you can start using values before everything is streamed back
> >> from the server.
> >>
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ----- Original Message -----
> >> From: Gurjeet Singh <[EMAIL PROTECTED]>
> >> To: [EMAIL PROTECTED]
> >> Cc:
> >> Sent: Saturday, August 11, 2012 11:04 PM
> >> Subject: Slow full-table scans
> >>
> >> Hi,
> >>
> >> I am trying to read all the data out of an HBase table using a scan
> >> and it is extremely slow.
> >>
> >> Here are some characteristics of the data:
> >>
> >> 1. The total table size is tiny (~200MB)
> >> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> >> Thus the size of each cell is ~10 bytes and the size of each row is
> >> ~2MB
> >> 3. Currently scanning the whole table takes ~400s (both in a
> >> distributed setting with 12 nodes or so and on a single node), thus
> >> 5sec/row
> >> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
> >> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> >> and is set to fetch 100MB of data at a time (scan.setCaching)
> >> 6. Changing the caching size seems to have no effect on the total scan
> >> time at all
> >> 7. The column family is set up to keep a single version of the cells,
> >> no compression, and no block cache.
> >>
> >> Am I missing something ? Is there a way to optimize this ?
> >>
> >> I guess a general question I have is whether HBase is a good datastore
> >> for storing many medium-sized (~50GB), dense datasets with lots of
> >> columns when a lot of the queries require full table scans?
> >>
> >> Thanks!
> >> Gurjeet
> >>
> >>
>