HBase, mail # user - Slow full-table scans


Re: Slow full-table scans
Jacques 2012-08-12, 22:59
HTable.getRegionLocations()
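
As a rough sketch of the per-region parallel scan Mohammad suggests below, assuming the 0.94-era client API where getRegionLocations() returns a NavigableMap<HRegionInfo, ServerName> (the table name "t1" and the caching/batch values are just placeholders):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ParallelRegionScan {
  public static void main(String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    final String tableName = "t1";                   // placeholder table name

    // One entry per region; each HRegionInfo carries the region's start and end key.
    HTable meta = new HTable(conf, tableName);
    NavigableMap<HRegionInfo, ServerName> regions = meta.getRegionLocations();
    meta.close();

    ExecutorService pool = Executors.newFixedThreadPool(regions.size());
    List<Future<Long>> futures = new ArrayList<Future<Long>>();
    for (final HRegionInfo region : regions.keySet()) {
      futures.add(pool.submit(new Callable<Long>() {
        public Long call() throws IOException {
          // Restrict this worker's scan to one region's key range.
          Scan scan = new Scan(region.getStartKey(), region.getEndKey());
          scan.setCaching(100);       // rows (or row fragments) per RPC
          scan.setBatch(10000);       // max KVs per Result; splits very wide rows
          scan.setCacheBlocks(false);
          HTable table = new HTable(conf, tableName); // HTable is not thread-safe
          ResultScanner scanner = table.getScanner(scan);
          long kvs = 0;
          try {
            for (Result r : scanner) {
              kvs += r.size();        // replace with real per-row processing
            }
          } finally {
            scanner.close();
            table.close();
          }
          return kvs;
        }
      }));
    }

    long total = 0;
    for (Future<Long> f : futures) {
      total += f.get();
    }
    pool.shutdown();
    System.out.println("KVs scanned: " + total);
  }
}

Each worker opens its own HTable since HTable instances are not thread-safe, and results come back per region rather than in global row order, which matches Lars's point below about the sorted-scanner contract.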

I didn't realize the KeyValue serialization/deserialization happened on a
separate thread in the HBase client code.

J

On Sun, Aug 12, 2012 at 3:52 PM, Gurjeet Singh <[EMAIL PROTECTED]> wrote:

> Hi Mohammad,
>
> This is a great idea. Is there an API call to determine the start/end
> key for each region?
>
> Thanks,
> Gurjeet
>
> On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq <[EMAIL PROTECTED]>
> wrote:
> > Hello experts,
> >
> >        Would it be feasible to create a separate thread for each region?
> > I mean we can determine the start and end key of each region and issue a
> > scan for each region in parallel.
> >
> > Regards,
> >     Mohammad Tariq
> >
> >
> >
> > On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >
> >> Do you really have to retrieve all 200,000 each time?
> >> Scan.setBatch(...) makes no difference?! (note that batching is different
> >> and separate from caching).
> >>
> >> Also note that the scanner contract is to return sorted KVs, so a single
> >> scan cannot be parallelized across RegionServers (well, not entirely true:
> >> it could be farmed off in parallel and then be presented to the client in
> >> the right order - but HBase is not doing that). That is why one vs. 12 RSs
> >> makes no difference in this scenario.
> >>
> >> In the 12-node case you'll see low CPU on all but one RS, and each RS
> >> will get its turn.
> >>
> >> In your case this is scanning 20,000,000 KVs serially in 400s, that's
> >> 50,000 KVs/s, which - depending on hardware - is not too bad for HBase
> >> (but not great either).
> >>
> >> If you only ever expect to run a single query like this on top of your
> >> cluster (i.e. your concern is latency, not throughput) you can do multiple
> >> RPCs in parallel, each for a sub-portion of your key range. Together with
> >> batching you can start using the values before everything is streamed back
> >> from the server.
> >>
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ----- Original Message -----
> >> From: Gurjeet Singh <[EMAIL PROTECTED]>
> >> To: [EMAIL PROTECTED]
> >> Cc:
> >> Sent: Saturday, August 11, 2012 11:04 PM
> >> Subject: Slow full-table scans
> >>
> >> Hi,
> >>
> >> I am trying to read all the data out of an HBase table using a scan
> >> and it is extremely slow.
> >>
> >> Here are some characteristics of the data:
> >>
> >> 1. The total table size is tiny (~200MB)
> >> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> >> Thus the size of each cell is ~10 bytes and the size of each row is
> >> ~2MB
> >> 3. Currently scanning the whole table takes ~400s (both in a
> >> distributed setting with 12 nodes or so and on a single node), thus
> >> 5sec/row
> >> 4. The row keys are unique 8-byte crypto hashes of sequential numbers
> >> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> >> and is set to fetch 100MB of data at a time (scan.setCaching)
> >> 6. Changing the caching size seems to have no effect on the total scan
> >> time at all
> >> 7. The column family is setup to keep a single version of the cells,
> >> no compression, and no block cache.
> >>
> >> Am I missing something? Is there a way to optimize this?
> >>
> >> I guess a general question I have is whether HBase is a good datastore
> >> for storing many medium-sized (~50GB), dense datasets with lots of
> >> columns when a lot of the queries require full-table scans?
> >>
> >> Thanks!
> >> Gurjeet
> >>
> >>
>
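
For completeness, a short sketch of the setBatch/setCaching distinction Lars draws above, again assuming the 0.94-era client API; the specific values are illustrative only:

import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class BatchedFullScan {
  // Streams the whole table, letting the client work on partial rows as they arrive.
  static long scanAll(HTable table) throws IOException {
    Scan scan = new Scan();
    scan.setCaching(10);        // ROWS (or row fragments) returned per RPC, not bytes
    scan.setBatch(10000);       // max KVs per Result; a ~200,000-column row then
                                // arrives as roughly 20 partial Results
    scan.setCacheBlocks(false); // a one-off full scan would only churn the block cache
    long kvs = 0;
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result partial : scanner) {
        for (KeyValue kv : partial.raw()) {
          kvs++;                // replace with real per-cell processing
        }
      }
    } finally {
      scanner.close();
    }
    return kvs;
  }
}

With the batch size smaller than the row width, the client can start processing a 200,000-column row while the rest of it is still streaming from the server, which is the latency win Lars describes.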