HBase, mail # user - Slow full-table scans


Re: Slow full-table scans
Mohammad Tariq 2012-08-12, 22:49
Hello experts,

       Would it be feasible to create a separate thread for each region? I
mean, we could determine the start and end keys of each region and issue a
scan for each region in parallel.
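
Something along these lines, maybe (a completely untested sketch against
the 0.94 client API; the table name, pool size, and caching value below
are made-up placeholders):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Pair;

public class ParallelRegionScan {
  public static void main(String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");
    // Start and end keys of every region, index-aligned.
    Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
    ExecutorService pool = Executors.newFixedThreadPool(12);
    List<Future<Long>> counts = new ArrayList<Future<Long>>();
    for (int i = 0; i < keys.getFirst().length; i++) {
      final byte[] start = keys.getFirst()[i];
      final byte[] stop = keys.getSecond()[i];
      counts.add(pool.submit(new Callable<Long>() {
        public Long call() throws Exception {
          // HTable is not thread-safe, so each worker gets its own.
          HTable t = new HTable(conf, "mytable");
          Scan scan = new Scan(start, stop);
          scan.setCaching(100);
          ResultScanner scanner = t.getScanner(scan);
          long kvs = 0;
          for (Result r : scanner) {
            kvs += r.size();
          }
          scanner.close();
          t.close();
          return kvs;
        }
      }));
    }
    long total = 0;
    for (Future<Long> f : counts) {
      total += f.get();
    }
    pool.shutdown();
    table.close();
    System.out.println("KVs scanned: " + total);
  }
}

The results would come back unordered across regions, but for a full-table
read where order doesn't matter that should be fine.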

Regards,
    Mohammad Tariq

On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Do you really have to retrieve all 200,000 columns each time?
> Scan.setBatch(...) makes no difference?! (Note that batching is different
> and separate from caching.)
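>
> To make the two knobs concrete (a tiny sketch; the numbers are made up):
>
> import org.apache.hadoop.hbase.client.Scan;
>
> Scan scan = new Scan();
> // caching: how many Results the client fetches per RPC round trip
> scan.setCaching(10);
> // batching: how many KVs (columns) of a row go into one Result, so a
> // very wide row comes back in pieces instead of as one huge Result
> scan.setBatch(1000);
>
> With a batch of 1000, each of your ~200,000-column rows would arrive as
> ~200 partial Results, so the client can start working long before the
> whole row has been transferred.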
>
> Also note that the scanner contract is to return sorted KVs, so a single
> scan cannot be parallelized across RegionServers (well, not entirely true:
> the work could be farmed out in parallel and then presented to the client
> in the right order - but HBase does not do that). That is why one vs. 12
> RSs makes no difference in this scenario.
>
> In the 12-node case you'll see low CPU on all but one RS, and each RS will
> get its turn.
>
> In your case this is scanning 20,000,000 KVs serially in 400s; that's
> 50,000 KVs/s, which - depending on hardware - is not too bad for HBase
> (but not great either).
>
> If you only ever expect to run a single query like this on top of your
> cluster (i.e. your concern is latency, not throughput) you can do multiple
> RPCs in parallel, each covering a sub-portion of your key range. Together
> with batching you can then start using values before everything has been
> streamed back from the server.
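>
> A sketch of that idea (untested; it leans on Bytes.split() and on your
> keys being fixed-width 8-byte hashes, so the key space is roughly
> uniformly populated):
>
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class SubRangeScans {
>   // Chop the 8-byte key space into numChunks contiguous ranges and
>   // build one Scan per range.
>   public static List<Scan> subScans(int numChunks) {
>     byte[] start = new byte[8];                        // 0x00...00
>     byte[] end = {-1, -1, -1, -1, -1, -1, -1, -1};     // 0xff...ff
>     byte[][] bounds = Bytes.split(start, end, numChunks - 1);
>     List<Scan> scans = new ArrayList<Scan>();
>     for (int i = 0; i < bounds.length - 1; i++) {
>       Scan scan = new Scan(bounds[i], bounds[i + 1]);
>       scan.setBatch(1000);  // stream each wide row back in pieces
>       scan.setCaching(10);
>       scans.add(scan);
>     }
>     return scans;
>   }
> }
>
> Each of those scans can then run on its own thread with its own HTable
> instance, and the client can consume partial rows as they arrive.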
>
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Gurjeet Singh <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc:
> Sent: Saturday, August 11, 2012 11:04 PM
> Subject: Slow full-table scans
>
> Hi,
>
> I am trying to read all the data out of an HBase table using a scan
> and it is extremely slow.
>
> Here are some characteristics of the data:
>
> 1. The total table size is tiny (~200MB)
> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> Thus the size of each cell is ~10 bytes and the size of each row is
> ~2MB
> 3. Currently scanning the whole table takes ~400s (both in a
> distributed setting with 12 nodes or so and on a single node), thus
> ~4s/row
> 4. The row keys are unique 8-byte crypto hashes of sequential numbers
> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> and is set to fetch 100MB of data at a time (scan.setCaching)
> 6. Changing the caching size seems to have no effect on the total scan
> time at all
> 7. The column family is set up to keep a single version of the cells,
> with no compression and no block cache.
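>
> In code the scan setup is roughly the following (a sketch; the family
> name "fam" and the htable handle are placeholders, and note that
> setCaching() takes a row count, not bytes, so 100MB works out to ~50
> of these ~2MB rows):
>
> Scan scan = new Scan();
> scan.addFamily(Bytes.toBytes("fam"));
> scan.setBatch(200000);  // enough KVs to hold a FULL row per Result
> scan.setCaching(50);    // ~50 rows x ~2MB/row = ~100MB per fetch
> ResultScanner scanner = htable.getScanner(scan);
> for (Result row : scanner) {
>   // ~200,000 KVs per row are processed here
> }
> scanner.close();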
>
> Am I missing something? Is there a way to optimize this?
>
> I guess a general question I have is whether HBase is a good datastore
> for storing many medium-sized (~50GB), dense datasets with lots of
> columns when a lot of the queries require full-table scans.
>
> Thanks!
> Gurjeet
>
>