Re: Slow full-table scans
I think the first question is where the time is spent. Does your analysis
show that all of the time is spent on the regionservers, or is a portion of
the bottleneck on the client side?

Jacques

On Sun, Aug 12, 2012 at 4:00 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> The getStartKey and getEndKey methods provided by the HRegionInfo class can
> be used for that purpose.
> Also, please make sure no HTable instance is left open once you are
> done with reads.
> Regards,
>     Mohammad Tariq
>
>
>
> On Mon, Aug 13, 2012 at 4:22 AM, Gurjeet Singh <[EMAIL PROTECTED]> wrote:
>
> > Hi Mohammad,
> >
> > This is a great idea. Is there an API call to determine the start/end
> > key for each region?
> >
> > Thanks,
> > Gurjeet
> >
> > On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq <[EMAIL PROTECTED]>
> > wrote:
> > > Hello experts,
> > >
> > > Would it be feasible to create a separate thread for each region?
> > > I mean, we can determine the start and end key of each region and
> > > issue a scan for each region in parallel.
> > >
> > > Regards,
> > >     Mohammad Tariq
> > >
> > >
> > >
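
A minimal sketch of the per-region parallel scan suggested above, assuming an
HBase 0.94-era Java client; the table name "widetable", the family name "f",
and the caching/batch values are placeholders rather than values from the
thread.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class ParallelRegionScan {

    public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        final String tableName = "widetable";         // placeholder table name
        final byte[] family = Bytes.toBytes("f");     // placeholder family name

        // Region boundaries: pair i of (startKeys, endKeys) delimits region i,
        // the same values HRegionInfo.getStartKey()/getEndKey() would return.
        HTable boundaries = new HTable(conf, tableName);
        Pair<byte[][], byte[][]> keys = boundaries.getStartEndKeys();
        boundaries.close();                           // don't leave HTable instances open

        int regions = keys.getFirst().length;
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, regions));
        List<Future<Long>> results = new ArrayList<Future<Long>>();

        for (int i = 0; i < regions; i++) {
            final byte[] start = keys.getFirst()[i];
            final byte[] stop = keys.getSecond()[i];
            results.add(pool.submit(new Callable<Long>() {
                public Long call() throws IOException {
                    // One HTable per thread: HTable is not thread-safe.
                    HTable table = new HTable(conf, tableName);
                    try {
                        Scan scan = new Scan(start, stop); // scan only this region
                        scan.addFamily(family);
                        scan.setCaching(10);               // rows per RPC
                        scan.setBatch(1000);               // KVs per Result
                        ResultScanner scanner = table.getScanner(scan);
                        long kvs = 0;
                        for (Result r : scanner) {
                            kvs += r.size();               // stand-in for real per-row work
                        }
                        scanner.close();
                        return kvs;
                    } finally {
                        table.close();
                    }
                }
            }));
        }

        long total = 0;
        for (Future<Long> f : results) {
            total += f.get();
        }
        pool.shutdown();
        System.out.println("KVs scanned: " + total);
    }
}

Since a ~200MB table typically lives in a single region, this only buys
parallelism once the table is split (or pre-split) across several regions.
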
> > > On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > >> Do you really have to retrieve all 200,000 columns each time?
> > >> Scan.setBatch(...) makes no difference?! (note that batching is
> > >> different and separate from caching).
> > >>
> > >> Also note that the scanner contract is to return sorted KVs, so a single
> > >> scan cannot be parallelized across RegionServers (well, not entirely
> > >> true: it could be farmed off in parallel and then be presented to the
> > >> client in the right order - but HBase is not doing that). That is why
> > >> one vs. 12 RSs makes no difference in this scenario.
> > >>
> > >> In the 12-node case you'll see low CPU on all but one RS, and each RS
> > >> will get its turn.
> > >>
> > >> In your case this is scanning 20,000,000 KVs serially in 400s, that's
> > >> 50,000 KVs/s, which - depending on hardware - is not too bad for HBase
> > >> (but not great either).
> > >>
> > >> If you only ever expect to run a single query like this on top of your
> > >> cluster (i.e. your concern is latency, not throughput), you can do
> > >> multiple RPCs in parallel for sub-portions of your key range. Together
> > >> with batching, you can start using the values before everything is
> > >> streamed back from the server.
> > >>
> > >>
> > >> -- Lars
> > >>
> > >>
> > >>
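
A small illustration of the batching vs. caching distinction drawn above,
assuming the standard Scan API of that era; the family name and the concrete
numbers are arbitrary.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchVsCaching {

    public static Scan buildScan() {
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("f"));  // placeholder family name

        // setCaching: how many ROWS the client fetches per RPC round trip.
        scan.setCaching(10);

        // setBatch: how many COLUMNS (KVs) are packed into a single Result.
        // A ~200,000-column row then arrives as a stream of partial Results
        // of 1,000 KVs each instead of one ~2MB object; this is independent
        // of the caching setting above.
        scan.setBatch(1000);
        return scan;
    }
}
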
> > >> ----- Original Message -----
> > >> From: Gurjeet Singh <[EMAIL PROTECTED]>
> > >> To: [EMAIL PROTECTED]
> > >> Cc:
> > >> Sent: Saturday, August 11, 2012 11:04 PM
> > >> Subject: Slow full-table scans
> > >>
> > >> Hi,
> > >>
> > >> I am trying to read all the data out of an HBase table using a scan
> > >> and it is extremely slow.
> > >>
> > >> Here are some characteristics of the data:
> > >>
> > >> 1. The total table size is tiny (~200MB)
> > >> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> > >> Thus the size of each cell is ~10 bytes and the size of each row is
> > >> ~2MB
> > >> 3. Currently scanning the whole table takes ~400s (both in a
> > >> distributed setting with 12 nodes or so and on a single node), thus
> > >> ~4 sec/row
> > >> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
> > >> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> > >> and is set to fetch 100MB of data at a time (scan.setCaching)
> > >> 6. Changing the caching size seems to have no effect on the total scan
> > >> time at all
> > >> 7. The column family is set up to keep a single version of the cells,
> > >> no compression, and no block cache.
> > >>
> > >> Am I missing something? Is there a way to optimize this?
> > >>
> > >> I guess a general question I have is whether HBase is a good datastore
> > >> for storing many medium-sized (~50GB), dense datasets with lots of
> > >> columns when a lot of the queries require full-table scans?
> > >>
> > >> Thanks!
> > >> Gurjeet
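
For completeness, a sketch of a table definition matching the description in
points 2 and 7 above (a single family, one version per cell, no compression,
no block cache), assuming the 0.92/0.94 admin API; the table and family names
are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CreateWideTable {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor table = new HTableDescriptor("widetable"); // placeholder name
        HColumnDescriptor family = new HColumnDescriptor("f");      // the SINGLE family
        family.setMaxVersions(1);                                   // one version per cell
        family.setCompressionType(Compression.Algorithm.NONE);      // no compression
        family.setBlockCacheEnabled(false);                         // no block cache
        table.addFamily(family);

        admin.createTable(table);
        admin.close();
    }
}
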