HBase user mailing list: Pagination with HBase - getting previous page of data


Thread:
Vijay Ganesan 2013-01-25, 04:58
Mohammad Tariq 2013-01-25, 05:12
Jean-Marc Spaggiari 2013-01-25, 12:38
anil gupta 2013-01-25, 17:07
Jean-Marc Spaggiari 2013-01-25, 17:17
anil gupta 2013-01-25, 17:43
Jean-Marc Spaggiari 2013-01-26, 02:58
anil gupta 2013-01-28, 03:31
Jean-Marc Spaggiari 2013-01-29, 21:08
anil gupta 2013-01-29, 21:16
Jean-Marc Spaggiari 2013-01-29, 21:40
anil gupta 2013-01-30, 07:49
Mohammad Tariq 2013-01-30, 03:32
anil gupta 2013-01-30, 08:03
Anoop Sam John 2013-01-30, 11:31
Jean-Marc Spaggiari 2013-01-30, 12:18
Toby Lazar 2013-01-30, 12:42
Asaf Mesika 2013-02-03, 14:07
Anoop Sam John 2013-01-31, 03:23
anil gupta 2013-02-02, 08:02
Anoop John 2013-02-03, 16:07
anil gupta 2013-02-03, 17:21
Toby Lazar 2013-02-03, 17:25
Re: Pagination with HBase - getting previous page of data
Inline...
On Sun, Feb 3, 2013 at 9:25 AM, Toby Lazar <[EMAIL PROTECTED]> wrote:

> Quick question - if you perform the pagination client-side and just
> call scanner.iterator().next()
> to get to the necessary results, doesn't this add unnecessary network
> traffic of the unused results?
Anil: It depends on the solution. If 95% of your scans are limited to a
single region, then there won't be unnecessary network I/O.

>  If you want results 100-120, does the
> client need to first read results 1-100 over the network?
Anil: If you do a simple scan and you want results 100-120, then I would say
yes. You might avoid that by using a pagination filter or by writing a
custom filter or coprocessor. As I have mentioned earlier in this thread,
we won't be allowing the user to jump to 100-120 directly. So the user
first needs to go through results 1-100. Hence, I will know the rowkey of
the 100th result, and that rowkey will become my startKey for fetching
results 100-120. So, no unnecessary network I/O.
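The keyset approach described above (remember the last rowkey of a page and start the next scan from it) can be sketched in plain Java. In a real HBase client you would set the scan's start row past the last fetched rowkey (e.g. `Scan.withStartRow(row, false)` in newer client versions) and cap the result count; here a `TreeMap` stands in for the sorted table, and the class and method names are illustrative, not HBase API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class KeysetPagination {

    // Returns up to pageSize rowkeys strictly after lastRowKey.
    // A null lastRowKey means: start from the beginning of the table.
    static List<String> nextPage(NavigableMap<String, String> table,
                                 String lastRowKey, int pageSize) {
        NavigableMap<String, String> tail =
                (lastRowKey == null) ? table : table.tailMap(lastRowKey, false);
        List<String> page = new ArrayList<>();
        for (String rowKey : tail.keySet()) {
            page.add(rowKey);
            if (page.size() == pageSize) {
                break;
            }
        }
        return page;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 1; i <= 12; i++) {
            table.put(String.format("row%03d", i), "value" + i);
        }
        List<String> page1 = nextPage(table, null, 5);
        // The last rowkey of page1 becomes the (exclusive) start of page2,
        // so none of page1's rows are transferred a second time.
        List<String> page2 = nextPage(table, page1.get(page1.size() - 1), 5);
        System.out.println(page1);
        System.out.println(page2);
    }
}
```

Note that this only supports "next page" navigation, which matches the constraint above that users cannot jump to an arbitrary page.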

>  Couldn't a
> filter help prevent some of that unneeded traffic?  Or, is the data only
> transferred when inspecting the result object?
>

Anil: Filters might help reduce unnecessary traffic. It all depends on your
use case.

>
> Thanks,
>
> Toby
> On Sun, Feb 3, 2013 at 11:07 AM, Anoop John <[EMAIL PROTECTED]> wrote:
>
> > >lets say for a scan setCaching is
> > 10 and scan is done across two regions. 9 Results(satisfying the filter)
> > are in Region1 and 10 Results(satisfying the filter) are in Region2. Then
> > will this scan return 19 (9+10) results?
> >
> > @Anil.
> > No, it will return only 10 results, not 19. The client here takes into
> > account the number of results received from the previous region. But a
> > filter is different. A filter has no logic at the client side; it is
> > fully executed at the server side. This is the way it is designed.
> > Personally, I would prefer to do the pagination in the app alone, using
> > a plain scan with caching (to avoid so many RPCs) and app-level logic.
> >
> > -Anoop-
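Anoop's point that the client-side scanner accounts for rows already received when it crosses a region boundary, unlike a server-side filter, can be illustrated with a toy model. The class and method names below are illustrative, not the HBase client API:

```java
import java.util.ArrayList;
import java.util.List;

public class CachingAcrossRegions {

    // Toy model of one region: it returns at most 'limit' matching rows.
    static List<String> regionNext(List<String> regionRows, int limit) {
        return new ArrayList<>(
                regionRows.subList(0, Math.min(limit, regionRows.size())));
    }

    // The client tracks how many rows it already has and asks the next
    // region only for the remainder. This is why a next() call with
    // caching 10 across two regions yields 10 rows, not 9 + 10.
    static List<String> clientNext(List<List<String>> regions, int caching) {
        List<String> out = new ArrayList<>();
        for (List<String> region : regions) {
            int remaining = caching - out.size();
            if (remaining == 0) {
                break;
            }
            out.addAll(regionNext(region, remaining));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> region1 = new ArrayList<>();
        for (int i = 1; i <= 9; i++) region1.add("r1-" + i);   // 9 matches
        List<String> region2 = new ArrayList<>();
        for (int i = 1; i <= 10; i++) region2.add("r2-" + i);  // 10 matches

        List<String> batch = clientNext(List.of(region1, region2), 10);
        System.out.println(batch.size()); // 10, not 19
    }
}
```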
> >
> > On Sat, Feb 2, 2013 at 1:32 PM, anil gupta <[EMAIL PROTECTED]>
> wrote:
> >
> > > Hi Anoop,
> > >
> > > Please find my reply inline.
> > >
> > > Thanks,
> > > Anil
> > >
> > > On Wed, Jan 30, 2013 at 3:31 AM, Anoop Sam John <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > @Anil
> > > >
> > > > >I could not understand why it goes to multiple regionservers in
> > > > >parallel. Why can it not guarantee results <= page size (my guess:
> > > > >due to multiple RS scans)? If you have used it then maybe you can
> > > > >explain the behaviour?
> > > >
> > > > A scan from the client side never goes to multiple RS in parallel.
> > > > A scan through the HTable API is sequential, one region after the
> > > > other. For every region it opens a scanner on the RS and does
> > > > next() calls. The filter is instantiated at the server side, per
> > > > region ...
> > > >
> > > > When you need 100 rows in the page and you created a Scan at the
> > > > client side with the filter, and suppose there are 2 regions: first
> > > > the scanner is opened for region1 and the scan happens. It ensures
> > > > that at most 100 rows are retrieved from that region. But when the
> > > > region boundary is crossed and the client automatically opens a
> > > > scanner for region2, it again passes the filter with max 100 rows,
> > > > so from there also up to 100 rows can come. So overall, at the
> > > > client side, we cannot guarantee that the scan created will return
> > > > only 100 rows as a whole from the table.
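The behaviour Anoop describes, where the filter's counter restarts in every region so the total can exceed the page size, can be sketched with a toy model of HBase's `PageFilter` (the simulation names are illustrative; only `PageFilter` itself is real HBase):

```java
import java.util.ArrayList;
import java.util.List;

public class PageFilterPerRegion {

    // Toy model of PageFilter: it is instantiated fresh for every region,
    // so its row counter restarts at each region boundary.
    static List<String> scanRegionWithPageFilter(List<String> regionRows,
                                                 int pageSize) {
        List<String> out = new ArrayList<>();
        for (String row : regionRows) {
            if (out.size() == pageSize) {
                break; // the filter stops at pageSize *per region*
            }
            out.add(row);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> region1 = new ArrayList<>();
        for (int i = 1; i <= 9; i++) region1.add("r1-" + i);
        List<String> region2 = new ArrayList<>();
        for (int i = 1; i <= 10; i++) region2.add("r2-" + i);

        // With a page size of 10, the filter alone can let 9 + 10 = 19
        // rows through, so the client must still trim to the page size.
        List<String> raw = new ArrayList<>();
        raw.addAll(scanRegionWithPageFilter(region1, 10));
        raw.addAll(scanRegionWithPageFilter(region2, 10));
        System.out.println(raw.size()); // 19

        List<String> page = raw.subList(0, Math.min(10, raw.size()));
        System.out.println(page.size()); // 10
    }
}
```

This is also why the HBase PageFilter documentation warns that the filter bounds rows per region, and a final client-side limit is still required.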
> > > >
> > >
> > > I agree with other people on this email chain that the 2nd region
> > > should only return (100 - no. of rows returned by region1), if
> > > possible.
> > >
> > > When the region boundary is crossed and the client automatically
> > > opens a scanner for region2, why doesn't the scanner in region2 know
> > > that some of the rows were already fetched by region1? Do you mean to
> > > say that, by default,

Thanks & Regards,
Anil Gupta