HBase >> mail # user >> Pagination with HBase - getting previous page of data


Re: Pagination with HBase - getting previous page of data
Inline...
On Sun, Feb 3, 2013 at 9:25 AM, Toby Lazar <[EMAIL PROTECTED]> wrote:

> Quick question - if you perform the pagination client-side and just
> call scanner.iterator().next() to get to the necessary results, doesn't
> this add unnecessary network traffic from the unused results?
Anil: It depends on the solution. If 95% of your scans are limited to a
single region, then there won't be unnecessary network I/O.

>  If you want results 100-120, does the
> client need to first read results 1-100 over the network?
Anil: If you do a simple scan and you want results 100-120, then I would say
yes. You could fetch only 100-120 by using a pagination filter or by writing
a custom filter or coprocessor. As I mentioned earlier in this thread, we
won't allow the user to jump to 100-120 directly. The user first has to page
through results 1-100, so I will know the row key of the 100th result, and
that row key becomes my startKey for results 100-120. Hence, no unnecessary
network I/O.
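The start-key scheme described above can be sketched in code. A real example needs a running HBase cluster, so the following is a self-contained Java simulation: a TreeMap stands in for a region's sorted row keys, and the hypothetical fetchPage helper shows how the last row key of one page becomes the (exclusive) start key of the next.

```java
import java.util.*;

public class PageScan {
    // A sorted map stands in for an HBase region: row key -> value.
    static NavigableMap<String, String> region = new TreeMap<>();

    // Fetch one page: start *after* the given row key (exclusive) and
    // return at most pageSize row keys. The caller remembers the last
    // key of the page and passes it back to get the next page.
    static List<String> fetchPage(String afterKey, int pageSize) {
        List<String> page = new ArrayList<>();
        NavigableMap<String, String> tail =
            afterKey == null ? region : region.tailMap(afterKey, false);
        for (String key : tail.keySet()) {
            page.add(key);
            if (page.size() == pageSize) break;
        }
        return page;
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 25; i++) region.put(String.format("row-%03d", i), "v");

        List<String> page1 = fetchPage(null, 10);          // rows 1-10
        String last = page1.get(page1.size() - 1);         // "row-010"
        List<String> page2 = fetchPage(last, 10);          // rows 11-20
        System.out.println(page1.get(0) + " " + last + " " + page2.get(0));
        // prints: row-001 row-010 row-011
    }
}
```

With the real client API the same idea would use Scan.withStartRow with the remembered key, so no rows before the page are shipped over the network.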

>  Couldn't a
> filter help prevent some of that unneeded traffic?  Or, is the data only
> transferred when inspecting the result object?
>

Anil: Filters might help reduce unnecessary traffic. It all depends on your
use case.

>
> Thanks,
>
> Toby
> On Sun, Feb 3, 2013 at 11:07 AM, Anoop John <[EMAIL PROTECTED]> wrote:
>
> > >let's say setCaching for a scan is 10 and the scan spans two regions.
> > 9 results (satisfying the filter) are in Region1 and 10 results
> > (satisfying the filter) are in Region2. Will this scan then return
> > 19 (9+10) results?
> >
> > @Anil.
> > No, it will return only 10 results, not 19. Here the client takes into
> > account the number of results already fetched from the previous region.
> > But a filter is different: a filter has no client-side logic; it is
> > fully executed on the server side. That is how it is designed.
> > Personally, I would prefer to do the pagination in the app alone, using
> > a plain scan with caching (to avoid too many RPCs) and app-level logic.
> >
> > -Anoop-
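Anoop's distinction (the caching count is enforced by the client across region boundaries, while filters run entirely server-side) can be illustrated with a small self-contained Java sketch. The two lists stand in for regions, and the hypothetical next() method mimics the client's batching logic:

```java
import java.util.*;

public class CachingScan {
    // Two "regions", each holding the row keys that satisfy the filter.
    static List<String> region1 = new ArrayList<>();
    static List<String> region2 = new ArrayList<>();

    // The client asks each region in order, telling it how many rows are
    // still needed. This is the client-side accounting: the caching count
    // is tracked across region boundaries, so the batch never exceeds it.
    static List<String> next(int caching) {
        List<String> batch = new ArrayList<>();
        for (List<String> region : Arrays.asList(region1, region2)) {
            int remaining = caching - batch.size();
            if (remaining <= 0) break;
            batch.addAll(region.subList(0, Math.min(remaining, region.size())));
        }
        return batch;
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 9;  i++) region1.add("r1-" + i);  // 9 matches
        for (int i = 1; i <= 10; i++) region2.add("r2-" + i);  // 10 matches
        System.out.println(next(10).size()); // prints 10, not 19
    }
}
```

A filter cannot do this kind of accounting because each region server evaluates its own copy of the filter with no shared state.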
> >
> > On Sat, Feb 2, 2013 at 1:32 PM, anil gupta <[EMAIL PROTECTED]>
> wrote:
> >
> > > Hi Anoop,
> > >
> > > Please find my reply inline.
> > >
> > > Thanks,
> > > Anil
> > >
> > > On Wed, Jan 30, 2013 at 3:31 AM, Anoop Sam John <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > @Anil
> > > >
> > > > >I could not understand why it goes to multiple region servers in
> > > > parallel. Why can it not guarantee results <= page size (my guess:
> > > > due to multiple RS scans)? If you have used it, then maybe you can
> > > > explain the behaviour?
> > > >
> > > > A scan from the client side never goes to multiple RSs in parallel.
> > > > A scan through the HTable API is sequential, one region after the
> > > > other. For every region it opens a scanner on the RS and makes
> > > > next() calls. The filter is instantiated on the server side, per
> > > > region ...
> > > >
> > > > Suppose you need 100 rows in the page, you create a Scan at the
> > > > client side with the filter, and there are 2 regions. First the
> > > > scanner is opened for region1 and the scan proceeds; it ensures
> > > > that at most 100 rows are retrieved from that region. But when the
> > > > region boundary is crossed and the client automatically opens a
> > > > scanner for region2, the filter is passed there too with a max of
> > > > 100 rows, so up to 100 more rows can come from there as well. So
> > > > overall, at the client side, we cannot guarantee that the scan will
> > > > return only 100 rows as a whole from the table.
> > > >
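A minimal Java simulation of this per-region behaviour (not HBase's actual PageFilter code, just a sketch of its counting logic) shows why the whole-table result can exceed the page size:

```java
import java.util.*;

public class PageFilterDemo {
    // Mimics a page filter's row counter, which lives server-side.
    // A fresh counter is created for every region, so the count never
    // spans region boundaries.
    static List<String> scanRegion(List<String> regionRows, int pageSize) {
        int count = 0;                       // per-region counter
        List<String> out = new ArrayList<>();
        for (String row : regionRows) {
            if (count++ >= pageSize) break;  // stop scanning this region
            out.add(row);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> region1 = new ArrayList<>(), region2 = new ArrayList<>();
        for (int i = 1; i <= 150; i++) region1.add("a" + i);
        for (int i = 1; i <= 150; i++) region2.add("b" + i);

        List<String> results = new ArrayList<>();
        results.addAll(scanRegion(region1, 100));  // filter re-instantiated
        results.addAll(scanRegion(region2, 100));  // per region
        System.out.println(results.size());        // prints 200, not 100
        // The client must still trim the combined result to one page.
    }
}
```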
> > >
> > > I agree with other people on this email chain that the 2nd region
> > > should only return (100 - no. of rows returned by Region1), if
> > > possible.
> > >
> > > When the region boundary is crossed and the client automatically
> > > opens a scanner for region2, why doesn't the scanner in Region2 know
> > > that some of the rows were already fetched by Region1? Do you mean to
> > > say that by default,

Thanks & Regards,
Anil Gupta