Re: Pagination with HBase - getting previous page of data
>let's say for a scan setCaching is 10 and the scan spans two regions.
9 results (satisfying the filter) are in Region1 and 10 results (satisfying
the filter) are in Region2. Will this scan then return 19 (9+10) results?

@Anil.
No, it will return only 10 results, not 19. The client here takes into
account the number of results already retrieved from the previous region.
But a filter is different: a filter has no client-side logic; it is fully
executed on the server side. This is the way it is designed. Personally, I
would prefer to do the pagination in the app alone, using a plain scan with
caching (to avoid too many RPCs) and app-level logic.
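
A minimal sketch of that app-level approach (the table name and the
exclusive-start-row trick are assumptions for illustration): the app
remembers the last row key of the page it just showed and starts the next
scan right after it, enforcing the page size itself:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class AppLevelPagination {

  // Fetch one page of up to 'pageSize' rows, starting just after
  // 'lastRowOfPreviousPage' (pass null for the first page).
  public static List<Result> fetchPage(HTable table, byte[] lastRowOfPreviousPage,
      int pageSize) throws IOException {
    Scan scan = new Scan();
    if (lastRowOfPreviousPage != null) {
      // Exclusive start: append a zero byte to skip the last row we already have.
      scan.setStartRow(Bytes.add(lastRowOfPreviousPage, new byte[] { 0 }));
    }
    // Caching only controls how many rows come back per RPC, not the page size.
    scan.setCaching(pageSize);

    List<Result> page = new ArrayList<Result>(pageSize);
    ResultScanner scanner = table.getScanner(scan);
    try {
      Result r;
      // The app enforces the page boundary, no matter how many regions are scanned.
      while (page.size() < pageSize && (r = scanner.next()) != null) {
        page.add(r);
      }
    } finally {
      scanner.close();
    }
    return page;
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");  // hypothetical table name
    List<Result> page1 = fetchPage(table, null, 100);
    byte[] lastRow = page1.isEmpty() ? null : page1.get(page1.size() - 1).getRow();
    List<Result> page2 = fetchPage(table, lastRow, 100);  // next page
    table.close();
  }
}

For the "previous page" case in the subject, the app could likewise keep a
small stack of page-start row keys and re-scan from the remembered key.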

-Anoop-

On Sat, Feb 2, 2013 at 1:32 PM, anil gupta <[EMAIL PROTECTED]> wrote:

> Hi Anoop,
>
> Please find my reply inline.
>
> Thanks,
> Anil
>
> On Wed, Jan 30, 2013 at 3:31 AM, Anoop Sam John <[EMAIL PROTECTED]>
> wrote:
>
> > @Anil
> >
> > >I could not understand why it goes to multiple regionservers in parallel.
> > Why can it not guarantee results <= page size (my guess: due to multiple
> > RS scans)? If you have used it then maybe you can explain the behaviour?
> >
> > A scan from the client side never goes to multiple RSs in parallel. A scan
> > via the HTable API is sequential, one region after the other. For every
> > region it will open a scanner in the RS and make next() calls. The filter
> > is instantiated at the server side, per region...
> >
> > Suppose you need 100 rows in the page and you create a Scan at the client
> > side with the filter, and there are 2 regions. First the scanner is opened
> > for region1 and the scan happens there; it will ensure that at most 100
> > rows are retrieved from that region. But when the region boundary is
> > crossed and the client automatically opens a scanner for region2, the
> > filter is passed there too with max 100 rows, so from there as well up to
> > 100 rows can come. So overall, at the client side, we cannot guarantee
> > that the scan will return only 100 rows as a whole from the table.
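> >
> > A rough sketch of what I mean (the table name and setup here are just
> > assumptions for illustration): the PageFilter limits each region to 100
> > rows, so the client still has to stop at the overall page size itself:
> >
> > import java.io.IOException;
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.hbase.HBaseConfiguration;
> > import org.apache.hadoop.hbase.client.HTable;
> > import org.apache.hadoop.hbase.client.Result;
> > import org.apache.hadoop.hbase.client.ResultScanner;
> > import org.apache.hadoop.hbase.client.Scan;
> > import org.apache.hadoop.hbase.filter.PageFilter;
> >
> > public class PageFilterScan {
> >   public static void main(String[] args) throws IOException {
> >     final int pageSize = 100;
> >     Configuration conf = HBaseConfiguration.create();
> >     HTable table = new HTable(conf, "my_table");  // hypothetical table name
> >
> >     Scan scan = new Scan();
> >     // PageFilter is instantiated per region on the server side, so each
> >     // region can return up to 'pageSize' rows on its own.
> >     scan.setFilter(new PageFilter(pageSize));
> >
> >     int count = 0;
> >     ResultScanner scanner = table.getScanner(scan);
> >     try {
> >       for (Result r : scanner) {
> >         if (count >= pageSize) {
> >           break;  // client-side enforcement of the overall page size
> >         }
> >         count++;
> >         // process r ...
> >       }
> >     } finally {
> >       scanner.close();
> >       table.close();
> >     }
> >   }
> > }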
> >
>
> I agree with other people on this email chain that the 2nd region should
> only return (100 - no. of rows returned by Region1), if possible.
>
> When the region boundary is crossed and the client automatically opens a
> scanner for region2, why doesn't the scanner in Region2 know that some of
> the rows were already fetched by Region1? Do you mean to say that, by
> default, for a scan spanning multiple regions, every region has its own
> count of the number of rows it is going to return? I.e., let's say for a
> scan setCaching is 10 and the scan spans two regions. 9 results (satisfying
> the filter) are in Region1 and 10 results (satisfying the filter) are in
> Region2. Will this scan then return 19 (9+10) results?
>
> >
> > I think I am making it clear. I have not used PageFilter at all; I am
> > just explaining based on my knowledge of the scan flow and general filter
> > usage.
> >
> > "This is because the filter is applied separately on different region
> > servers. It does however optimize the scan of individual HRegions by
> making
> > sure that the page size is never exceeded locally. "
> >
> > I guess it need to be saying that   "This is because the filter is
> applied
> > separately on different regions".
> >
> > -Anoop-
> >
> > ________________________________________
> > From: anil gupta [[EMAIL PROTECTED]]
> > Sent: Wednesday, January 30, 2013 1:33 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: Pagination with HBase - getting previous page of data
> >
> > Hi Mohammad,
> >
> > You are most welcome to join the discussion. I have never used PageFilter,
> > so I don't really have concrete input.
> > I had a look at
> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PageFilter.html
> > I could not understand why it goes to multiple regionservers in parallel.
> > Why can it not guarantee results <= page size (my guess: due to multiple
> > RS scans)? If you have used it then maybe you can explain the behaviour?