HBase >> mail # user >> Pagination with HBase - getting previous page of data


Re: Pagination with HBase - getting previous page of data
Hi Anoop,

So does it mean the scanner can send back LIMIT*2-1 lines max? Reading
100 rows from the 2nd region uses extra time and resources. Why not
ask for only the number of missing lines?

JM

2013/1/30, Anoop Sam John <[EMAIL PROTECTED]>:
> @Anil
>
>> I could not understand why it goes to multiple regionservers in
>> parallel. Why can it not guarantee results <= page size (my guess: due to
>> multiple RS scans)? If you have used it then maybe you can explain the
>> behaviour?
>
> A scan from the client side never goes to multiple RSs in parallel. A scan
> through the HTable API is sequential, one region after the other. For every
> region it will open a scanner in the RS and do next() calls. The filter is
> instantiated at the server side, per region ...
>
> When you need 100 rows in the page and you create a Scan at the client side
> with the filter, and suppose there are 2 regions: first the scanner is opened
> for region1 and the scan happens. It will ensure that max 100 rows will be
> retrieved from that region. But when the region boundary is crossed and the
> client automatically opens a scanner for region2, the filter is passed there
> too with max 100 rows, so from there also max 100 rows can come.. So
> overall, at the client side we cannot guarantee that the scan created will
> return only 100 rows as a whole from the table.
>
> I hope I am making it clear. I have not used PageFilter at all.. I am just
> explaining as per my knowledge of the scan flow and general filter usage.
>
> "This is because the filter is applied separately on different region
> servers. It does however optimize the scan of individual HRegions by making
> sure that the page size is never exceeded locally. "
>
> I guess it should say: "This is because the filter is applied
> separately on different regions".
>
> -Anoop-
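The scan flow Anoop describes can be modelled without a cluster. The sketch below is plain Java standing in for the HBase client and PageFilter (all class and method names here are hypothetical, not HBase API): each region independently caps its output at the page size, so a raw scan across regions can return up to pageSize rows per region, and only client-side counting restores the overall limit.

```java
import java.util.ArrayList;
import java.util.List;

// Cluster-free model of a sequential scan with a per-region page filter.
public class PageFilterModel {

    // What one region returns when PageFilter(pageSize) runs on it:
    // at most pageSize of that region's rows.
    static List<String> scanRegion(List<String> regionRows, int pageSize) {
        return regionRows.subList(0, Math.min(pageSize, regionRows.size()));
    }

    // A raw scan with no client-side counting: the filter is re-instantiated
    // per region, so EVERY region touched can contribute up to pageSize rows.
    static List<String> scanAllRegions(List<List<String>> regions, int pageSize) {
        List<String> out = new ArrayList<>();
        for (List<String> region : regions) {
            out.addAll(scanRegion(region, pageSize));
        }
        return out;
    }

    // The fix Anoop implies: the client also counts rows and stops the
    // scan itself once pageSize rows have been collected.
    static List<String> scanWithClientLimit(List<List<String>> regions, int pageSize) {
        List<String> out = new ArrayList<>();
        for (List<String> region : regions) {
            for (String row : scanRegion(region, pageSize)) {
                if (out.size() == pageSize) {
                    return out; // page is full; a real client would close the scanner
                }
                out.add(row);
            }
        }
        return out;
    }
}
```

With two regions of 150 rows each and a page size of 100, the raw scan yields 200 rows while the client-limited scan yields exactly 100, which is the behaviour the PageFilter javadoc warns about.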
>
> ________________________________________
> From: anil gupta [[EMAIL PROTECTED]]
> Sent: Wednesday, January 30, 2013 1:33 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Pagination with HBase - getting previous page of data
>
> Hi Mohammad,
>
> You are most welcome to join the discussion. I have never used PageFilter,
> so I don't really have concrete input.
> I had a look at
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PageFilter.html
> I could not understand why it goes to multiple regionservers in
> parallel. Why can it not guarantee results <= page size (my guess: due to
> multiple RS scans)? If you have used it then maybe you can explain the
> behaviour?
>
> Thanks,
> Anil
>
> On Tue, Jan 29, 2013 at 7:32 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>
>> I'm kinda hesitant to put my leg in between the pros ;) But does it sound
>> sane to use PageFilter for both rows and columns, with some additional
>> logic to handle the 'nth' page? It'll help us with both kinds of paging.
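The "nth page" logic Tariq hints at is usually built on a bookmark: remember the last row key served and start the next scan strictly after it, instead of re-reading from the start of the table. A minimal, cluster-free sketch (the nextPage helper is hypothetical; in real code the loop would be a Scan with a start row plus a PageFilter):

```java
import java.util.ArrayList;
import java.util.List;

// Bookmark-based forward paging over sorted row keys.
public class NextPage {

    // Return the page of rows that follows afterKey ("" means start of table).
    static List<String> nextPage(List<String> sortedKeys, String afterKey, int pageSize) {
        List<String> out = new ArrayList<>();
        for (String key : sortedKeys) {
            if (key.compareTo(afterKey) <= 0) {
                continue; // skip everything up to and including the bookmark
            }
            out.add(key);
            if (out.size() == pageSize) {
                break; // page is full
            }
        }
        return out;
    }
}
```

This gives cheap "next page" navigation; jumping straight to an arbitrary page X still needs something like the precomputed index JM describes below in the thread.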
>>
>> On Wednesday, January 30, 2013, Jean-Marc Spaggiari <
>> [EMAIL PROTECTED]>
>> wrote:
>> > Hi Anil,
>> >
>> > I think it really depends on the way you want to use the pagination.
>> >
>> > Do you need to be able to jump to page X? Are you ok if you miss a
>> > line or 2? Is your data growing fast? Or slowly? Is it ok if your
>> > page indexes are a day old? Do you need to paginate over 300 columns?
>> > Or just 1? Do you need to always have the exact same number of entries
>> > in each page?
>> >
>> > For my use case I need to be able to jump to page X and I don't
>> > have any content. I have hundreds of millions of lines. Only the rowkey
>> > matters to me, and I'm fine if sometimes I have 50 entries displayed
>> > and sometimes only 45. So I'm thinking about calculating which row is
>> > the first one for each page, and storing that separately. Then I just
>> > need to run the MR daily.
>> >
>> > It's not a perfect solution, I agree, but it might do the job for me.
>> > I'm totally open to any other idea which might do the job too.
>> >
>> > JM
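JM's page-index idea can be sketched in plain Java: a periodic job records the first row key of every page, and jumping to page X then becomes one short scan from the stored key. Here buildIndex and fetchPage are hypothetical stand-ins for the MR job and for a Scan with a start row; nothing below is HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Precomputed page index over sorted row keys, rebuilt periodically.
public class PageIndex {

    // The "MR job": record the starting row key of each page of size pageSize.
    static TreeMap<Integer, String> buildIndex(List<String> sortedKeys, int pageSize) {
        TreeMap<Integer, String> index = new TreeMap<>();
        for (int i = 0; i < sortedKeys.size(); i += pageSize) {
            index.put(i / pageSize, sortedKeys.get(i));
        }
        return index;
    }

    // Jump straight to page X: start at the indexed key and take pageSize
    // rows. If rows were inserted since the index was built, a page may hold
    // a few more or fewer entries, which JM says is acceptable for his case.
    static List<String> fetchPage(List<String> sortedKeys,
                                  TreeMap<Integer, String> index,
                                  int page, int pageSize) {
        String startKey = index.get(page);
        List<String> out = new ArrayList<>();
        int i = sortedKeys.indexOf(startKey); // stands in for setting the scan's start row
        for (; i >= 0 && i < sortedKeys.size() && out.size() < pageSize; i++) {
            out.add(sortedKeys.get(i));
        }
        return out;
    }
}
```

The trade-off is exactly the one JM names: page boundaries are only as fresh as the last index rebuild, in exchange for constant-cost jumps to any page.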
>> >
>> > 2013/1/29, anil gupta <[EMAIL PROTECTED]>: