Re: Pagination with HBase - getting previous page of data
Toby Lazar 2013-01-30, 12:42
Sounds like if you had 1000 regions, each with 99 rows, and you asked
for 100, you'd get back 99,000. My guess is that a Filter is
serialized once, sent successively to each region, and not updated
between regions. I don't think changing that would be too easy.
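
A rough illustration of the effect (untested sketch; the table and
class names are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.PageFilter;

    public class PageFilterOverrun {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");
        Scan scan = new Scan();
        scan.setFilter(new PageFilter(100));  // limit is applied per region
        ResultScanner rs = table.getScanner(scan);
        int total = 0;
        for (Result r : rs) {
          total++;  // with 1000 regions of 99 rows each, this can reach 99,000
        }
        rs.close();
        table.close();
        System.out.println("rows returned: " + total);
      }
    }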

Toby

On 1/30/13, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:
> Hi Anoop,
>
> So does it mean the scanner can send back LIMIT*2-1 lines max? Reading
> 100 rows from the 2nd region uses extra time and resources. Why not
> ask for only the number of missing lines?
>
> JM
>
> 2013/1/30, Anoop Sam John <[EMAIL PROTECTED]>:
>> @Anil
>>
>>> I could not understand why it goes to multiple region servers in
>>> parallel. Why can it not guarantee results <= page size (my guess:
>>> due to multiple RS scans)? If you have used it then maybe you can
>>> explain the behaviour?
>>
>> A scan from the client side never goes to multiple RSs in parallel. A
>> scan through the HTable API is sequential, one region after the
>> other. For every region it will open a scanner on the RS and make
>> next() calls. The filter is instantiated at the server side, per
>> region ...
>>
>> Say you need 100 rows in the page and you create a Scan at the client
>> side with the filter, and suppose there are 2 regions. First the
>> scanner is opened for region1 and the scan happens; it ensures that
>> at most 100 rows are retrieved from that region. But when the region
>> boundary is crossed and the client automatically opens a scanner for
>> region2, the filter is passed there too with max 100 rows, so up to
>> 100 more rows can come from there as well. So overall, at the client
>> side, we cannot guarantee that the scan will return only 100 rows
>> from the table as a whole.
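>>
>> A rough sketch of the client-side cap I mean (untested; "mytable" is
>> just an example):
>>
>>     Scan scan = new Scan();
>>     scan.setFilter(new PageFilter(100));  // server caps per region only
>>     ResultScanner rs = new HTable(conf, "mytable").getScanner(scan);
>>     int count = 0;
>>     Result r;
>>     while (count < 100 && (r = rs.next()) != null) {
>>       // process r ...
>>       count++;  // the client itself enforces the real page size
>>     }
>>     rs.close();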
>>
>> I hope I am making it clear. I have not used PageFilter at all; I am
>> just explaining as per my knowledge of the scan flow and general
>> filter usage.
>>
>> "This is because the filter is applied separately on different region
>> servers. It does however optimize the scan of individual HRegions by
>> making
>> sure that the page size is never exceeded locally. "
>>
>> I guess it should say: "This is because the filter is applied
>> separately on different regions".
>>
>> -Anoop-
>>
>> ________________________________________
>> From: anil gupta [[EMAIL PROTECTED]]
>> Sent: Wednesday, January 30, 2013 1:33 PM
>> To: [EMAIL PROTECTED]
>> Subject: Re: Pagination with HBase - getting previous page of data
>>
>> Hi Mohammad,
>>
>> You are most welcome to join the discussion. I have never used
>> PageFilter, so I don't really have concrete input.
>> I had a look at
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PageFilter.html
>> I could not understand why it goes to multiple region servers in
>> parallel. Why can it not guarantee results <= page size (my guess:
>> due to multiple RS scans)? If you have used it then maybe you can
>> explain the behaviour?
>>
>> Thanks,
>> Anil
>>
>> On Tue, Jan 29, 2013 at 7:32 PM, Mohammad Tariq <[EMAIL PROTECTED]>
>> wrote:
>>
>>> I'm kinda hesitant to put my leg in between the pros ;) But does it
>>> sound sane to use PageFilter for both rows and columns, with some
>>> additional logic to handle the 'nth' page? It would help us with
>>> both kinds of paging.
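>>>
>>> Something like this, as a rough sketch (untested; pageSize,
>>> startRow, and the column limit/offset are illustrative):
>>>
>>>     // Page over rows and columns in one scan.
>>>     FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
>>>     filters.addFilter(new PageFilter(pageSize));           // rows, per region
>>>     filters.addFilter(new ColumnPaginationFilter(10, 0));  // 10 columns from offset 0
>>>     Scan scan = new Scan(startRow);  // first row of the page being fetched
>>>     scan.setFilter(filters);
>>>
>>> The client would still need to remember the last row key of each
>>> page and cap the number of rows it keeps.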
>>>
>>> On Wednesday, January 30, 2013, Jean-Marc Spaggiari <
>>> [EMAIL PROTECTED]>
>>> wrote:
>>> > Hi Anil,
>>> >
>>> > I think it really depends on the way you want to use the pagination.
>>> >
>>> > Do you need to be able to jump to page X? Are you ok if you miss a
>>> > line or two? Is your data growing quickly or slowly? Is it ok if
>>> > your page indexes are a day old? Do you need to paginate over 300
>>> > columns, or just 1? Do you need to always have the exact same
>>> > number of entries in each page?
>>> >
>>> > For my use case I need to be able to jump to page X and I don't
>>> > have any content. I have hundreds of millions of lines. Only the rowkey

Sent from my mobile device