HBase, mail # dev - [potential bug]Find rows which do not have any of the given columns


Shrijeet Paliwal 2012-08-06, 20:40
J Mohamed Zahoor 2012-08-07, 09:57
Re: [potential bug]Find rows which do not have any of the given columns
Shrijeet Paliwal 2012-08-07, 16:17
Zahoor,

Thank you for the input. I still feel it is counterintuitive.

-Shrijeet
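
For readers following the thread, here is a minimal, self-contained sketch of the ordering described in the quoted messages: the time range decides which cells are visible, and only then is the "skip rows that have any unwanted column" logic applied. This is a simplified in-memory model, not HBase code. The `Cell` class, the `rowsWithoutColumns` helper, and the sample data are all hypothetical and exist only for illustration. The half-open range `[minTs, maxTs)` mirrors how `Scan.setTimeRange` treats its bounds.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class TimeRangeThenSkipFilter {

    // Hypothetical stand-in for an HBase KeyValue: row key, column qualifier, timestamp.
    static final class Cell {
        final String row;
        final String qualifier;
        final long ts;

        Cell(String row, String qualifier, long ts) {
            this.row = row;
            this.qualifier = qualifier;
            this.ts = ts;
        }
    }

    // Returns the rows that have no visible cell in any of the unwanted columns.
    // Cells outside [minTs, maxTs) are dropped BEFORE the column check, which is
    // the assumption this sketch makes about the order of evaluation.
    static Set<String> rowsWithoutColumns(List<Cell> cells, Set<String> unwanted,
                                          long minTs, long maxTs) {
        Set<String> seen = new TreeSet<>();
        Set<String> skipped = new TreeSet<>();
        for (Cell c : cells) {
            if (c.ts < minTs || c.ts >= maxTs) {
                continue; // time range applied first: this cell is never shown to the filter
            }
            seen.add(c.row);
            if (unwanted.contains(c.qualifier)) {
                skipped.add(c.row); // one unwanted cell is enough to skip the whole row
            }
        }
        seen.removeAll(skipped);
        return seen;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
                new Cell("row1", "a", 100L),  // unwanted column, older timestamp
                new Cell("row1", "b", 200L),
                new Cell("row2", "b", 200L));
        Set<String> unwanted = Collections.singleton("a");

        // Filter only (time range covers everything): row1 is skipped.
        System.out.println(rowsWithoutColumns(cells, unwanted, 0L, Long.MAX_VALUE)); // [row2]

        // Filter plus a time range that hides the old "a" cell: row1 now passes.
        System.out.println(rowsWithoutColumns(cells, unwanted, 150L, 300L)); // [row1, row2]
    }
}
```

Under this model the second result is "correct" with respect to the cells inside the time range, yet row1 does have a cell in column "a" overall, which is the surprise reported in this thread.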

On Tue, Aug 7, 2012 at 2:57 AM, J Mohamed Zahoor <[EMAIL PROTECTED]> wrote:
> Hi
>
> Nice one. But I think this is valid behavior.
> Time ranges are something which qualifies certain rows to be made available
> to the client (something which is related to MVCC).
> Once certain rows are qualified, the filters are applied to them.
>
> The fact that both can be set simultaneously on a "Scan" object hints that
> they are orthogonal.
>
> ./zahoor
>
> On Tue, Aug 7, 2012 at 2:10 AM, Shrijeet Paliwal <[EMAIL PROTECTED]> wrote:
>
>> - user
>> +dev
>>
>> Hi Devs,
>>
>> Please follow the discussion to get full context. tl;dr "Did a scan with
>> time range and filters, scan output was incorrect. Repeated scan with filter
>> only, scan output was correct."
>>
>> HBase version : 0.90.3
>> Hadoop : CDH3u0
>> Issues:
>> A scan that is set with both a time range and a filter can behave in
>> an unintuitive way. I am calling it unintuitive instead of wrong, since I do
>> not know if this is a known limitation of scan. Picture a filter setup like
>> mine - "Filter rows which have cells pertaining to certain columns". This
>> filter is set on a scan which has a time range constraint as well. AFAIK
>> we skip HFiles based on metadata when dealing with time ranges. Suppose a
>> region has two HFiles: one has cells for the unwanted columns and the other
>> does not. We may get an incorrect result depending on how the time range is
>> set (if the time range scan optimizer skips the HFile containing the
>> unwanted cells).
>>
>> Does this sound like a valid issue? Also, I can see this happening with
>> more than one kind of SkipFilter.
>>
>> -Shrijeet
>>
>>
>> On Mon, Aug 6, 2012 at 11:38 AM, Shrijeet Paliwal
>> <[EMAIL PROTECTED]> wrote:
>>
>> > It seems setting the time range is the problem. I was doing:
>> > scan.setTimeRange(Long.valueOf(args[4]), Long.valueOf(args[5]));
>> >
>> > I was working on the assumption that filter logic runs before scan logic;
>> > in other words, a KV dropped by a filter will not make it to the scan. In
>> > the case of a time range this might not be true.
>> >
>> > -Shrijeet
>> >
>> >
>> > On Mon, Aug 6, 2012 at 9:25 AM, jmozah <[EMAIL PROTECTED]> wrote:
>> >
>> >> Hmmm.. Missed it. Otherwise I don't spot anything wrong in this.
>> >> Are you sure about the column names?
>> >>
>> >> ./zahoor
>> >>
>> >>
>> >> On 06-Aug-2012, at 9:34 PM, Shrijeet Paliwal <[EMAIL PROTECTED]>
>> >> wrote:
>> >>
>> >> > I am using FilterList. Could you elaborate?
>> >> >
>> >> > On Mon, Aug 6, 2012 at 8:48 AM, jmozah <[EMAIL PROTECTED]> wrote:
>> >> >
>> >> >>
>> >> >>
>> >> >> Use FilterList instead of List of Filters.
>> >> >>
>> >> >> ./Zahoor
>> >> >>
>> >> >> On 06-Aug-2012, at 12:12 PM, Shrijeet Paliwal <[EMAIL PROTECTED]>
>> >> >> wrote:
>> >> >>
>> >> >>> Hi All,
>> >> >>>
>> >> >>> I am writing a job which finds rows that do not have a cell
>> >> >>> corresponding to any of the columns in the given set of columns.
>> >> >>> This is how I have configured my scan (a combination of
>> >> >>> QualifierFilters and SkipFilter):
>> >> >>>
>> >> >>>   // columns is a CSV string containing the column names
>> >> >>>   columnsSet = Splitter.on(',').split(columns);
>> >> >>>   List<Filter> qualifierFilters = new ArrayList<Filter>();
>> >> >>>   for (String qual : columnsSet) {
>> >> >>>     qualifierFilters.add(new QualifierFilter(CompareOp.NOT_EQUAL,
>> >> >>>         new BinaryComparator(Bytes.toBytes(qual))));
>> >> >>>   }
>> >> >>>   Filter skipFilter = new SkipFilter(new
>> >> >>> FilterList(Operator.MUST_PASS_ALL, qualifierFilters));
>> >> >>>   Scan scan = new Scan();
>> >> >>>   scan.addFamily(Bytes.toBytes(family));
>> >> >>>   scan.setCacheBlocks(false);
>> >> >>>   scan.setCaching(1000);
>> >> >>>   scan.setFilter(skipFilter);
>> >> >>>   scan.setTimeRange(Long.valueOf(args[4]), Long.valueOf(args[5]));
>> >> >>>
>> >> >>> In my test table the scan worked as expected. But in production