HBase >> mail # dev >> [potential bug]Find rows which do not have any of the given columns


Re: [potential bug]Find rows which do not have any of the given columns
Zahoor,

Thank you for the input. I still feel it is counterintuitive.

-Shrijeet
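
The interaction debated in this thread can be illustrated with a toy model. The sketch below does not use the HBase API: the class TimeRangeSkipDemo, its Cell type, and the scan and sampleCells helpers are hypothetical names made up for illustration. It assumes the time range is applied first, as a half-open interval [min, max), and that a SkipFilter-style check then drops any row that still has a visible cell in an unwanted column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class TimeRangeSkipDemo {

    /** Hypothetical stand-in for a stored cell: row key, column qualifier, timestamp. */
    static final class Cell {
        final String row;
        final String qualifier;
        final long timestamp;

        Cell(String row, String qualifier, long timestamp) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    /**
     * Toy scan: cells outside [minTs, maxTs) are never seen by the filter.
     * A row is returned only if every cell that IS seen has a wanted qualifier.
     */
    static List<String> scan(List<Cell> cells, Set<String> unwanted, long minTs, long maxTs) {
        Map<String, Boolean> rowPasses = new TreeMap<String, Boolean>();
        for (Cell c : cells) {
            if (c.timestamp < minTs || c.timestamp >= maxTs) {
                continue; // assumption: the time range qualifies cells before any filter runs
            }
            boolean cellPasses = !unwanted.contains(c.qualifier);
            // One failing cell marks the whole row as skipped.
            rowPasses.merge(c.row, cellPasses, Boolean::logicalAnd);
        }
        List<String> rows = new ArrayList<String>();
        for (Map.Entry<String, Boolean> e : rowPasses.entrySet()) {
            if (e.getValue()) {
                rows.add(e.getKey());
            }
        }
        return rows;
    }

    /** row1 has an old cell in unwanted column "x"; row2 has no such cell. */
    static List<Cell> sampleCells() {
        return Arrays.asList(
            new Cell("row1", "x", 10L),
            new Cell("row1", "a", 100L),
            new Cell("row2", "a", 100L));
    }

    public static void main(String[] args) {
        Set<String> unwanted = new HashSet<String>(Arrays.asList("x"));
        // Filter only: row1 is skipped because its "x" cell is visible.
        System.out.println(scan(sampleCells(), unwanted, 0L, Long.MAX_VALUE)); // [row2]
        // Filter plus time range: the "x" cell is out of range, so row1 comes back.
        System.out.println(scan(sampleCells(), unwanted, 50L, 200L)); // [row1, row2]
    }
}
```

In this model the second scan returns row1 even though the row does contain an unwanted column. That is the result reported as surprising, and it follows directly if the time range is treated as orthogonal to the filter.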

On Tue, Aug 7, 2012 at 2:57 AM, J Mohamed Zahoor <[EMAIL PROTECTED]> wrote:
> Hi
>
> Nice one. But I think this is valid behavior.
> Time ranges qualify certain rows to be made available
> to the client (something related to MVCC).
> Once certain rows are qualified, the filters are applied to them.
>
> The fact that both can be set simultaneously on a "Scan" object hints that
> they are orthogonal.
>
> ./zahoor
>
> On Tue, Aug 7, 2012 at 2:10 AM, Shrijeet Paliwal <[EMAIL PROTECTED]> wrote:
>
>> - user
>> +dev
>>
>> Hi Devs,
>>
>> Please follow the discussion to get full context. tl;dr: "Did a scan with
>> time range and filters, scan output was incorrect. Repeated the scan with
>> the filter only, scan output was correct."
>>
>> HBase version : 0.90.3
>> Hadoop : CDH3u0
>> Issues:
>> A scan set with both a time range and a filter can behave in
>> an unintuitive way. I am calling it unintuitive instead of wrong, since I do not
>> know if this is a known limitation of scan. Picture a filter setup like
>> mine - "Filter rows which have cells pertaining to certain columns". This
>> filter is set on a scan which has a time range constraint as well. AFAIK
>> we skip HFiles based on metadata when dealing with time ranges. Suppose a region
>> has two HFiles, and one of them has cells for the unwanted columns but the
>> other does not. We may get an incorrect result depending on how the time
>> range is set (if the time range scan optimizer skips the HFile containing
>> the unwanted cells).
>>
>> Does this sound like a valid issue? I can also see this happening with more
>> than one kind of SkipFilter.
>>
>> -Shrijeet
>>
>>
>> On Mon, Aug 6, 2012 at 11:38 AM, Shrijeet Paliwal
>> <[EMAIL PROTECTED]> wrote:
>>
>> > It seems setting the time range is the problem. I was doing
>> > scan.setTimeRange(Long.valueOf(args[4]), Long.valueOf(args[5]));
>> >
>> > I was working on the assumption that filter logic works before scan logic; in
>> > other words, a KV dropped by a filter will not make it to the scan. In the case of
>> > a time range this might not be true.
>> >
>> > -Shrijeet
>> >
>> >
>> > On Mon, Aug 6, 2012 at 9:25 AM, jmozah <[EMAIL PROTECTED]> wrote:
>> >
>> >> Hmmm.. Missed it. Otherwise I don't spot anything wrong in this.
>> >> Are you sure about the column names?
>> >>
>> >> ./zahoor
>> >>
>> >>
>> >> On 06-Aug-2012, at 9:34 PM, Shrijeet Paliwal <[EMAIL PROTECTED]>
>> >> wrote:
>> >>
>> >> > I am using FilterList. Could you elaborate?
>> >> >
>> >> > On Mon, Aug 6, 2012 at 8:48 AM, jmozah <[EMAIL PROTECTED]> wrote:
>> >> >
>> >> >>
>> >> >>
>> >> >> Use FilterList instead of List of Filters.
>> >> >>
>> >> >> ./Zahoor
>> >> >>
>> >> >> On 06-Aug-2012, at 12:12 PM, Shrijeet Paliwal <[EMAIL PROTECTED]>
>> >> >> wrote:
>> >> >>
>> >> >>> Hi All,
>> >> >>>
>> >> >>> I am writing a job which finds rows that do not have a cell corresponding
>> >> >>> to any of the columns in the given set of columns.
>> >> >>> This is how I have configured my scan (a combination of QualifierFilters
>> >> >>> and SkipFilter)
>> >> >>>
>> >> >>>   // columns is a csv containing column names
>> >> >>>   columnsSet = Splitter.on(',').split(columns);
>> >> >>>   List<Filter> qualifierFilters = new ArrayList<Filter>();
>> >> >>>   for (String qual : columnsSet) {
>> >> >>>     qualifierFilters.add(new QualifierFilter(CompareOp.NOT_EQUAL,
>> >> >>>         new BinaryComparator(Bytes.toBytes(qual))));
>> >> >>>   }
>> >> >>>   Filter skipFilter = new SkipFilter(
>> >> >>>       new FilterList(Operator.MUST_PASS_ALL, qualifierFilters));
>> >> >>>   Scan scan = new Scan();
>> >> >>>   scan.addFamily(Bytes.toBytes(family));
>> >> >>>   scan.setCacheBlocks(false);
>> >> >>>   scan.setCaching(1000);
>> >> >>>   scan.setFilter(skipFilter);
>> >> >>>   scan.setTimeRange(Long.valueOf(args[4]), Long.valueOf(args[5]));
>> >> >>>
>> >> >>> In my test table the scan worked as expected. But in production