HBase >> mail # dev >> [potential bug]Find rows which do not have any of the given columns


[potential bug]Find rows which do not have any of the given columns
- user
+dev

Hi Devs,

Please follow the discussion below to get the full context. tl;dr: "Did a scan
with a time range and filters; scan o/p was incorrect. Repeated the scan with
the filter only; scan o/p was correct."

HBase version : 0.90.3
Hadoop : CDH3u0
Issue:
A scan that is set with both a time range and a filter can behave in an
unintuitive way. I am calling it unintuitive instead of wrong, since I do not
know if this is a known limitation of scan. Picture a filter setup like
mine: "Filter out rows which have cells pertaining to certain columns". This
filter is set on a scan which has a time range constraint as well. AFAIK
we skip HFiles based on their metadata when dealing with time ranges. Suppose
a region has two HFiles, and one of them has cells for the unwanted columns
but the other one does not. We may then get an incorrect result depending on
how the time range is set (if the time range scan optimizer skips the HFile
containing the unwanted cells).
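
To make the scenario concrete, here is a toy model in plain Java. It does not
use HBase at all: the Cell class, the rowsWithoutColumns helper and the sample
data are made up for illustration. It only assumes what is suspected above,
namely that cells outside the time range are dropped before the per-row check
ever sees them.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class TimeRangeSkipFilterSketch {

    /** Toy stand-in for a KeyValue: row key, column qualifier, timestamp. */
    static final class Cell {
        final String row;
        final String qualifier;
        final long ts;

        Cell(String row, String qualifier, long ts) {
            this.row = row;
            this.qualifier = qualifier;
            this.ts = ts;
        }
    }

    /**
     * Returns the rows in which no visible cell carries an unwanted qualifier.
     * ASSUMPTION: cells outside [minTs, maxTs) are removed first, so the row
     * check never sees them (this models the suspected behaviour only).
     */
    static Set<String> rowsWithoutColumns(List<Cell> cells, Set<String> unwanted,
                                          long minTs, long maxTs) {
        Set<String> seen = new TreeSet<String>();
        Set<String> rejected = new HashSet<String>();
        for (Cell c : cells) {
            if (c.ts < minTs || c.ts >= maxTs) {
                continue; // time range applied before the filter logic
            }
            seen.add(c.row);
            if (unwanted.contains(c.qualifier)) {
                rejected.add(c.row); // one bad cell rejects the whole row
            }
        }
        seen.removeAll(rejected);
        return seen;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
                new Cell("row1", "a", 100L),  // older HFile: unwanted column "a"
                new Cell("row1", "b", 200L),  // newer HFile
                new Cell("row2", "b", 200L));
        Set<String> unwanted = new HashSet<String>(Arrays.asList("a"));

        // No time restriction: row1 is rejected because of its "a" cell.
        System.out.println(rowsWithoutColumns(cells, unwanted, 0L, Long.MAX_VALUE)); // [row2]

        // Time range excludes the "a" cell: row1 now slips through.
        System.out.println(rowsWithoutColumns(cells, unwanted, 150L, 300L)); // [row1, row2]
    }
}
```

In the second call row1 is returned even though it does have a cell for
column "a", which matches what I saw in production.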

Does this sound like a valid issue? I can also see this happening with more
than one kind of filter wrapped in a SkipFilter.

-Shrijeet
On Mon, Aug 6, 2012 at 11:38 AM, Shrijeet Paliwal
<[EMAIL PROTECTED]>wrote:

> It seems setting the time range is the problem. I was doing
> scan.setTimeRange(Long.valueOf(args[4]), Long.valueOf(args[5]));
>
> I was working on the assumption that filter logic runs before scan logic, in
> other words a KV dropped by the filter will not make it to the scan. In the
> case of a time range this might not be true.
>
> -Shrijeet
>
>
> On Mon, Aug 6, 2012 at 9:25 AM, jmozah <[EMAIL PROTECTED]> wrote:
>
>> Hmmm.. Missed it. Otherwise i dont spot anything wrong in this.
>> are you sure about the column names?
>>
>> ./zahoor
>>
>>
>> On 06-Aug-2012, at 9:34 PM, Shrijeet Paliwal <[EMAIL PROTECTED]>
>> wrote:
>>
>> > I am using FilterList. Could you elaborate?
>> >
>> > On Mon, Aug 6, 2012 at 8:48 AM, jmozah <[EMAIL PROTECTED]> wrote:
>> >
>> >>
>> >>
>> >> Use FilterList instead of List of Filters.
>> >>
>> >> ./Zahoor
>> >>
>> >> On 06-Aug-2012, at 12:12 PM, Shrijeet Paliwal <[EMAIL PROTECTED]>
>> >> wrote:
>> >>
>> >>> Hi All,
>> >>>
>> >>> I am writing a job which finds rows that do not have a cell corresponding
>> >>> to any of the columns in the given set of columns.
>> >>> This is how I have configured my scan (a combination of QualifierFilters
>> >>> and a SkipFilter):
>> >>>
>> >>>   // columns is a CSV string containing the column names
>> >>>   Iterable<String> columnsSet = Splitter.on(',').split(columns);
>> >>>   List<Filter> qualifierFilters = new ArrayList<Filter>();
>> >>>   for (String qual : columnsSet) {
>> >>>     qualifierFilters.add(new QualifierFilter(CompareOp.NOT_EQUAL,
>> >>>         new BinaryComparator(Bytes.toBytes(qual))));
>> >>>   }
>> >>>   Filter skipFilter = new SkipFilter(
>> >>>       new FilterList(Operator.MUST_PASS_ALL, qualifierFilters));
>> >>>   Scan scan = new Scan();
>> >>>   scan.addFamily(Bytes.toBytes(family));
>> >>>   scan.setCacheBlocks(false);
>> >>>   scan.setCaching(1000);
>> >>>   scan.setFilter(skipFilter);
>> >>>   scan.setTimeRange(Long.valueOf(args[4]), Long.valueOf(args[5]));
>> >>>
>> >>> In my test table the scan worked as expected. But in the production run, I
>> >>> got rows which had cells containing one of the given qualifiers (not
>> >>> expected). Can someone help me spot the mistake?
>> >>>
>> >>> -Shrijeet
>> >>
>> >>
>>
>>
>