HBase, mail # user - Schema design for filters


Re: Schema design for filters
Michael Segel 2013-06-27, 22:58
Ok...

If you want to do type checking and schema enforcement...

You will need to do this as a coprocessor.

The quick and dirty way (not recommended) would be to hard-code the schema into the coprocessor code.

A better way: use ZK (ZooKeeper) to manage the set of known table schemas and load them at startup. Each schema would be a map of column qualifier to data type.
(If the column holds JSON, you need to do a separate lookup to get the record's schema.)

Then write a single Java class that does the lookup and handles the comparators for the known data types.
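
A minimal sketch of such a lookup class, under some loud assumptions: the names TableSchema and ColumnType are hypothetical, the map is filled in by hand where the design above would load it from ZK, longs and doubles are assumed to be stored as 8 big-endian bytes, and strings are assumed to be UTF-8. It uses only the JDK; wiring it into a coprocessor or filter is left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: one table's schema as a map of column qualifier to data type. */
final class TableSchema {

    /** Illustrative subset of the data types the lookup class knows how to compare. */
    enum ColumnType { LONG, DOUBLE, STRING }

    private final Map<String, ColumnType> typesByQualifier = new HashMap<>();

    /** Registers a column. In the design above this map would be loaded from ZK at startup. */
    TableSchema column(String qualifier, ColumnType type) {
        typesByQualifier.put(qualifier, type);
        return this;
    }

    /** Returns the declared type, or null when the qualifier is not part of the schema. */
    ColumnType typeOf(String qualifier) {
        return typesByQualifier.get(qualifier);
    }

    /**
     * Compares a stored cell value with an expected value, using the comparison
     * that matches the column's declared type instead of raw byte order.
     */
    int compare(String qualifier, byte[] cell, byte[] expected) {
        ColumnType type = typeOf(qualifier);
        if (type == null) {
            throw new IllegalArgumentException("Unknown column qualifier: " + qualifier);
        }
        switch (type) {
            case LONG:
                // Assumption: 8-byte big-endian encoding.
                return Long.compare(ByteBuffer.wrap(cell).getLong(),
                                    ByteBuffer.wrap(expected).getLong());
            case DOUBLE:
                // Assumption: 8-byte big-endian IEEE 754 encoding.
                return Double.compare(ByteBuffer.wrap(cell).getDouble(),
                                      ByteBuffer.wrap(expected).getDouble());
            case STRING:
                // Assumption: UTF-8 encoded text.
                return new String(cell, StandardCharsets.UTF_8)
                        .compareTo(new String(expected, StandardCharsets.UTF_8));
            default:
                throw new IllegalStateException("No comparator for type: " + type);
        }
    }
}
```

The point of dispatching on the declared type is that the typed comparison can disagree with plain byte order: as longs, -1 sorts before 5, while an unsigned byte-by-byte comparison of the same two encodings puts -1 last.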

Does this make sense?
(Sorry, I was kinda thinking this out as I typed the response. But it should work.)

At least it would be a design approach I would take. YMMV

Having said that, I expect someone to say it's a bad idea and that they have a better solution.

HTH

-Mike

On Jun 27, 2013, at 5:13 PM, Kristoffer Sjögren <[EMAIL PROTECTED]> wrote:

> I see your point. Everything is just bytes.
>
> However, the schema is known and every row is formatted according to this
> schema, although some columns may not exist, that is, no value exists for
> this property on this row.
>
> So if I'm able to apply these "typed comparators" to the right cell values
> it may be possible? But I can't find a filter that targets specific columns?
>
> Seems like all filters scan every column/qualifier and there is no way of
> knowing what column is currently being evaluated?
>
>
> On Thu, Jun 27, 2013 at 11:51 PM, Michael Segel
> <[EMAIL PROTECTED]> wrote:
>
>> You have to remember that HBase doesn't enforce any sort of typing.
>> That's why this can be difficult.
>>
>> You'd have to write a coprocessor to enforce a schema on a table.
>> Even then YMMV if you're writing JSON structures to a column because while
>> the contents of the structures could be the same, the actual strings could
>> differ.
>>
>> HTH
>>
>> -Mike
>>
>> On Jun 27, 2013, at 4:41 PM, Kristoffer Sjögren <[EMAIL PROTECTED]> wrote:
>>
>>> I realize standard comparators cannot solve this.
>>>
>>> However I do know the type of each column so writing custom list
>>> comparators for boolean, char, byte, short, int, long, float, double
>>> seems quite straightforward.
>>>
>>> Long arrays, for example, are stored as a byte array with 8 bytes per
>>> item so a comparator might look like this.
>>>
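>>> public class LongsComparator extends WritableByteArrayComparable {
>>>   // Value to search for; constructor and Writable serialization omitted.
>>>   private long val;
>>>
>>>   // Returns 0 if any long in the array equals val, otherwise 1.
>>>   public int compareTo(byte[] value, int offset, int length) {
>>>       long[] values = BytesUtils.toLongs(value, offset, length);
>>>       for (long longValue : values) {
>>>           if (longValue == val) {
>>>               return 0;
>>>           }
>>>       }
>>>       return 1;
>>>   }
>>> }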
>>>
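>>> public static long[] toLongs(byte[] value, int offset, int length) {
>>>   // length is already the byte count of the region, so do not subtract offset.
>>>   int num = length / 8;
>>>   long[] values = new long[num];
>>>   for (int i = 0; i < num; i++) {
>>>       // Index the result from 0 and read each long relative to offset.
>>>       values[i] = getLong(value, offset + i * 8);
>>>   }
>>>   return values;
>>> }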
>>>
>>>
>>> Strings are similar but would require charset and length for each string.
>>>
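>>> public class StringsComparator extends WritableByteArrayComparable {
>>>   // Value to search for; constructor and Writable serialization omitted.
>>>   private String val;
>>>
>>>   // Returns 0 if any string in the array equals val, otherwise 1.
>>>   public int compareTo(byte[] value, int offset, int length) {
>>>       String[] values = BytesUtils.toStrings(value, offset, length);
>>>       for (String stringValue : values) {
>>>           if (val.equals(stringValue)) {
>>>               return 0;
>>>           }
>>>       }
>>>       return 1;
>>>   }
>>> }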
>>>
>>> public static String[] toStrings(byte[] value, int offset, int length) {
>>>   ArrayList<String> values = new ArrayList<String>();
>>>   int idx = 0;
>>>   ByteBuffer buffer = ByteBuffer.wrap(value, offset, length);
>>>   while (idx < length) {
>>>       int size = buffer.getInt();
>>>       byte[] bytes = new byte[size];
>>>       buffer.get(bytes);
>>>       values.add(new String(bytes));
>>>       idx += 4 + size;
>>>   }
>>>   return values.toArray(new String[values.size()]);
>>> }
>>>
>>>
>>> Am I on the right track or maybe overlooking some implementation details?
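
For reference, the two array layouts described in the quoted message (8 bytes per long; a 4-byte length prefix before each string) can be decoded with the JDK alone. This is only a sketch under stated assumptions: ArrayCellCodec is a hypothetical name, values are assumed big-endian, and the charset is passed in explicitly because the encoding does not record it. It reads relative to offset, which matters when the cell value sits inside a larger backing array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for decoding array-valued cells; not part of HBase. */
final class ArrayCellCodec {
    private ArrayCellCodec() {}

    /** Decodes length bytes starting at offset as consecutive 8-byte big-endian longs. */
    static long[] toLongs(byte[] value, int offset, int length) {
        if (length % 8 != 0) {
            throw new IllegalArgumentException("Length is not a multiple of 8: " + length);
        }
        // The buffer's position starts at offset and its limit is offset + length.
        ByteBuffer buffer = ByteBuffer.wrap(value, offset, length);
        long[] values = new long[length / 8];
        for (int i = 0; i < values.length; i++) {
            values[i] = buffer.getLong();
        }
        return values;
    }

    /** Decodes a sequence of strings, each stored as a 4-byte size followed by that many bytes. */
    static String[] toStrings(byte[] value, int offset, int length, Charset charset) {
        ByteBuffer buffer = ByteBuffer.wrap(value, offset, length);
        List<String> values = new ArrayList<>();
        while (buffer.hasRemaining()) {
            int size = buffer.getInt();
            byte[] bytes = new byte[size];
            buffer.get(bytes);
            // Decode with the caller's charset instead of the platform default.
            values.add(new String(bytes, charset));
        }
        return values.toArray(new String[0]);
    }

    /** Returns true if any decoded long in the region equals wanted. */
    static boolean containsLong(byte[] value, int offset, int length, long wanted) {
        for (long candidate : toLongs(value, offset, length)) {
            if (candidate == wanted) {
                return true;
            }
        }
        return false;
    }
}
```

A comparator in the style of LongsComparator could then return 0 when containsLong is true and 1 otherwise. Malformed input (a truncated size prefix or a size larger than the remaining bytes) surfaces as a BufferUnderflowException from ByteBuffer instead of being handled here.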