HBase, mail # user - Client Get vs Coprocessor scan performance


Re: Client Get vs Coprocessor scan performance
James Taylor 2013-08-19, 15:34
Kiru,
Is the column qualifier for the key value storing the double different
for different rows? Not sure I understand what you're grouping over.
Maybe 5 rows' worth of sample input and expected output would help.
Thanks,
James
On Aug 19, 2013, at 1:37 AM, Kiru Pakkirisamy <[EMAIL PROTECTED]> wrote:

> James,
> I have only one family, "cp". Yes, that is how I store the Double. No, the doubles are always positive.
> The keys look like "A14568 ". There are fewer than a million of them, and I added the letter prefix to randomize them.
> I group them based on the C_ suffix and, say, order them by the Double (to simplify it).
> Is there a way to do a sort of "user-defined function" on a column? That would take care of my calculation on that double.
> Thanks again.
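
As a rough illustration of the per-column "user-defined function" Kiru describes, here is a minimal client-side sketch against the 0.94-era HBase API. The combine() function, class name, and suffix handling are hypothetical placeholders, not anything from the thread:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ColumnUdfSketch {
      // Group the C_* columns of one row by their suffix and apply a
      // user-defined function to each stored double.
      static Map<String, Double> applyUdf(Result row, Map<String, Double> input) {
        Map<String, Double> out = new HashMap<String, Double>();
        for (KeyValue kv : row.raw()) {                    // every cell in family "cp"
          String qualifier = Bytes.toString(kv.getQualifier());
          if (!qualifier.startsWith("C_")) {
            continue;
          }
          String suffix = qualifier.substring(2);          // "C_10345" -> "10345"
          double stored = Bytes.toDouble(kv.getValue());   // written via Bytes.toBytes(double)
          Double param = input.get(suffix);                // caller-supplied parameter
          if (param != null) {
            out.put(suffix, combine(stored, param));       // the "user-defined function"
          }
        }
        return out;
      }

      // Placeholder for the real calculation on the stored double.
      static double combine(double stored, double param) {
        return stored * param;
      }
    }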
>
> Regards,
> - kiru
>
>
> Kiru Pakkirisamy | webcloudtech.wordpress.com
>
>
> ________________________________
> From: James Taylor <[EMAIL PROTECTED]>
> To: Kiru Pakkirisamy <[EMAIL PROTECTED]>
> Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: Sunday, August 18, 2013 5:34 PM
> Subject: Re: Client Get vs Coprocessor scan performance
>
>
> Kiru,
> What's your column family name? Just to confirm, the column qualifier of
> your key value is C_10345 and this stores a value as a Double using
> Bytes.toBytes(double)? Are any of the Double values negative? Any other key
> values?
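
The question about negative values matters because Bytes.toBytes(double) stores the raw IEEE-754 bit pattern, which does not sort correctly under HBase's unsigned byte-wise comparison once negatives appear. A quick sketch of the caveat:

    import org.apache.hadoop.hbase.util.Bytes;

    public class DoubleEncodingCaveat {
      public static void main(String[] args) {
        byte[] pos = Bytes.toBytes(1.0d);    // 0x3FF0... (sign bit clear)
        byte[] neg = Bytes.toBytes(-1.0d);   // 0xBFF0... (sign bit set)

        // The round trip itself is fine for any double:
        System.out.println(Bytes.toDouble(pos));   // 1.0
        System.out.println(Bytes.toDouble(neg));   // -1.0

        // But under unsigned byte-wise comparison, negatives sort AFTER
        // positives, which breaks range scans and ordering over raw doubles:
        System.out.println(Bytes.compareTo(neg, pos) > 0);   // true
      }
    }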
>
> Can you give me an idea of the kind of fuzzy filtering you're doing on the
> 7 char row key? We may want to model that as a set of row key columns in
> Phoenix to leverage the skip scan more.
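
For context, a minimal sketch of fuzzy row-key matching with HBase's FuzzyRowFilter over 7-char keys like those in this thread. The pattern is invented for illustration: wildcard the leading randomizing letter and fix the remaining six bytes (in the mask, 0 means the byte must match, 1 means it may be anything):

    import java.util.Arrays;

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
    import org.apache.hadoop.hbase.util.Pair;

    public class FuzzyScanSketch {
      public static Scan buildScan() {
        // 7-byte keys like "A14568 ": wildcard the leading random letter,
        // require an exact match on the remaining six bytes.
        byte[] fuzzyKey = {0, '1', '4', '5', '6', '8', ' '}; // byte at a masked position is ignored
        byte[] mask = {1, 0, 0, 0, 0, 0, 0};                 // 0 = fixed, 1 = any byte

        Scan scan = new Scan();
        scan.setFilter(new FuzzyRowFilter(
            Arrays.asList(new Pair<byte[], byte[]>(fuzzyKey, mask))));
        return scan;
      }
    }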
>
> How about I model your aggregation as an AVG over a group of rows? What
> would your GROUP BY expression look like? Are you grouping based on a part
> of the 7 char row key? Or on some other key value?
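
A hedged sketch of the kind of GROUP BY / AVG query being proposed, via Phoenix's JDBC driver. The table name my_table, columns pk and val, and the grouping slice of the row key are all hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    public class PhoenixGroupBySketch {
      public static void main(String[] args) throws Exception {
        // "localhost" stands in for the cluster's ZooKeeper quorum; the
        // Phoenix JDBC driver must be on the classpath.
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");

        // Group on a slice of the 7-char row key and average the double.
        ResultSet rs = conn.createStatement().executeQuery(
            "SELECT SUBSTR(pk, 2, 6) AS grp, AVG(val) " +
            "FROM my_table GROUP BY SUBSTR(pk, 2, 6)");
        while (rs.next()) {
          System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
        }
        conn.close();
      }
    }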
>
> Thanks,
> James
>
>
> On Sun, Aug 18, 2013 at 2:16 PM, Kiru Pakkirisamy <[EMAIL PROTECTED]> wrote:
>
>> James,
>> Rowkey: String, length 7
>> Column qualifier: String, variable length, looks like C_10345
>> Column value: Double
>>
>> If I can create a Phoenix schema mapping to this existing table that would
>> be great. I actually do a group by the column values and return another
>> value which is a function of the value and an input double value. Input is
>> a Map<String, Double> and return is also a Map<String, Double>.
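
For reference, Phoenix maps onto an existing HBase table through a CREATE TABLE whose name, family, and qualifiers match the stored data. A rough sketch under this thread's assumptions, with all identifiers hypothetical; UNSIGNED_DOUBLE is assumed available (its encoding matches Bytes.toBytes(double) for non-negative values, which fits since the doubles here are always positive):

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class MapExistingTableSketch {
      public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        // Quoted identifiers preserve the case of the existing table, family,
        // and qualifiers. One C_* qualifier is shown; each qualifier you want
        // to query needs an entry (or a dynamic column at query time).
        conn.createStatement().execute(
            "CREATE TABLE \"existing_table\" (" +
            "  \"pk\" VARCHAR(7) NOT NULL PRIMARY KEY," +  // the 7-char row key
            "  \"cp\".\"C_10345\" UNSIGNED_DOUBLE)");      // matches Bytes.toBytes(double), non-negative only
        conn.close();
      }
    }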
>>
>>
>> Regards,
>> - kiru
>>
>>
>> Kiru Pakkirisamy | webcloudtech.wordpress.com
>>
>> ________________________________
>> From: James Taylor <[EMAIL PROTECTED]>
>> To: [EMAIL PROTECTED]; Kiru Pakkirisamy <[EMAIL PROTECTED]>
>> Sent: Sunday, August 18, 2013 2:07 PM
>> Subject: Re: Client Get vs Coprocessor scan performance
>>
>> Kiru,
>> If you're able to post the key values, row key structure, and data types
>> you're using, I can post the Phoenix code to query against it. You're doing
>> some kind of aggregation too, right? If you could explain that part too,
>> that would be helpful. It's likely that you can just query the existing
>> HBase data you've already created on the same cluster you're already using
>> (provided you put the phoenix jar on all the region servers - use our 2.0.0
>> version that just came out). Might be interesting to compare the amount of
>> code necessary in each approach as well.
>> Thanks,
>> James
>>
>>
>> On Sun, Aug 18, 2013 at 12:16 PM, Kiru Pakkirisamy <[EMAIL PROTECTED]> wrote:
>>
>> James,
>> I am using the FuzzyRowFilter or the Gets within a Coprocessor. Looks
>> like I cannot use your SkipScanFilter by itself, as it has lots of Phoenix
>> imports. I thought of writing my own custom filter, and saw that the
>> FuzzyRowFilter in the 0.94 branch also has an implementation of
>> getNextKeyHint(); the catch is that it works well only with fixed-length
>> keys when I want a complete match of the keys. After padding my keys to
>> fixed length, it seems to be fine.
>> Once I confirm some key locality and other issues (like heap), I will try
>> to benchmark this table alone against Phoenix on another cluster. Thanks.
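
A minimal sketch of the fixed-length padding described above, so that FuzzyRowFilter's getNextKeyHint() can do a complete match with an all-fixed mask. The padKey() helper and the space pad character are illustrative, following the "A14568 " example earlier in the thread:

    import org.apache.hadoop.hbase.util.Bytes;

    public class KeyPaddingSketch {
      static final int KEY_LENGTH = 7;

      // Right-pad shorter keys with spaces (as in "A14568 ") so every row key
      // is exactly 7 bytes and an all-fixed FuzzyRowFilter mask ({0,0,0,0,0,0,0})
      // matches whole keys.
      static byte[] padKey(String key) {
        assert key.length() <= KEY_LENGTH;
        return Bytes.toBytes(String.format("%-" + KEY_LENGTH + "s", key));
      }

      public static void main(String[] args) {
        System.out.println("[" + Bytes.toString(padKey("A1456")) + "]"); // "[A1456  ]"
      }
    }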