Re: Client Get vs Coprocessor scan performance
Kiru,
What's your column family name? Just to confirm, the column qualifier of
your key value is C_10345 and this stores a value as a Double using
Bytes.toBytes(double)? Are any of the Double values negative? Any other key
values?
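
For context on the negatives question: Bytes.toBytes(double) stores the raw
IEEE-754 bits, which do not sort numerically under HBase's unsigned byte
comparison. A minimal illustration (class name is just for the example):

    import org.apache.hadoop.hbase.util.Bytes;

    public class SignCheck {
        public static void main(String[] args) {
            byte[] neg = Bytes.toBytes(-1.5d); // sign bit set => high unsigned bytes
            byte[] pos = Bytes.toBytes(1.5d);
            // prints true: -1.5 sorts AFTER 1.5 in HBase's byte order
            System.out.println(Bytes.compareTo(neg, pos) > 0);
            System.out.println(Bytes.toDouble(neg)); // round-trips fine: -1.5
        }
    }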

Can you give me an idea of the kind of fuzzy filtering you're doing on the
7 char row key? We may want to model that as a set of row key columns in
Phoenix to leverage the skip scan more.
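
For concreteness, a sketch of the kind of mapping I mean. The split of the
7-char key into a 3-char and a 4-char part is made up, as are the table and
family names, and UNSIGNED_DOUBLE assumes your stored doubles are
non-negative (and a Phoenix build that has that type):

    import java.sql.Connection;
    import java.sql.DriverManager;

    Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
    conn.createStatement().execute(
        "CREATE TABLE \"MYTABLE\" ("
        + " PART1 CHAR(3) NOT NULL,"  // leading 3 chars of the 7-char key
        + " PART2 CHAR(4) NOT NULL,"  // trailing 4 chars
        + " \"CF\".\"C_10345\" UNSIGNED_DOUBLE,"
        + " CONSTRAINT PK PRIMARY KEY (PART1, PART2))");
    // The skip scan kicks in when leading PK columns are constrained, e.g.
    // WHERE PART1 IN ('ABC', 'XYZ')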

How about I model your aggregation as an AVG over a group of rows? What
would your GROUP BY expression look like? Are you grouping based on a part
of the 7 char row key? Or on some other key value?
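
If, say, the grouping were on the leading key part, the query against the
hypothetical mapping above would look roughly like:

    import java.sql.ResultSet;

    ResultSet rs = conn.createStatement().executeQuery(
        "SELECT PART1, AVG(\"CF\".\"C_10345\")"
        + " FROM \"MYTABLE\" GROUP BY PART1");
    while (rs.next()) {
        System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
    }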

Thanks,
James
On Sun, Aug 18, 2013 at 2:16 PM, Kiru Pakkirisamy <[EMAIL PROTECTED]> wrote:

> James,
> Rowkey - String - length 7
> Col - String - variable length - but looks like C_10345
> Col value - Double
>
> If I can create a Phoenix schema mapping to this existing table, that would
> be great. I actually do a group-by on the column values and return another
> value which is a function of the stored value and an input double value. The
> input is a Map<String, Double> and the return is also a Map<String, Double>.
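>
> In rough Java terms the shape of it is something like this (f is a
> stand-in for the actual function, and result is the HBase Result for
> one row):
>
>     // needs java.util.*, org.apache.hadoop.hbase.KeyValue,
>     // and org.apache.hadoop.hbase.util.Bytes
>     Map<String, Double> out = new HashMap<String, Double>();
>     for (KeyValue kv : result.raw()) {
>         String col = Bytes.toString(kv.getQualifier()); // e.g. "C_10345"
>         Double in = input.get(col);                     // the input Map
>         if (in != null) {
>             out.put(col, f(Bytes.toDouble(kv.getValue()), in));
>         }
>     }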
>
>
> Regards,
> - kiru
>
>
> Kiru Pakkirisamy | webcloudtech.wordpress.com
>
>   ------------------------------
>  *From:* James Taylor <[EMAIL PROTECTED]>
> *To:* [EMAIL PROTECTED]; Kiru Pakkirisamy <[EMAIL PROTECTED]>
> *Sent:* Sunday, August 18, 2013 2:07 PM
>
> *Subject:* Re: Client Get vs Coprocessor scan performance
>
> Kiru,
> If you're able to post the key values, row key structure, and data types
> you're using, I can post the Phoenix code to query against it. You're doing
> some kind of aggregation too, right? If you could explain that part too,
> that would be helpful. It's likely that you can just query the existing
> HBase data in place, on the cluster you're already using
> (provided you put the Phoenix jar on all the region servers - use our 2.0.0
> version that just came out). Might be interesting to compare the amount of
> code necessary in each approach as well.
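>
> A minimal connection sketch, for reference - the ZooKeeper host is a
> placeholder and the driver class assumes the pre-Apache 2.x packaging:
>
>     // needs java.sql.Connection and java.sql.DriverManager
>     Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver");
>     Connection conn =
>         DriverManager.getConnection("jdbc:phoenix:zookeeper-host");
>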
> Thanks,
> James
>
>
> On Sun, Aug 18, 2013 at 12:16 PM, Kiru Pakkirisamy <[EMAIL PROTECTED]> wrote:
>
> James,
> I am using the FuzzyRowFilter or the Gets within a Coprocessor. It looks
> like I cannot use your SkipScanFilter by itself, as it has lots of Phoenix
> imports. I thought of writing my own custom filter and saw that the
> FuzzyRowFilter in the 0.94 branch also has an implementation of
> getNextKeyHint(); the catch is that it works well only with fixed-length
> keys if I want a complete match of the keys. After padding my keys to a
> fixed length, it seems to be fine.
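>
> For reference, my usage looks roughly like this - the key pattern is made
> up, and in 0.94 a mask byte of 0 means the position must match while 1
> means it is fuzzy:
>
>     import java.util.ArrayList;
>     import java.util.List;
>     import org.apache.hadoop.hbase.client.Scan;
>     import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
>     import org.apache.hadoop.hbase.util.Bytes;
>     import org.apache.hadoop.hbase.util.Pair;
>
>     List<Pair<byte[], byte[]>> fuzzyKeys = new ArrayList<Pair<byte[], byte[]>>();
>     fuzzyKeys.add(new Pair<byte[], byte[]>(
>         Bytes.toBytes("AB???CD"),           // key padded to the fixed 7 bytes
>         new byte[] {0, 0, 1, 1, 1, 0, 0})); // 0 = fixed, 1 = fuzzy
>     Scan scan = new Scan();
>     scan.setFilter(new FuzzyRowFilter(fuzzyKeys));
>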
> Once I confirm some key locality and other issues (like heap), I will try
> to benchmark this table alone against Phoenix on another cluster. Thanks.
>
> Regards,
> - kiru
>
>
> Kiru Pakkirisamy | webcloudtech.wordpress.com
>
>
> ________________________________
>  From: James Taylor <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Cc: Kiru Pakkirisamy <[EMAIL PROTECTED]>
> Sent: Sunday, August 18, 2013 11:44 AM
> Subject: Re: Client Get vs Coprocessor scan performance
>
>
> Would be interesting to compare against Phoenix's Skip Scan
> (http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html),
> which does a scan through a coprocessor and is more than 2x faster
> than multi Get (plus it handles multi-range scans in addition to point
> gets).
>
> James
>
> On Aug 18, 2013, at 6:39 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > bq. Get'ting 100 rows seems to be faster than the FuzzyRowFilter (mask on
> > the whole length of the key)
> >
> > In this case the Get's are very selective. The number of rows
> FuzzyRowFilter
> > was evaluated against would be much higher.
> > It would be nice if you could recall the time each took.
> >
> > bq. Also, I am seeing very bad concurrent query performance
> >
> > Were the multi Get's performed by your coprocessor within region boundaries?