Pig >> mail # user >> Syntax and HBaseStorage questions


Syntax and HBaseStorage questions
Hi,

Consider this use case:

There is a program that stores CPU usage metrics in an HBase table.  This
HBase table has a column family called cpu, and per-core usage is
stored in columns such as cpu:user.0, cpu:user.1, etc.  The numeric
suffix is the unique id of the CPU core in the system.

It is possible to write a query like:

SystemMetrics = load 'hbase://SystemMetrics' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('tags:cluster
cpu:combined.0 cpu:combined.1 ... system:LoadAverage.1','-loadKey') AS
(rowKey: chararray, cluster: chararray, cpuCombined0:float,
cpuCombined1:float ... LoadAverage:float);

This requires a long list of columns to load, and the same list must be
repeated in later statements such as foreach and group by:

CleanseBuffer = foreach SystemMetrics generate
REGEX_EXTRACT($0,'^\\d+',0) as time, cluster, cpuCombined0,
cpuCombined1, ..., LoadAverage;

The syntax works fine, but it would be nice to load all columns of a
given column family without specifying individual columns.

i.e. SystemMetrics = load 'hbase://SystemMetrics' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('tags:cluster cpu
system');

Would this syntax be possible to implement in Pig?

Second question: is it possible to alter one tuple in a bag without
having to specify the other tuples in the same bag?

For tables with many columns, it would be nice to have a shorthand
that makes Pig scripts shorter to write.
Any tips on making foreach and group by shorter?  Thanks.

regards,
Eric