Pig, mail # user - Syntax and HBaseStorage questions


Re: Syntax and HBaseStorage questions
Eric Yang 2010-12-30, 05:12
Hi Dmitriy,

Issue filed: https://issues.apache.org/jira/browse/PIG-1782

I meant to say columns in my previous message.  It should read as
"Make alteration of a column in a bag, but not specifying other
columns in the same bag".

Let's assume PIG-1782 is addressed and that CpuMetrics from the
PIG-1782 example contains 250 columns.
The next line that I write would look like this:

ConcatBuffer = foreach CpuMetrics generate CONCAT(CONCAT($0, '-'),
$1) as rowId, $2, $3, $4, $5, $6, $7, $8, $9, $10, ... $250;

It would be nice if the statement could be written like this:

ConcatBuffer = foreach CpuMetrics generate CONCAT(CONCAT($0, '-'),
$1) as rowId, MIRROR($2..$250);

Is there something like this among Pig's built-in functions?
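
In the meantime, one possible workaround is to generate the long
statement with a small script instead of typing it by hand. The
following is only a minimal sketch in Python, not a Pig feature: the
helper names positional_refs and concat_statement are made up for
illustration, and the column range is passed in explicitly rather
than read from the table.

```python
def positional_refs(first, last):
    """Return '$first, ..., $last' (inclusive) as a comma-separated string."""
    if first > last:
        raise ValueError("first must not exceed last")
    return ", ".join(f"${i}" for i in range(first, last + 1))


def concat_statement(alias, source, first, last):
    """Build the foreach statement above with columns first..last spelled out.

    alias and source are the Pig relation names; both helpers are
    hypothetical and only produce text to paste into a Pig script.
    """
    return (
        f"{alias} = foreach {source} generate "
        f"CONCAT(CONCAT($0, '-'), $1) as rowId, "
        f"{positional_refs(first, last)};"
    )


if __name__ == "__main__":
    # Prints the full statement with $2 through $250 written out.
    print(concat_statement("ConcatBuffer", "CpuMetrics", 2, 250))
```

The output can be pasted into the Pig script, so the column list
only has to be maintained in one place.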

regards,
Eric

On Wed, Dec 29, 2010 at 6:09 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> Hi Eric,
> Yes, we can certainly add the convention that a string without a ":" refers
> to a complete column family.
> It should be fairly straightforward: step 1 is to open a ticket on the
> Jira, step 2 is to do it :).
>
> I am not sure what you mean by "make alteration of a tuple in a bag, but not
> specifying other tuples in the same bag" -- can you provide an example that
> illustrates what you want to do?
>
> Thanks,
> -Dmitriy
>
> On Tue, Dec 28, 2010 at 11:10 PM, Eric Yang <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> Consider this use case:
>>
>> There is a program that stores CPU usage metrics in an HBase table.
>> This HBase table has a column family called cpu, and individual CPU
>> core usage is stored in columns like cpu:user.0, cpu:user.1, etc.  The
>> suffix number represents the unique CPU core ID in the system.
>>
>> While it is possible to write a query like:
>>
>> SystemMetrics = load 'hbase://SystemMetrics' USING
>> org.apache.pig.backend.hadoop.hbase.HBaseStorage('tags:cluster
>> cpu:combined.0 cpu:combined.1 ... system:LoadAverage.1','-loadKey') AS
>> (rowKey: chararray, cluster: chararray, cpuCombined0:float,
>> cpuCombined1:float ... LoadAverage:float);
>>
>> this requires a long list of columns to load, and the same list
>> must be repeated in a later command like:
>>
>> CleanseBuffer = foreach SystemMetrics generate
>> REGEX_EXTRACT($0,'^\\d+',0) as time, cluster, cpuCombined0,
>> cpuCombined1, ..., LoadAverage;
>>
>> The syntax works fine, but it would be nice to load all columns of a
>> given column family without specifying individual columns.
>>
>> i.e. SystemMetrics = load 'hbase://SystemMetrics' USING
>> org.apache.pig.backend.hadoop.hbase.HBaseStorage('tags:cluster cpu
>> system');
>>
>> Is this syntax possible to implement in pig?
>>
>> Second question, is it possible to make alteration of a tuple in a
>> bag, but not specifying other tuples in the same bag?
>>
>> For tables with many columns, it would be nice to have a shorthand
>> that makes Pig statements shorter to write.
>> Any tips on making foreach and group by shorter?  Thanks
>>
>> regards,
>> Eric
>>
>