Re: Chukwa Pig Data Passthrough
Do you want to split on the Chukwa payload fields or the fields in the
record body?

I have scripts that do similar things with the body using FILTER and a
custom TOKENIZE UDF I wrote to tokenize the body content. I'm using
the latest ChukwaLoader for Pig 0.7.0, but the previous one should
work the same way.

define chukwaLoader org.apache.hadoop.chukwa.pig.ChukwaLoader();
define tokenize     my.udfs.TOKENIZE();

raw = LOAD '/your/path' USING chukwaLoader AS (ts: long, fields);
-- timePeriod is not a Pig builtin; it needs its own define (not shown)
bodies = FOREACH raw GENERATE tokenize((chararray)fields#'body') AS tokens,
         timePeriod(ts) AS time;

bodies_this_period = FILTER bodies BY ((chararray)time == '[some timestamp]');

STORE bodies_this_period INTO '/some/output/path';

From bodies_this_period you can access the different tokens using
$0.token0, $0.token1, etc.

I wrote TOKENIZE to return an ordered tuple of the values found, since
Pig's TOKENIZE returns an unordered bag, which isn't that useful in
this case.
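
For illustration, the core of an order-preserving tokenizer could look like the sketch below. This is not the actual UDF from this thread: the class name OrderedTokenizer and the whitespace delimiter are assumptions, and the Pig wrapper is left out so the snippet stands alone. In a real UDF this logic would sit inside the exec method of a class extending org.apache.pig.EvalFunc<Tuple>, with the list handed to TupleFactory.getInstance().newTuple(list).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not the UDF from this thread: splits a record body
// into tokens and keeps them in the order they appear.
class OrderedTokenizer {

    // Assumption: tokens are separated by runs of whitespace.
    static List<String> tokenize(String body) {
        if (body == null) {
            return Collections.emptyList();
        }
        String trimmed = body.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        List<String> tokens = new ArrayList<>();
        // String.split returns the pieces in source order, so the position
        // of each token in the list matches its position in the body.
        for (String token : trimmed.split("\\s+")) {
            tokens.add(token);
        }
        return tokens;
    }
}
```

Because the result is positional, the first token always lands in the first slot, which is what makes field-style access on the resulting tuple meaningful.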


On Mon, Oct 4, 2010 at 2:35 PM, Jerome Boulon <[EMAIL PROTECTED]> wrote:
> Hi Matt,
> When I designed this, the schema was NOT available in Pig. I’m not sure if
> this has changed or not.
> So I’m using the constructor as a way to get around the lack of schema
> definition but if you can get it now from the query & the storage handler
> then it should be a pretty easy thing to do.
> So do you know if the SQL schema is now available in Pig?
> /Jerome.
> On 10/4/10 2:28 PM, "Matt Davies" <[EMAIL PROTECTED]> wrote:
> Hey all-
> Trying to do some operations utilizing Chukwa and Pig.  Would like to
> basically
> 1. Read in the data from HDFS
> 2. Do some SPLIT operations
> 3. write the various files out with all the fields as seen during the
> loading phase.
> So, my question is this - is there a way to utilize the
> org.apache.hadoop.chukwa.ChukwaStorage(); engine to load in and then store
> out all the various fields without having to individually define fields in
> the ChukwaStorage constructor?
> Thanks,
> Matt