Do you want to split on the Chukwa payload fields or the fields in the body?
I have scripts that do similar things with the body using FILTER and a
custom TOKENIZE UDF I wrote to tokenize the body content. I'm using
the latest ChukwaLoader for Pig 0.7.0, but the previous one should
work the same way.
define chukwaLoader org.apache.hadoop.chukwa.pig.ChukwaLoader();
define tokenize my.udfs.TOKENIZE();

raw = LOAD '/your/path' USING chukwaLoader AS (ts: long, fields);
bodies = FOREACH raw GENERATE tokenize((chararray)fields#'body') AS tokens,
    timePeriod(ts) AS time;
bodies_this_period = FILTER bodies BY ((chararray)time == '[some timestamp]');
STORE bodies_this_period INTO '/some/output/path';
From bodies_this_period you can access the different tokens using
$0.token0, $0.token1, etc.
I wrote TOKENIZE to return an ordered tuple of the values found, since
Pig's TOKENIZE returns an unordered bag, which isn't that useful in this case.
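For reference, the core of such a UDF is just order-preserving tokenization. Here's a minimal plain-Java sketch of that logic (the class and method names are made up for illustration; a real Pig UDF would extend org.apache.pig.EvalFunc&lt;Tuple&gt; and build the result with TupleFactory instead of returning a List):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the ordered-tokenize logic behind a custom Pig UDF.
// Unlike Pig's built-in TOKENIZE, which returns an unordered bag,
// this preserves the position of each token in the original body.
public class OrderedTokenize {

    // Split a record body on whitespace, keeping tokens in their
    // original order; null or blank input yields an empty list.
    public static List<String> tokenize(String body) {
        List<String> tokens = new ArrayList<>();
        if (body == null) {
            return tokens;
        }
        for (String t : body.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("GET /index.html 200"));
    }
}
```

Because the result is positional, downstream scripts can reference token0, token1, etc. by index rather than having to re-sort a bag.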
On Mon, Oct 4, 2010 at 2:35 PM, Jerome Boulon <[EMAIL PROTECTED]> wrote:
> Hi Matt,
> When I designed this, the schema was NOT available in Pig. I’m not sure if
> this has changed or not.
> So I’m using the constructor as a way to get around the lack of schema
> definition but if you can get it now from the query & the storage handler
> then it should be a pretty easy thing to do.
> So do you know if the SQL schema is now available in Pig?
> On 10/4/10 2:28 PM, "Matt Davies" <[EMAIL PROTECTED]> wrote:
> Hey all-
> Trying to do some operations utilizing Chukwa and Pig. I would like to:
> 1. Read in the data from HDFS
> 2. Do some SPLIT operations
> 3. Write the various files out with all the fields as seen during the
> loading phase.
> So, my question is this - is there a way to utilize the
> org.apache.hadoop.chukwa.ChukwaStorage() engine to load in and then store
> out all the various fields without having to individually define fields in
> the ChukwaStorage constructor?
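For what it's worth, since ChukwaLoader hands back the record fields as a single map, the load-then-split part can be written without naming individual columns. A rough sketch, with made-up paths and a made-up 'level' key for the split condition (whether ChukwaStorage can write the map back out without a field list in its constructor is exactly the open question here, so this sketch falls back to PigStorage):

```
define chukwaLoader org.apache.hadoop.chukwa.pig.ChukwaLoader();

raw = LOAD '/chukwa/demux/output' USING chukwaLoader AS (ts: long, fields);

-- Split on a value inside the map; every record keeps its full map intact.
SPLIT raw INTO errors IF (chararray)fields#'level' == 'ERROR',
               others IF (chararray)fields#'level' != 'ERROR';

STORE errors INTO '/out/errors' USING PigStorage();
STORE others INTO '/out/others' USING PigStorage();
```

Since the map is never flattened, all the fields seen at load time ride along through the SPLIT untouched.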