Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # user - Sequence Processing in Pig

Copy link to this message
Re: Sequence Processing in Pig
Andy Schlaikjer 2012-02-22, 18:45
Hi Chris,

Not sure about your options in Jython, but it's pretty easy to do this
in native Pig:

1. Create custom EvalFunc UDF whose ctor takes (String) argument K,
the expected size of output tuples, and parses out the int and stores
in private field.
2. Override the UDF's outputSchema method and use the value of K to
construct valid output schema of appropriate size.


On Tue, Feb 21, 2012 at 5:55 PM, Chris Diehl <[EMAIL PROTECTED]> wrote:
> Hello,
> I've got some activity data that I've processed with Pig to generate a
> sequence of bags, one per user, that each contain a set of tuples of the
> form (timestamp, activity id) that are ordered in time.
> From each bag, I would like to produce a new bag of k-tuples where each
> k-tuple contains a consecutive sequence of k activities from the original
> ordered bag. For my initial stab at this, I wrote the following Jython UDF:
> @outputSchemaFunction("schema")
> def k_tuple_expansion(activities,k):
>    """Scans through a time ordered bag of tuples of the form
> (timestamp,activity id)
>    for a given user and returns a bag of k-tuples of all activity
> sequences of length k.
>    """
>    tups = []
>    for i in xrange(k-1,len(activities)):
>        actseq = [activities[i-j][1] for j in range(k-1,-1,-1)]
>        tups.append(tuple(actseq))
>    return tups
> @schemaFunction("schema")
> def schema(input):
>    # Return whatever type we were handed
>    return input
> This code works appropriately. Problems arise when I try to process these
> results further in Pig. Given I'm not specifying a static output schema,
> since it is a function of k, Pig doesn't readily know what is being
> returned.
> My attempt to flatten each user bag of k-tuples is failing.
> Given I could construct a string representation of the output schema once k
> is known, is there some way to construct the string and pass it back? My
> use of the schema function above follows the only example I've seen here.
> https://cwiki.apache.org/PIG/udfsusingscriptinglanguages.html
> If the Jython UDF approach is not the best, is there a native Pig approach
> to attacking this problem?
> Any pointers would be most appreciated!
> Chris