Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # user >> Sequence Processing in Pig


Copy link to this message
-
Re: Sequence Processing in Pig
Hi Chris,

Not sure about your options in Jython, but it's pretty easy to do this
in native Pig:

1. Create custom EvalFunc UDF whose ctor takes (String) argument K,
the expected size of output tuples, and parses out the int and stores
in private field.
2. Override the UDF's outputSchema method and use the value of K to
construct valid output schema of appropriate size.

Andy
@sagemintblue

On Tue, Feb 21, 2012 at 5:55 PM, Chris Diehl <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I've got some activity data that I've processed with Pig to generate a
> sequence of bags, one per user, that each contain a set of tuples of the
> form (timestamp, activity id) that are ordered in time.
> From each bag, I would like to produce a new bag of k-tuples where each
> k-tuple contains a consecutive sequence of k activities from the original
> ordered bag. For my initial stab at this, I wrote the following Jython UDF:
>
> @outputSchemaFunction("schema")
> def k_tuple_expansion(activities,k):
>    """Scans through a time ordered bag of tuples of the form
> (timestamp,activity id)
>    for a given user and returns a bag of k-tuples of all activity
> sequences of length k.
>    """
>    tups = []
>    for i in xrange(k-1,len(activities)):
>        actseq = [activities[i-j][1] for j in range(k-1,-1,-1)]
>        tups.append(tuple(actseq))
>    return tups
>
> @schemaFunction("schema")
> def schema(input):
>    # Return whatever type we were handed
>    return input
>
> This code works appropriately. Problems arise when I try to process these
> results further in Pig. Given I'm not specifying a static output schema,
> since it is a function of k, Pig doesn't readily know what is being
> returned.
> My attempt to flatten each user bag of k-tuples is failing.
>
> Given I could construct a string representation of the output schema once k
> is known, is there some way to construct the string and pass it back? My
> use of the schema function above follows the only example I've seen here.
> https://cwiki.apache.org/PIG/udfsusingscriptinglanguages.html
>
> If the Jython UDF approach is not the best, is there a native Pig approach
> to attacking this problem?
>
> Any pointers would be most appreciated!
>
> Chris
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB