Pig >> mail # user >> Data vectorization and pig bag datatype


Data vectorization and pig bag datatype
Hi All,

We are using Apache Pig for building our data pipeline. We have data in the following fashion:

userid, items {code 1, code 2, ….}, few other features...

Each item has a unique alphanumeric code. I would like to use Mahout to cluster the users. To vectorize the data, we represent the item codes as a 1 x M matrix in which each column corresponds to one item (1 if the user has viewed that item, 0 otherwise). The matrix will have millions of columns. So each user will have an id and this matrix. I am generating the matrix in a Pig UDF.
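To make the encoding concrete, here is a minimal standalone sketch in Python. It is not the actual Pig UDF: the function names and the sample item codes are hypothetical, and it assumes the full set of item codes is known up front so that each code can be given a fixed column. It also shows the same row written as a list of non-zero column positions, which carries the same information when most entries are 0.

```python
def build_index(all_codes):
    """Assign each distinct item code a column position (sorted for determinism)."""
    return {code: col for col, code in enumerate(sorted(set(all_codes)))}


def to_indicator(viewed_codes, index):
    """Dense 1 x M row: 1 where the user viewed the item, 0 elsewhere."""
    row = [0] * len(index)
    for code in viewed_codes:
        row[index[code]] = 1
    return row


def to_sparse(viewed_codes, index):
    """Same row, keeping only the column positions that hold a 1."""
    return sorted({index[code] for code in viewed_codes})


# Hypothetical catalogue of four item codes and one user who viewed two of them.
index = build_index(["a1", "b2", "c3", "d4"])
dense = to_indicator(["a1", "d4"], index)
sparse = to_sparse(["a1", "d4"], index)

print(dense)   # [1, 0, 0, 1]
print(sparse)  # [0, 3]
```

The dense row grows with M for every user, while the sparse form grows only with the number of items that user actually viewed.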

AU = FOREACH A GENERATE FLATTEN(myparser.myUDF(key, values));

/*Data I get back from my UDF should have the following format: {(userid,1,0,0,1,0,.........)} */

STORE AU into 'vector.out' using $SEQFILE_STORAGE ('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER');

/* Use mahout for analyzing the data */

I am returning a bag from my UDF because the data can potentially contain hundreds of millions of items and, as I understand it, everything in a tuple has to fit into memory. Is there a better way of doing this? I want to make sure that I am on the right track.