Data vectorization and pig bag datatype
Hi All,

We are using Apache Pig to build our data pipeline. We have data in the following form:

userid, items {code 1, code 2, ...}, a few other features...

Each item has a unique alphanumeric code. I would like to use Mahout to cluster this data. To vectorize it, we represent the item-code info as a 1 x M matrix, where each column represents an item (1 if a given user has viewed that item, 0 otherwise); the matrix will have millions of columns. So each user will have an id and this matrix. I am generating the matrix in a Pig UDF, sketched below.
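For context, here is roughly what the UDF looks like (a minimal sketch: the class and field names are placeholders, and the item-code-to-column mapping is stubbed out rather than loaded for real):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MyUDF extends EvalFunc<DataBag> {
    private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance();
    private static final BagFactory BAG_FACTORY = BagFactory.getInstance();

    private final Map<String, Integer> itemIndex; // item code -> column index (0..M-1)
    private final int numItems;                   // M

    public MyUDF(String mappingSpec) {
        // Stubbed for illustration; the real code loads the code -> column
        // mapping from a side file shipped with the job.
        this.itemIndex = loadMapping(mappingSpec);
        this.numItems = itemIndex.size();
    }

    private static Map<String, Integer> loadMapping(String spec) {
        return new HashMap<String, Integer>(); // placeholder
    }

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        String userId = (String) input.get(0);
        DataBag items = (DataBag) input.get(1); // item codes viewed by this user

        // Build the 1 x M indicator row: userid, then one 0/1 column per item.
        Tuple row = TUPLE_FACTORY.newTuple(numItems + 1);
        row.set(0, userId);
        for (int col = 1; col <= numItems; col++) {
            row.set(col, 0);
        }
        for (Tuple item : items) {
            Integer col = itemIndex.get((String) item.get(0));
            if (col != null) {
                row.set(col + 1, 1); // this user viewed this item
            }
        }

        // Wrap the row in a bag to get the {(userid,1,0,0,...)} shape below.
        DataBag out = BAG_FACTORY.newDefaultBag();
        out.add(row);
        return out;
    }
}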

AU = FOREACH A GENERATE FLATTEN(myparser.myUDF(key, values));

/* The data I get back from my UDF should have the following format: {(userid,1,0,0,1,0,...)} */

STORE AU INTO 'vector.out' USING $SEQFILE_STORAGE ('-c $INT_CONVERTER', '-c $VECTOR_CONVERTER');

/* Use Mahout for analyzing the data */
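For reference, the storage parameters above expand along these lines. This is a sketch assuming elephant-bird's SequenceFileStorage; the converter class names (particularly the Mahout VectorWritable one) are my best guess, so verify them against your elephant-bird version:

%default SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage'
%default INT_CONVERTER 'com.twitter.elephantbird.pig.util.IntWritableConverter'
%default VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter'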

I am returning a bag from my UDF because the data can potentially have hundreds of millions of items, and from my understanding everything in a tuple needs to fit in memory. Is there a better way of doing this? I want to make sure that I am on the right track.
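To make the memory concern concrete, this is the distinction I am relying on (a toy sketch using Pig's real DataBag/Tuple APIs; the sizes are made up):

import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class BagVsTuple {
    public static void main(String[] args) throws Exception {
        TupleFactory tf = TupleFactory.getInstance();

        // A default bag is spillable: Pig's memory manager can flush it to
        // disk as it grows, so it can hold very large collections of tuples.
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        for (int i = 0; i < 1000000; i++) {
            bag.add(tf.newTuple(Integer.valueOf(i % 2)));
        }

        // A tuple is a fixed list of fields that must sit entirely in memory;
        // with hundreds of millions of columns, this is what worries me.
        Tuple wideRow = tf.newTuple(1000000);
        for (int i = 0; i < 1000000; i++) {
            wideRow.set(i, 0);
        }

        System.out.println("bag: " + bag.size() + " tuples, row: " + wideRow.size() + " fields");
    }
}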
