Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig >> mail # user >> Batching transformations in Pig


Copy link to this message
-
Batching transformations in Pig
Hi all,

I'm wondering if anyone has experience with my following scenario:

I have a HBase table loaded with millions of records. I want to load these records into Pig, process each batch of 1000 by calling an external API, and then associate the results with the Tuples in each batch. I've having difficulty figuring out how to do this in Pig (via a UDF) if this is even possible or a valid scenario for Pig. In a nutshell, this is what I want:

[Input]
ID1,A,B
ID2,C,D
ID3,E,F
ID4,G,H
...
IDn,Y,Z

For example purposes, let's say batch size is 2. So, I'd like ID1 and ID2 to be batched together, call the external API, and then have the returned Tuples to include the data from the API call. Similarly, ID3 and ID4 are batched together, call the external API, and the returned Tuples have the data from API. So, I'd like my output to be:

[Output]
ID1,A,B,1,2
ID2,C,D,3,4
ID3,E,F,5
ID4,G,H,6,7,8
...
IDn,Y,Z

Yes, I can call the API per record, but I want to reduce the # of API calls, thus, I'd like to batch of set of records and then call the API.

Hope this makes sense. Is this possible via Pig with UDF?

-Terry

PS: I did try implementing an accumulator by grouping my records via a single constant value, hoping that the accumulate() and getValue() are called per batch with no luck.

+
Dmitriy Ryaboy 2012-09-13, 09:48
+
Terry Siu 2012-09-13, 16:39
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB