Batching transformations in Pig
Terry Siu 2012-09-12, 18:55
I'm wondering if anyone has experience with the following scenario:
I have an HBase table loaded with millions of records. I want to load these records into Pig, process them in batches of 1000 by calling an external API once per batch, and then associate the results with the Tuples in each batch. I'm having difficulty figuring out how to do this in Pig (via a UDF), or whether this is even possible or a valid scenario for Pig. In a nutshell, this is what I want:
For example purposes, let's say the batch size is 2. I'd like ID1 and ID2 to be batched together, the external API to be called once for that batch, and the returned Tuples to include the data from the API call. Similarly, ID3 and ID4 are batched together, the external API is called once, and the returned Tuples include the data from the API. So, I'd like my output to be:
Yes, I can call the API per record, but I want to reduce the number of API calls, so I'd like to batch a set of records and then call the API once per batch.
Hope this makes sense. Is this possible in Pig with a UDF?
PS: I did try implementing an accumulator and grouping my records by a single constant value, hoping that accumulate() and getValue() would be called once per batch, but with no luck.
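
To make the batching I have in mind concrete, here is a minimal sketch in plain Java of the logic I'd want to run over the records. It uses no Pig classes at all; BatchSketch, processInBatches, and the api function are hypothetical names made up for illustration, and the sketch assumes the external API returns exactly one result per ID, in the same order as the IDs it was given.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class BatchSketch {

    /**
     * Splits ids into consecutive batches of at most batchSize, calls the
     * (hypothetical) external API once per batch, and pairs every id with
     * the result returned for it.
     */
    public static List<String[]> processInBatches(
            List<String> ids,
            int batchSize,
            Function<List<String>, List<String>> api) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String[]> out = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            List<String> batch = ids.subList(start, end);

            // One API call per batch instead of one per record.
            List<String> results = api.apply(batch);

            // Assumption: the API returns one result per id, in order.
            if (results.size() != batch.size()) {
                throw new IllegalStateException(
                        "expected " + batch.size() + " results, got " + results.size());
            }
            for (int i = 0; i < batch.size(); i++) {
                out.add(new String[] {batch.get(i), results.get(i)});
            }
        }
        return out;
    }
}
```

With a batch size of 2 and the IDs ID1 through ID4, this would call the API twice, once for (ID1, ID2) and once for (ID3, ID4). Presumably the same loop could sit inside a UDF that receives a bag of records, but how to get Pig to hand the UDF the records in this way is exactly the part I'm unsure about.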