Pig user mailing list: Batching transformations in Pig


RE: Batching transformations in Pig
Thanks, Dmitriy, that's eventually what I concluded I needed to do.

-Terry

-----Original Message-----
From: Dmitriy Ryaboy [mailto:[EMAIL PROTECTED]]
Sent: Thursday, September 13, 2012 2:49 AM
To: [EMAIL PROTECTED]
Subject: Re: Batching transformations in Pig

Group, and pass the grouped sets to your batch-processing UDF?

so:

data (id, bucket_id):
id1 bucket1
id2 bucket2
id3 bucket2
id4 bucket1

bucketized = group data by bucket_id;

bucket1, { (id1, bucket1), (id4, bucket1) }
bucket2, { (id2, bucket2), (id3, bucket2) }

batch_processed = foreach bucketized generate MyUDF(data);
D
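
A minimal sketch of what such a batch-processing UDF might look like; the class name BatchApiUDF, the batch size of 1000, and the callExternalApi() helper are hypothetical stand-ins, not something from this thread. The UDF takes the grouped bag, slices it into fixed-size batches, makes one external call per batch, and emits one output tuple per input tuple with the API results appended:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class BatchApiUDF extends EvalFunc<DataBag> {

    private static final int BATCH_SIZE = 1000; // assumed batch size

    @Override
    public DataBag exec(Tuple input) throws IOException {
        DataBag records = (DataBag) input.get(0); // the grouped bag
        DataBag output = BagFactory.getInstance().newDefaultBag();
        List<Tuple> batch = new ArrayList<Tuple>(BATCH_SIZE);
        for (Tuple t : records) {
            batch.add(t);
            if (batch.size() == BATCH_SIZE) {
                flush(batch, output);
            }
        }
        flush(batch, output); // final partial batch
        return output;
    }

    // One external call per batch; results come back as one list per
    // input tuple, in order, and are appended to that tuple's fields.
    private void flush(List<Tuple> batch, DataBag output) throws IOException {
        if (batch.isEmpty()) {
            return;
        }
        List<List<Object>> results = callExternalApi(batch);
        TupleFactory tf = TupleFactory.getInstance();
        for (int i = 0; i < batch.size(); i++) {
            Tuple out = tf.newTuple(batch.get(i).getAll());
            for (Object r : results.get(i)) {
                out.append(r);
            }
            output.add(out);
        }
        batch.clear();
    }

    // Hypothetical stand-in for the real external API call.
    private List<List<Object>> callExternalApi(List<Tuple> batch) throws IOException {
        throw new UnsupportedOperationException("replace with the real API call");
    }
}

Since the UDF returns a bag, the script would typically flatten it to get back one row per record, i.e. batch_processed = foreach bucketized generate flatten(MyUDF(data)); Note also that Terry's records don't carry a bucket column, so one would first synthesize a bucket key (for example, a hash of the row key modulo the desired number of groups) so each group holds roughly one batch's worth of records; grouping everything on a single constant key funnels the entire relation to one reducer.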

On Wed, Sep 12, 2012 at 11:55 AM, Terry Siu <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I'm wondering if anyone has experience with my following scenario:
>
> I have an HBase table loaded with millions of records. I want to load these records into Pig, process each batch of 1000 by calling an external API, and then associate the results with the Tuples in each batch. I'm having difficulty figuring out how to do this in Pig (via a UDF), if it is even possible or a valid scenario for Pig. In a nutshell, this is what I want:
>
> [Input]
> ID1,A,B
> ID2,C,D
> ID3,E,F
> ID4,G,H
> ...
> IDn,Y,Z
>
> For example purposes, let's say the batch size is 2. So, I'd like ID1 and ID2 to be batched together, the external API called once, and the returned Tuples to include the data from the API call. Similarly, ID3 and ID4 are batched together, the external API is called, and the returned Tuples have the data from the API. So, I'd like my output to be:
>
> [Output]
> ID1,A,B,1,2
> ID2,C,D,3,4
> ID3,E,F,5
> ID4,G,H,6,7,8
> ...
> IDn,Y,Z
>
> Yes, I can call the API per record, but I want to reduce the # of API calls; thus, I'd like to batch sets of records and then call the API once per batch.
>
> Hope this makes sense. Is this possible in Pig via a UDF?
>
> -Terry
>
> PS: I did try implementing an accumulator by grouping my records on a single constant value, hoping that accumulate() and getValue() would be called once per batch, but with no luck.
>
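
On the PS in the original message: Pig's Accumulator interface is a memory optimization, and Pig decides internally how many tuples to hand to each accumulate() call, so grouping on a constant and hoping for 1000-tuple calls won't work. The UDF itself has to buffer and flush at the desired batch size. A rough sketch of that variant, with the same hypothetical callExternalApi() helper as above:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class BatchApiAccumulator extends EvalFunc<DataBag>
        implements Accumulator<DataBag> {

    private static final int BATCH_SIZE = 1000; // assumed batch size
    private final List<Tuple> buffer = new ArrayList<Tuple>(BATCH_SIZE);
    private DataBag result = BagFactory.getInstance().newDefaultBag();

    @Override
    public void accumulate(Tuple input) throws IOException {
        // Pig passes a chunk of tuples (of a size Pig chooses) as a bag
        // in the first field; re-batch them to exactly BATCH_SIZE.
        DataBag chunk = (DataBag) input.get(0);
        for (Tuple t : chunk) {
            buffer.add(t);
            if (buffer.size() == BATCH_SIZE) {
                flush();
            }
        }
    }

    @Override
    public DataBag getValue() {
        try {
            flush(); // final partial batch
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return result;
    }

    @Override
    public void cleanup() {
        buffer.clear();
        result = BagFactory.getInstance().newDefaultBag();
    }

    // Used when Pig runs the UDF in non-accumulative mode.
    @Override
    public DataBag exec(Tuple input) throws IOException {
        accumulate(input);
        DataBag out = getValue();
        cleanup();
        return out;
    }

    private void flush() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        List<List<Object>> results = callExternalApi(buffer); // one call per batch
        TupleFactory tf = TupleFactory.getInstance();
        for (int i = 0; i < buffer.size(); i++) {
            Tuple out = tf.newTuple(buffer.get(i).getAll());
            for (Object r : results.get(i)) {
                out.append(r);
            }
            result.add(out);
        }
        buffer.clear();
    }

    // Hypothetical stand-in for the real external API call.
    private List<List<Object>> callExternalApi(List<Tuple> batch) throws IOException {
        throw new UnsupportedOperationException("replace with the real API call");
    }
}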