Pig >> mail # dev >> Modifying databag on the fly

Modifying databag on the fly
Hello devs

I have a specific case where I need to modify the contents of a DataBag (remove one field from each tuple), but I want to do it in place rather than create another DataBag with a new set of tuples.
Say a UDF receives the following bag as input:

{(111,222,3,121), (112,223,2,131), (113,224,4,141)}

I want to iterate through this bag and generate an output bag with the 3rd field of each tuple removed, giving the following output:
{(111,222,121), (112,223,131), (113,224,141)}
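
To make the intended transformation concrete, here is a minimal sketch in plain Java. It deliberately does not use the Pig API: mutable lists stand in for Tuple and DataBag, and the class and method names (DropFieldSketch, dropField) are hypothetical. It only illustrates the "edit each tuple in place" behaviour being asked for, not a way to achieve it with DataBag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: List<Integer> plays the role of a tuple,
// List<List<Integer>> plays the role of the bag. Not the Pig API.
final class DropFieldSketch {

    // Removes the field at the given 0-based index from every tuple.
    // Each tuple is mutated in place; no new tuples or bag are allocated.
    static void dropField(List<List<Integer>> bag, int index) {
        for (List<Integer> tuple : bag) {
            tuple.remove(index); // int argument, so this removes by position
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> bag = new ArrayList<>();
        bag.add(new ArrayList<>(Arrays.asList(111, 222, 3, 121)));
        bag.add(new ArrayList<>(Arrays.asList(112, 223, 2, 131)));
        bag.add(new ArrayList<>(Arrays.asList(113, 224, 4, 141)));

        dropField(bag, 2); // the 3rd field has 0-based index 2

        System.out.println(bag);
        // prints [[111, 222, 121], [112, 223, 131], [113, 224, 141]]
    }
}
```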

Since the number of tuples in this bag is expected to be large, I cannot create a new set of tuples and build a second bag from them, as this would cause an OOM exception.

Also, I do not want to flatten this bag, because it will be passed to the DISTINCT operator to compute the distinct elements in the bag.
As far as I can tell from the javadocs for DataBag, there is no way to modify a bag on the fly. Is there any other way to solve this?

-- Prasanth