Pig >> mail # dev >> Modifying databag on the fly

Prasanth J 2012-09-06, 01:08
Re: Modifying databag on the fly
You cannot modify a bag once it is written.  The implementation is written around the assumption that bags are immutable after they are written.  

Creating a new bag should not cause an OOM exception, as bags are built to spill to disk when they grow too large. In fact, it is this spilling feature that makes in-place modification impossible.
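
For illustration, here is a minimal sketch of the "build a new bag" approach: walk the input once and append a projected copy of each tuple to the output. It uses plain Java lists as stand-ins so it runs without Pig on the classpath; the class name DropFieldSketch and the methods dropField and project are hypothetical. In an actual UDF the same loop would presumably iterate the input DataBag, build each tuple with TupleFactory.getInstance().newTuple(...), and add it to a bag obtained from BagFactory.getInstance().newDefaultBag(), which is the spillable type. The plain list used for the output here does not spill.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class DropFieldSketch {

    /** Returns a copy of one tuple without the field at the given 0-based index. */
    static List<Object> dropField(List<Object> tuple, int index) {
        List<Object> out = new ArrayList<>();
        for (int i = 0; i < tuple.size(); i++) {
            if (i != index) {
                out.add(tuple.get(i));
            }
        }
        return out;
    }

    /**
     * Walks the input once and appends each projected tuple to the output.
     * Only one input tuple is handled at a time; the input is never modified.
     * (Stand-in only: a real Pig bag would spill to disk, this list does not.)
     */
    static List<List<Object>> project(Iterator<List<Object>> input, int index) {
        List<List<Object>> output = new ArrayList<>();
        while (input.hasNext()) {
            output.add(dropField(input.next(), index));
        }
        return output;
    }

    public static void main(String[] args) {
        // The example bag from the question, modelled as a list of lists.
        List<List<Object>> bag = Arrays.asList(
            Arrays.<Object>asList(111, 222, 3, 121),
            Arrays.<Object>asList(112, 223, 2, 131),
            Arrays.<Object>asList(113, 224, 4, 141));

        // Drop the 3rd field (index 2) of every tuple.
        System.out.println(project(bag.iterator(), 2));
        // prints [[111, 222, 121], [112, 223, 131], [113, 224, 141]]
    }
}
```

The input is read through an iterator and never touched, which matches the immutability assumption described above; the result is a separate bag that can then be handed to DISTINCT.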


On Sep 5, 2012, at 6:08 PM, Prasanth J wrote:

> Hello devs
> I have a specific case where I need to modify the contents of a DataBag (remove a field from each tuple), but I want to do it in place and do not want to create another DataBag with a new set of tuples.
> The situation is, say I have the following input for a UDF:
> {(111,222,3,121), (112,223,2,131), (113,224,4,141)}
> I want to iterate through this bag and generate an output bag with the 3rd field of each tuple removed, to get the following output:
> {(111,222,121), (112,223,131), (113,224,141)}
> Since the number of tuples in this bag is expected to be large, I cannot create a new set of tuples and a new bag, as this will cause an OOM exception.
> Also, I do not want to flatten this bag, as it will be passed to the DISTINCT operator to compute the distinct elements in the bag.
> As seen from the javadocs for DataBag, there is no way to convert a bag on the fly. I wonder if there is any other way to solve this?
> Thanks
> -- Prasanth
Prasanth J 2012-09-06, 01:30
Alan Gates 2012-09-06, 02:38
Dmitriy Ryaboy 2012-09-08, 06:10