Pig >> mail # user >> Java heap error


Re: Java heap error
Hi Syed,
I think the problem you faced is the same as the one described in the newly created JIRA - https://issues.apache.org/jira/browse/PIG-1516.

As a workaround, you can disable the combiner (see the above JIRA). This is what you have done indirectly, by using a new SUM UDF that does not implement the Algebraic interface.
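For reference, the combiner can typically be switched off from the command line via the pig.exec.nocombiner property (a hedged sketch; check that your Pig version honors this property):

```shell
# Disable the combiner for this run, as a workaround for PIG-1516.
# (Assumes pig.exec.nocombiner is supported by the Pig version in use.)
pig -Dpig.exec.nocombiner=true myscript.pig
```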
I will be submitting a patch soon for the 0.8 release.

-Thejas
On 7/9/10 4:01 PM, "Syed Wasti" <[EMAIL PROTECTED]> wrote:

Yes Ashutosh, that is the case, and here is the code for the UDF. Let me know
what you find.

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class GroupSum extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory;
    BagFactory mBagFactory;

    public GroupSum() {
        this.mTupleFactory = TupleFactory.getInstance();
        this.mBagFactory = BagFactory.getInstance();
    }

    public DataBag exec(Tuple input) throws IOException {
        if (input.size() != 1) {
            int errCode = 2107;
            String msg = "GroupSum expects one input but received "
                    + input.size()
                    + " inputs. \n";
            throw new ExecException(msg, errCode);
        }
        try {
            DataBag output = this.mBagFactory.newDefaultBag();
            Object o1 = input.get(0);
            if (o1 instanceof DataBag) {
                DataBag bag1 = (DataBag) o1;
                if (bag1.size() == 1L) {
                    return bag1;
                }
                sumBag(bag1, output);
            }
            return output;
        } catch (ExecException ee) {
            throw ee;
        }
    }

    private void sumBag(DataBag o1, DataBag emitTo) throws IOException {
        Iterator<?> i1 = o1.iterator();
        Tuple row = null;
        Tuple firstRow = null;

        int fld1 = 0, fld2 = 0, fld3 = 0, fld4 = 0, fld5 = 0;
        int cnt = 0;
        while (i1.hasNext()) {
            row = (Tuple) i1.next();
            if (cnt == 0) {
                firstRow = row;
            }
            fld1 += (Integer) row.get(1);
            fld2 += (Integer) row.get(2);
            fld3 += (Integer) row.get(3);
            fld4 += (Integer) row.get(4);
            fld5 += (Integer) row.get(5);
            cnt ++;
        }
        if (firstRow == null) {
            // empty input bag; nothing to sum or emit
            return;
        }
        //field 0 has the id in it.
        firstRow.set(1, fld1);
        firstRow.set(2, fld2);
        firstRow.set(3, fld3);
        firstRow.set(4, fld4);
        firstRow.set(5, fld5);
        emitTo.add(firstRow);
    }

    public Schema outputSchema(Schema input) {
        try {
            Schema tupleSchema = new Schema();
            tupleSchema.add(input.getField(0));
            tupleSchema.setTwoLevelAccessRequired(true);
            return tupleSchema;
        } catch (Exception e) {
            // fall through; return null if the input schema is unavailable
        }
        return null;
    }
}
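For context on why this UDF sidesteps the combiner: Pig's built-in SUM implements the Algebraic interface (getInitial/getIntermed/getFinal), which lets the combiner apply partial sums map-side, while GroupSum does not, so no combiner runs. A stdlib-only sketch of that three-phase pattern (the class and method names below are illustrative, not Pig's actual API):

```java
import java.util.List;

/**
 * Stdlib-only sketch (no Pig classes) of the three-phase "algebraic"
 * aggregation that Pig's combiner relies on: the same SUM can be applied
 * to partial results, so map-side partial sums keep bags small.
 */
public class AlgebraicSumSketch {
    // Initial phase: turn each raw value into a partial aggregate.
    static long initial(int value) {
        return value;
    }

    // Intermediate phase: combine partial aggregates (runs in the combiner).
    static long intermed(List<Long> partials) {
        return partials.stream().mapToLong(Long::longValue).sum();
    }

    // Final phase: combine the remaining partials (runs in the reducer).
    // ("fin" because "final" is a reserved word in Java.)
    static long fin(List<Long> partials) {
        return intermed(partials);
    }

    public static void main(String[] args) {
        List<Long> mapSide1 = List.of(initial(1), initial(2));
        List<Long> mapSide2 = List.of(initial(3), initial(4));
        // The combiner pre-aggregates each map's output...
        List<Long> combined = List.of(intermed(mapSide1), intermed(mapSide2));
        // ...and the reducer finishes the job.
        System.out.println(fin(combined)); // prints 10
    }
}
```

Because summation is associative, the combiner can collapse each map's bag before it is spilled or shuffled; a non-algebraic UDF forces the whole bag to the reducer instead.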
On 7/9/10 2:32 PM, "Ashutosh Chauhan" <[EMAIL PROTECTED]> wrote:

> Hi Syed,
>
> Do you mean your query fails with an OOME if you use Pig's builtin SUM,
> but succeeds if you use your own SUM UDF? If that is so, that's
> interesting. I have a hunch why that is the case, but would like to
> confirm. Would you mind sharing your SUM UDF?
>
> Ashutosh
> On Fri, Jul 9, 2010 at 12:50, Syed Wasti <[EMAIL PROTECTED]> wrote:
>> Hi Ashutosh,
>> Did not try options 2 and 3; I shall work on those sometime next week.
>> Increasing the heap size alone did not help initially, but with the
>> increased heap size I came up with a UDF to do the SUM on the grouped data
>> for the last step in my script, and it completes my query without any
>> errors now.
>>
>> Syed
>>
>>
>> On 7/8/10 5:58 PM, "Ashutosh Chauhan" <[EMAIL PROTECTED]> wrote:
>>
>>> Aah.. forgot to mention how to set that param in 3). While launching
>>> pig, provide it as a -D command-line switch, as follows:
>>> pig -Dpig.cachedbag.memusage=0.02f myscript.pig
>>>
>>> On Thu, Jul 8, 2010 at 17:45, Ashutosh Chauhan
>>> <[EMAIL PROTECTED]> wrote:
>>>> I would recommend the following things, in order:
>>>>
>>>> 1) Increasing heap size should help.
>>>> 2) It seems you are on 0.7. There are a couple of memory fixes we have
>>>>>>> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>>>>>>> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
>>>>>>> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>>>>>>> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)