Pig, mail # user - Java heap error


Re: Java heap error
Thejas M Nair 2010-07-23, 20:15
Hi Syed,
I think the problem you faced is the same as the one described in the newly created jira - https://issues.apache.org/jira/browse/PIG-1516 .

As a workaround, you can disable the combiner (see the above jira). This is what you have done indirectly, by using a new SUM UDF that does not implement the Algebraic interface.
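For reference, a rough sketch of that workaround as a command-line switch (the `pig.exec.nocombiner` property is the usual way to disable the combiner; confirm the exact property name against your Pig version):

```shell
# Sketch: disable the combiner for the whole script (workaround discussed in PIG-1516)
pig -Dpig.exec.nocombiner=true myscript.pig
```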
I will be submitting a patch soon for the 0.8 release.

-Thejas
On 7/9/10 4:01 PM, "Syed Wasti" <[EMAIL PROTECTED]> wrote:

Yes Ashutosh, that is the case, and here is the code for the UDF. Let me know
what you find.

public class GroupSum extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory;
    BagFactory mBagFactory;

    public GroupSum() {
        this.mTupleFactory = TupleFactory.getInstance();
        this.mBagFactory = BagFactory.getInstance();
    }

    public DataBag exec(Tuple input) throws IOException {
        // Bug fix: the original check was input.size() < 0, which is never true.
        if (input.size() != 1) {
            int errCode = 2107;
            String msg = "GroupSum expects one input but received "
                    + input.size()
                    + " inputs. \n";
            throw new ExecException(msg, errCode);
        }
        try {
            DataBag output = this.mBagFactory.newDefaultBag();
            Object o1 = input.get(0);
            if (o1 instanceof DataBag) {
                DataBag bag1 = (DataBag) o1;
                if (bag1.size() == 1L) {
                    return bag1;
                }
                sumBag(bag1, output);
            }
            return output;
        } catch (ExecException ee) {
            throw ee;
        }
    }

    private void sumBag(DataBag o1, DataBag emitTo) throws IOException {
        Iterator<?> i1 = o1.iterator();
        Tuple row = null;
        Tuple firstRow = null;

        int fld1 = 0, fld2 = 0, fld3 = 0, fld4 = 0, fld5 = 0;
        int cnt = 0;
        while (i1.hasNext()) {
            row = (Tuple) i1.next();
            if (cnt == 0) {
                firstRow = row;
            }
            fld1 += (Integer) row.get(1);
            fld2 += (Integer) row.get(2);
            fld3 += (Integer) row.get(3);
            fld4 += (Integer) row.get(4);
            fld5 += (Integer) row.get(5);
            cnt ++;
        }
        if (firstRow == null) {
            // Empty bag: nothing to sum, nothing to emit.
            return;
        }
        // Field 0 has the id in it; fields 1-5 carry the summed values.
        firstRow.set(1, fld1);
        firstRow.set(2, fld2);
        firstRow.set(3, fld3);
        firstRow.set(4, fld4);
        firstRow.set(5, fld5);
        emitTo.add(firstRow);
    }

    public Schema outputSchema(Schema input) {
        try {
            Schema tupleSchema = new Schema();
            tupleSchema.add(input.getField(0));
            tupleSchema.setTwoLevelAccessRequired(true);
            return tupleSchema;
        } catch (Exception e) {
            // Schema lookup failed; fall through and return null.
        }
        return null;
    }
}
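For context, a minimal sketch of how a UDF like this might be invoked from Pig Latin. The jar name, alias names, and field layout below are assumptions for illustration, not taken from the original script:

```pig
-- Hypothetical usage sketch; register the jar containing the compiled UDF.
REGISTER myudfs.jar;
raw  = LOAD 'data' AS (id:int, f1:int, f2:int, f3:int, f4:int, f5:int);
grp  = GROUP raw BY id;
-- GroupSum returns a bag holding one tuple of per-group sums; FLATTEN unnests it.
sums = FOREACH grp GENERATE FLATTEN(GroupSum(raw));
```

Because GroupSum does not implement the Algebraic interface, Pig cannot apply the combiner to this FOREACH, which is exactly the indirect workaround Thejas describes above.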
On 7/9/10 2:32 PM, "Ashutosh Chauhan" <[EMAIL PROTECTED]> wrote:

> Hi Syed,
>
> Do you mean your query fails with OOME if you use Pig's builtin SUM,
> but succeeds if you use your own SUM UDF? If that is so, that's
> interesting. I have a hunch why that is the case, but would like to
> confirm. Would you mind sharing your SUM UDF?
>
> Ashutosh
> On Fri, Jul 9, 2010 at 12:50, Syed Wasti <[EMAIL PROTECTED]> wrote:
>> Hi Ashutosh,
>> Did not try options 2 and 3; I shall work on that sometime next week.
>> Increasing the heap size did not help initially, but with the increased heap
>> size I came up with a UDF to do the SUM on the grouped data for the last
>> step in my script, and it completes my query without any errors now.
>>
>> Syed
>>
>>
>> On 7/8/10 5:58 PM, "Ashutosh Chauhan" <[EMAIL PROTECTED]> wrote:
>>
>>> Aah.. forgot to tell how to set that param in 3). While launching
>>> pig, provide it as a -D command-line switch, as follows:
>>> pig -Dpig.cachedbag.memusage=0.02f myscript.pig
>>>
>>> On Thu, Jul 8, 2010 at 17:45, Ashutosh Chauhan
>>> <[EMAIL PROTECTED]> wrote:
>>>> I will recommend following things in the order:
>>>>
>>>> 1) Increasing heap size should help.
>>>> 2) It seems you are on 0.7. There are a couple of memory fixes we have
org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)