RE: ORDER ... LIMIT failing on large data
Thanks Jonathan and Prashant. The immediate cause of the problem I had (failing without erroring out) was slightly different formatting between the small and large input sets. Duh.

When I fixed that, I did indeed get an OOM due to the nested distinct. I tried the workaround you suggested, Jonathan, using two groups, and it worked great!

In a separate run I also tried
  SET pig.exec.nocombiner true;
and found that it worked as well, and the runtime was the same as with the two-group circumlocution.
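
A minimal sketch of that variant, assuming the setting simply goes at the top
of the otherwise unchanged script quoted further down (the comments are just
reminders, not from the thread):

    SET pig.exec.nocombiner true;   -- disable the combiner for the jobs this script generates
    cr = load 'data' as (citeddocid : int, citingdocid : int);
    -- ... remaining statements exactly as in the original nested-DISTINCT script ...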

Thanks again for your help.

Will
William F Dowling
Senior Technologist
Thomson Reuters
-----Original Message-----
From: Jonathan Coveney [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 05, 2012 5:51 PM
To: [EMAIL PROTECTED]
Subject: Re: ORDER ... LIMIT failing on large data

Nested distincts are dangerous. They are not done in a distributed fashion;
they have to be loaded into memory. So that is what is killing it, not the
order/limit.

The alternative is to do two groups: first group by
(citeddocid, citingdocid) to get the distinct pairs, and then group by
citeddocid to get the count.
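
A minimal sketch of that two-group rewrite, using the field names from the
script quoted below; the intermediate relation names are illustrative, not
from the thread:

    -- first group: one record per distinct (citeddocid, citingdocid) pair,
    -- computed as an ordinary distributed group rather than an in-memory DISTINCT
    CitedCitingPairs = group cr by (citeddocid, citingdocid);
    UniqPairs = foreach CitedCitingPairs generate
                    group.citeddocid  as citeddocid,
                    group.citingdocid as citingdocid;

    -- second group: count the distinct citing docs per cited doc
    PerCitedDoc = group UniqPairs by citeddocid;
    DedupTCPerDocId = foreach PerCitedDoc generate group as citeddocid,
                                                   COUNT(UniqPairs) as tc;

The ORDER ... LIMIT at the end of the original script can then run on
DedupTCPerDocId unchanged.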

2012/1/5 <[EMAIL PROTECTED]>

> I have a small pig script that outputs the top 500 of a simple computed
> relation. It works fine on a small data set but fails on a larger (45 GB)
> data set. I don’t see errors in the hadoop logs (but I may be looking in
> the wrong places). On the large data set the pig log shows
>
> Input(s):
> Successfully read 1222894620 records (46581665598 bytes) from: "[...]"
>
> Output(s):
> Successfully stored 1 records (3 bytes) in: "hdfs://[...]"
>
> Counters:
> Total records written : 1
> Total bytes written : 3
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 4640
> Total records proactively spilled: 605383326
>
> On the small data set the pig log shows
>
> Input(s):
> Successfully read 188865 records (6749318 bytes) from: "[...]"
>
> Output(s):
> Successfully stored 500 records (5031 bytes) in: "hdfs://[...]"
>
> Counters:
> Total records written : 500
> Total bytes written : 5031
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> The script is
>
> cr = load 'data' as
>     (
>       citeddocid  : int,
>       citingdocid : int
>     );
> CitedItemsGrpByDocId = group cr by citeddocid;
>
> DedupTCPerDocId = foreach CitedItemsGrpByDocId {
>          CitingDocids =  cr.citingdocid;
>          UniqCitingDocids = distinct CitingDocids;
>          generate group, COUNT(UniqCitingDocids) as tc;
>     };
>
> DedupTCPerDocIdSorted = ORDER DedupTCPerDocId by tc DESC;
> DedupTCPerDocIdSorted500 = limit DedupTCPerDocIdSorted 500;
> store DedupTCPerDocIdSorted500 [...]
>
>
> I assume I am just doing something grossly inefficient. Can someone
> suggest a better way? I’m using Apache Pig version 0.8.1-cdh3u1.
>
> Many thanks!
>
> Will
>
> William F Dowling
> Senior Technologist
>
> Thomson Reuters
>
>
>
>