Pig user mailing list: ORDER ... LIMIT failing on large data


RE: ORDER ... LIMIT failing on large data
Thanks Jonathan and Prashant. The immediate cause of the problem I had (failing without erroring out) was slightly different formatting between the small and large input sets. Duh.

When I fixed that, I did indeed get OOM due to the nested distinct. I tried the workaround you suggested, Jonathan, using two groups, and it worked great!

In a separate run I also tried
  SET pig.exec.nocombiner true;
and found that it worked as well; the runtime was the same as with the two-group circumlocution.
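
For reference, a minimal sketch of that variant (untested; it just puts the property at the top of a script shaped like the one quoted below, with the nested DISTINCT left unchanged):

  -- Sketch only: disable the combiner for the whole script, then run the
  -- original grouping with the nested DISTINCT as-is.
  set pig.exec.nocombiner true;

  cr = load 'data' as (citeddocid:int, citingdocid:int);
  CitedItemsGrpByDocId = group cr by citeddocid;
  DedupTCPerDocId = foreach CitedItemsGrpByDocId {
      CitingDocids = cr.citingdocid;
      UniqCitingDocids = distinct CitingDocids;
      generate group, COUNT(UniqCitingDocids) as tc;
  };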

Thanks again for your help.

Will
William F Dowling
Senior Technologist
Thomson Reuters
-----Original Message-----
From: Jonathan Coveney [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 05, 2012 5:51 PM
To: [EMAIL PROTECTED]
Subject: Re: ORDER ... LIMIT failing on large data

Nested distincts are dangerous. They are not done in a distributed fashion;
they have to be loaded into memory. So that is what is killing it, not the
order/limit.

The alternative is to do two groups: first group by
(citeddocid, citingdocid) to get the distinct pairs, and then group by
citeddocid to get the count.
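
For reference, a minimal sketch of that two-group rewrite (untested; the field names follow the script quoted below, and the intermediate relation names pairs, uniq_pairs, and per_doc are made up here):

  -- Sketch only: the first group collapses duplicate (citeddocid, citingdocid)
  -- pairs in a distributed way; the second group counts distinct citing docs
  -- per cited doc.
  cr = load 'data' as (citeddocid:int, citingdocid:int);

  pairs      = group cr by (citeddocid, citingdocid);
  uniq_pairs = foreach pairs generate group.citeddocid as citeddocid;

  per_doc         = group uniq_pairs by citeddocid;
  DedupTCPerDocId = foreach per_doc generate group as citeddocid, COUNT(uniq_pairs) as tc;

  DedupTCPerDocIdSorted    = order DedupTCPerDocId by tc desc;
  DedupTCPerDocIdSorted500 = limit DedupTCPerDocIdSorted 500;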

2012/1/5 <[EMAIL PROTECTED]>

> I have a small Pig script that outputs the top 500 of a simple computed
> relation. It works fine on a small data set but fails on a larger (45 GB)
> data set. I don’t see errors in the Hadoop logs (but I may be looking in
> the wrong places). On the large data set the Pig log shows
>
> Input(s):
> Successfully read 1222894620 records (46581665598 bytes) from: "[...]"
>
> Output(s):
> Successfully stored 1 records (3 bytes) in: "hdfs://[...]"
>
> Counters:
> Total records written : 1
> Total bytes written : 3
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 4640
> Total records proactively spilled: 605383326
>
> On the small data set the Pig log shows
>
> Input(s):
> Successfully read 188865 records (6749318 bytes) from: "[...]"
>
> Output(s):
> Successfully stored 500 records (5031 bytes) in: "hdfs://[...]"
>
> Counters:
> Total records written : 500
> Total bytes written : 5031
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> The script is
>
> cr = load 'data' as
>     (
>       citeddocid  : int,
>       citingdocid : int
>     );
> CitedItemsGrpByDocId = group cr by citeddocid;
>
> DedupTCPerDocId = foreach CitedItemsGrpByDocId {
>          CitingDocids =  cr.citingdocid;
>          UniqCitingDocids = distinct CitingDocids;
>          generate group, COUNT(UniqCitingDocids) as tc;
>     };
>
> DedupTCPerDocIdSorted = ORDER DedupTCPerDocId by tc DESC;
> DedupTCPerDocIdSorted500 = limit DedupTCPerDocIdSorted 500;
> store DedupTCPerDocIdSorted500 [...]
>
>
> I assume I am just doing something grossly inefficient. Can someone
> suggest a better way? I’m using Apache Pig version 0.8.1-cdh3u1.
>
> Many thanks!
>
> Will
>
> William F Dowling
> Senior Technologist
>
> Thomson Reuters
>
>
>
>