Pig, mail # user - ORDER ... LIMIT failing on large data


ORDER ... LIMIT failing on large data
william.dowling@... 2012-01-05, 22:16
I have a small Pig script that outputs the top 500 rows of a simple computed relation. It works fine on a small data set but fails on a larger (45 GB) data set: the job reports success but stores only 1 record. I don’t see errors in the Hadoop logs (but I may be looking in the wrong places). On the large data set the Pig log shows:

Input(s):
Successfully read 1222894620 records (46581665598 bytes) from: "[...]"

Output(s):
Successfully stored 1 records (3 bytes) in: "hdfs://[...]"

Counters:
Total records written : 1
Total bytes written : 3
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 4640
Total records proactively spilled: 605383326

On the small data set the Pig log shows:

Input(s):
Successfully read 188865 records (6749318 bytes) from: "[...]"

Output(s):
Successfully stored 500 records (5031 bytes) in: "hdfs://[...]"

Counters:
Total records written : 500
Total bytes written : 5031
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

The script is

-- one row per citation: citingdocid cites citeddocid
cr = load 'data' as
     (
       citeddocid  : int,
       citingdocid : int
     );
CitedItemsGrpByDocId = group cr by citeddocid;

-- count the distinct citing documents for each cited document
DedupTCPerDocId = foreach CitedItemsGrpByDocId {
          CitingDocids =  cr.citingdocid;
          UniqCitingDocids = distinct CitingDocids;
          generate group, COUNT(UniqCitingDocids) as tc;
     };

-- keep the 500 most-cited documents
DedupTCPerDocIdSorted = ORDER DedupTCPerDocId by tc DESC;
DedupTCPerDocIdSorted500 = limit DedupTCPerDocIdSorted 500;
store DedupTCPerDocIdSorted500 [...]

I assume I am just doing something grossly inefficiently. Can someone suggest a better way? I’m using Apache Pig version 0.8.1-cdh3u1.
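
To pin down the result the script is meant to produce, the same computation can be modeled in plain Python on a toy list of pairs. This is only a reference sketch for checking output on small inputs, not a fix for the failure on the large data set. The function name `top_cited`, the toy data, and the tie-breaking rule are made up for illustration, and it assumes there are no null ids.

```python
from collections import defaultdict

def top_cited(pairs, n=500):
    """Return up to n (citeddocid, tc) tuples, highest tc first.

    pairs is an iterable of (citeddocid, citingdocid) tuples.
    Hypothetical helper; assumes no null ids in the input.
    """
    citing = defaultdict(set)          # citeddocid -> set of distinct citingdocids
    for cited, citer in pairs:
        citing[cited].add(citer)       # the set plays the role of the nested DISTINCT
    counts = [(cited, len(docs)) for cited, docs in citing.items()]
    # ORDER BY tc DESC; ties are broken by docid only to keep the output deterministic
    counts.sort(key=lambda row: (-row[1], row[0]))
    return counts[:n]                  # LIMIT n

# Toy data: doc 1 is cited twice by doc 10, which should count once.
toy = [(1, 10), (1, 10), (1, 11), (2, 10), (3, 12), (3, 13), (3, 14)]
print(top_cited(toy, n=2))             # [(3, 3), (1, 2)]
```

On the small data set, comparing the stored 500 records against a model like this is one way to confirm that the script itself computes the right thing before looking further at the large run.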

Many thanks!

Will

William F Dowling
Senior Technologist

Thomson Reuters
