ORDER ... LIMIT failing on large data
I have a small Pig script that outputs the top 500 rows of a simple computed relation. It works fine on a small data set but fails on a larger (45 GB) data set. I don't see errors in the Hadoop logs (but I may be looking in the wrong places). On the large data set the Pig log shows:

Input(s):
Successfully read 1222894620 records (46581665598 bytes) from: "[...]"

Output(s):
Successfully stored 1 records (3 bytes) in: "hdfs://[...]"

Counters:
Total records written : 1
Total bytes written : 3
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 4640
Total records proactively spilled: 605383326

On the small data set the Pig log shows:

Input(s):
Successfully read 188865 records (6749318 bytes) from: "[...]"

Output(s):
Successfully stored 500 records (5031 bytes) in: "hdfs://[...]"

Counters:
Total records written : 500
Total bytes written : 5031
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

The script is

cr = load 'data' as
     (
       citeddocid  : int,
       citingdocid : int
     );
CitedItemsGrpByDocId = group cr by citeddocid;

DedupTCPerDocId = foreach CitedItemsGrpByDocId {
          CitingDocids =  cr.citingdocid;
          UniqCitingDocids = distinct CitingDocids;
          generate group, COUNT(UniqCitingDocids) as tc;
     };

DedupTCPerDocIdSorted = ORDER DedupTCPerDocId by tc DESC;
DedupTCPerDocIdSorted500 = limit DedupTCPerDocIdSorted 500;
store DedupTCPerDocIdSorted500 [...]
I assume I am just doing something grossly inefficient. Can someone suggest a better way? I'm using Apache Pig version 0.8.1-cdh3u1.

Many thanks!

Will

William F Dowling
Senior Technologist

Thomson Reuters
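
One restructuring sometimes tried for scripts of this shape is to deduplicate the (citeddocid, citingdocid) pairs before grouping, so that the nested distinct inside the foreach is no longer needed. The following is an untested sketch under that assumption, not a confirmed fix for the failure described above; the logs quoted in the message do not establish the cause. The alias names UniqPairs, GrpByDocId, TCPerDocId, TCSorted and Top500 are hypothetical, and the store target is left out because it is elided in the original.

```pig
-- Same input and schema as the script in the message.
cr = load 'data' as (citeddocid : int, citingdocid : int);

-- Drop duplicate (cited, citing) pairs up front instead of inside each group.
UniqPairs = distinct cr;

-- Each remaining row is one unique citing document for its cited document.
GrpByDocId = group UniqPairs by citeddocid;
TCPerDocId = foreach GrpByDocId generate
                 group as citeddocid,
                 COUNT(UniqPairs.citingdocid) as tc;

-- Top 500 by count, as in the original script.
TCSorted = order TCPerDocId by tc desc;
Top500   = limit TCSorted 500;
-- store Top500 into the same (elided) location as the original script.
```

The intent is only to move the deduplication out of the per-group bags; whether that changes the reported 1-record output is something that would have to be checked on the actual data.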

Jonathan Coveney 2012-01-05, 22:50
william.dowling@... 2012-01-06, 21:11
Jonathan Coveney 2012-01-06, 21:35
Prashant Kommireddi 2012-01-05, 23:14