Pig job is taking more time than Java M/R
Hey Guys,

Is there any way to see the M/R jobs that Pig runs internally for a given
Pig script?
I want to get the unique values of a particular column.

For that I wrote the following script:

Data = LOAD 'Data.csv' USING PigStorage(',');
IDs = FOREACH Data GENERATE $0;
UniqueID = DISTINCT IDs;
DUMP UniqueID;

Is this the right/best way to get the unique values of a particular column?

The reason I am asking is that I ran the above script on my cluster and it
took around 30 minutes to finish.
However, when I wrote traditional Java M/R code for the same task, it
took only 10 minutes.

So I want to see what Pig is doing internally.
Can anyone tell me what could be the reason for this behaviour? How can I
decrease the Pig execution time?

Thanks,
Praveenesh