Pig >> mail # dev >> Number of mappers in MRCompiler


Prasanth J 2012-07-25, 22:47
Alan Gates 2012-07-25, 22:49
Prasanth J 2012-07-25, 23:28
Re: Number of mappers in MRCompiler
Sorry for the very slow response, but here it is, hopefully better late than never.

On Jul 25, 2012, at 4:28 PM, Prasanth J wrote:

> Thanks Alan.
> The requirement for me is that I want to load N samples based on the input file size and perform a naive cube computation to determine the large groups that will not fit in a reducer's memory. I need to know the exact number of samples to calculate the partition factor for large groups.
> Currently I am using RandomSampleLoader to load 1000 tuples from each mapper. Without knowing the number of mappers, I cannot determine the exact number of samples loaded. Also, RandomSampleLoader doesn't attach a special marker tuple (as PoissonSampleLoader does) that tells the number of samples loaded.
> Is there any other way to know the exact number of samples loaded?
Not that I know of.
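
To make the dependency on the mapper count concrete, here is a minimal Python sketch (not Pig code). It assumes one mapper per input split and that each mapper contributes at most a fixed number of sample tuples (1000, as in the quoted message), or fewer if its split holds fewer records. The function name and the per-split record counts are hypothetical.

```python
def total_samples(records_per_split, samples_per_mapper=1000):
    """Return the total number of sampled tuples across all mappers.

    Assumptions (not taken from Pig's source): one mapper per split, and
    each mapper emits at most `samples_per_mapper` tuples, or every record
    in its split if the split is smaller than that.
    """
    return sum(min(n, samples_per_mapper) for n in records_per_split)


# Hypothetical split sizes: three large splits and one short one.
splits = [250_000, 250_000, 250_000, 400]
print(total_samples(splits))  # 3 * 1000 + 400 = 3400
```

Under these assumptions the total cannot be computed without the number of splits (mappers) and the size of any short split, which is the information the sampler does not expose.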

>
> By analyzing the MR plans of order-by and skewed-join, it seems that the entire dataset is copied to a temp file and the SampleLoaders then load samples from that temp file. Is there any specific reason for this redundant copy? Is it because SampleLoaders can only use Pig's internal I/O format?
Partly, but also because it allows any operators that need to run before the sample (like project or filter) to be placed in the pipeline.
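
The ordering described here can be sketched in plain Python. This is an illustration only, not Pig's implementation: upstream operators (a filter and a projection) run first, their output is written to a temporary file, and only then does a sampler read from that file. All names are hypothetical, and reservoir sampling stands in for whatever the sample loaders actually do.

```python
import json
import os
import random
import tempfile


def materialize(rows, path):
    """Write rows to a file, one JSON record per line."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def sample_from_file(path, k, seed=0):
    """Reservoir-sample up to k rows from the materialized file."""
    rng = random.Random(seed)
    reservoir = []
    with open(path) as f:
        for i, line in enumerate(f):
            row = json.loads(line)
            if i < k:
                reservoir.append(row)
            else:
                # Keep the new row with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    reservoir[j] = row
    return reservoir


def run_pipeline(rows, k):
    """Filter and project first, then sample from the intermediate file."""
    filtered = (r for r in rows if r["value"] > 0)      # hypothetical filter
    projected = ({"key": r["key"]} for r in filtered)   # hypothetical project
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "intermediate.jsonl")
        materialize(projected, path)
        return sample_from_file(path, k)


rows = [{"key": i, "value": i - 5} for i in range(20)]
print(run_pipeline(rows, 3))
```

Because the sampler only ever sees the intermediate file, the sample reflects the data after filtering and projection rather than the raw input.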

Alan.
Prasanth J 2012-08-24, 02:55
Dmitriy Ryaboy 2012-08-24, 03:20
Prasanth J 2012-08-24, 03:23