Pig >> mail # dev >> Total count of RandomSampleLoader is unpredictable


Re: Total count of RandomSampleLoader is unpredictable
Not sure if it's the same issue, but in some cases I also see the
"Map input records" counter report more records than the input
actually contains.

Jie

On Thu, Jul 26, 2012 at 6:04 PM, Prasanth J <[EMAIL PROTECTED]> wrote:
> Hello everyone
>
> I am using RandomSampleLoader to load 1000 tuples per mapper. I have 11 map tasks for a small dataset and 109 map tasks for a large dataset.
>
> I am expecting 11000 tuples from the small dataset and 109000 tuples from the large dataset. But the actual number of tuples that I get is always more than I expected. In the small dataset case I am getting 15000 tuples, whereas in the large dataset case I am getting 145000 (sometimes 150000) tuples.
>
> Is this a bug, or is it expected behavior? If reservoir sampling is used by all mappers, then why is the total number of samples higher?
>
> Thanks
> -- Prasanth
>
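
For reference, here is a minimal sketch of the arithmetic behind the expectation in the quoted message. It is not Pig's RandomSampleLoader code; it is a plain reservoir sampler (Algorithm R) run once per hypothetical input split. The names `reservoir_sample` and `total_sampled` and the split size of 5000 records are assumptions made up for illustration. Under this model each mapper emits at most k tuples, so the total can never exceed (number of mappers) x k.

```python
import random


def reservoir_sample(records, k, rng):
    """Return at most k items drawn uniformly from an iterable (Algorithm R)."""
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = record
    return reservoir


def total_sampled(split_sizes, k, seed=0):
    """Sum the sample sizes when each split is sampled independently."""
    rng = random.Random(seed)
    return sum(len(reservoir_sample(range(n), k, rng)) for n in split_sizes)


# Hypothetical splits of 5000 records each; only the mapper counts
# (11 and 109) and k = 1000 come from the message above.
small = total_sampled([5000] * 11, 1000)
large = total_sampled([5000] * 109, 1000)
print(small, large)  # 11000 109000
```

If each sampler really ran exactly once per counted map task, totals of 15000 or 145000 would not be reachable in this model, so comparing the number of sampler instances against the number of map tasks may be a useful first check.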