Pig >> mail # user >> Limit number of Streaming Programs


Thomas Bach 2012-12-18, 20:00
Re: Limit number of Streaming Programs
Hi Thomas,

If I understand your question correctly, what you want is to reduce the
number of mappers that spawn streaming processes. The default_parallel
setting controls the number of reducers, so it has no effect on the number
of mappers. The number of mappers is determined automatically from the size
of the input data, but you can set "pig.maxCombinedSplitSize" to combine
small input files into bigger splits. For more details, please refer to:
http://pig.apache.org/docs/r0.10.0/perf.html#combine-files
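For illustration, a hedged sketch of how that property might appear in a Pig script (the 256 MB value, the load path, and the schema are illustrative assumptions, not from this thread):

```pig
-- Combine small input files into splits of up to ~256 MB (268435456 bytes)
-- each; fewer splits means fewer map tasks, and thus fewer streaming
-- processes spawned on the map side.
SET pig.maxCombinedSplitSize 268435456;

series = LOAD 'input/timeseries' AS (id:chararray, values:chararray);
```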

You can also read a discussion on a similar topic here:
http://search-hadoop.com/m/J5hCw1UdxTa/How+can+I+set+the+mapper+number&subj=How+can+I+set+the+mapper+number+for+pig+script+

Thanks,
Cheolsoo
On Tue, Dec 18, 2012 at 12:00 PM, Thomas Bach
<[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have around 4 million time series. ~1000 of them had a special
> occurrence at some point. Now, I want to draw 10 samples for each
> special time-series based on a similarity comparison.
>
> What I have currently implemented is a script in Python which consumes
> time-series one-by-one and does a comparison with all 1000 special
> time-series. If the similarity is sufficient with one of them I pass
> it back to Pig and strike out the corresponding special time-series, so
> subsequent time-series will not be compared against it.
>
> This routine works, but it takes around 6 hours.
>
> One of the problems I'm facing is that Pig starts >160 scripts
> although 10 would be sufficient. Is there some way to define the
> number of scripts Pig starts in a `STREAM THROUGH` step? I tried to
> set default_parallel to 10, but it doesn't seem to have any effect.
>
> I'm also open to any other ideas on how to accomplish the task.
>
> Regards,
>         Thomas Bach.
>
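One possible workaround for capping the number of streaming instances, sketched here under stated assumptions rather than taken from this thread's replies: since default_parallel only affects reducers, the STREAM can be pushed onto the reduce side by grouping first. The relation names, bucket count, schema, and `compare.py` are all illustrative:

```pig
-- Illustrative sketch: bucket records into 10 groups at random so the job
-- runs with 10 reducers; a STREAM that follows a GROUP executes on the
-- reduce side, so at most 10 copies of the script are launched.
series   = LOAD 'input/timeseries' AS (id:chararray, values:chararray);
bucketed = GROUP series BY (int)(RANDOM() * 10) PARALLEL 10;
flat     = FOREACH bucketed GENERATE FLATTEN(series);
matched  = STREAM flat THROUGH `compare.py`;
```

Whether the stream actually stays on the reduce side can depend on how Pig compiles the plan, so the resulting job should be checked with EXPLAIN.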
Further replies in this thread:
Kshiva Kps 2012-12-25, 05:39
Prasanth J 2012-12-25, 12:46
Mohammad Tariq 2012-12-25, 05:49