Pig >> mail # user >> Limit number of Streaming Programs


Limit number of Streaming Programs
Hi,

I have around 4 million time series. About 1000 of them had a special
occurrence at some point. Now I want to draw 10 samples for each
special time series, based on a similarity comparison.

What I have currently implemented is a Python script which consumes
the time series one by one and compares each against all 1000 special
time series. If the similarity with one of them is sufficient, I pass
the time series back to Pig and strike out the corresponding special
time series, so subsequent time series will not be compared against it.
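
As a rough illustration, the routine described above could be sketched
as follows. This is not the actual script: the record layout
(`id<TAB>v1,v2,...`), the `similarity` measure, the `threshold`, and the
`quota` parameter (how many matches a special time series accepts before
it is struck out) are all assumptions made for the example.

```python
import sys
from typing import Dict, Iterable, Iterator, List, TextIO, Tuple

Series = List[float]


def similarity(a: Series, b: Series) -> float:
    """Hypothetical measure: 1 / (1 + mean absolute difference).

    Returns 1.0 for identical series and approaches 0 as they diverge.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    mad = sum(abs(x - y) for x, y in zip(a, b)) / n
    return 1.0 / (1.0 + mad)


def parse_line(line: str) -> Tuple[str, Series]:
    """Parse one 'id<TAB>v1,v2,...' record (assumed layout)."""
    series_id, raw = line.rstrip("\n").split("\t", 1)
    return series_id, [float(v) for v in raw.split(",")]


def draw_samples(
    records: Iterable[Tuple[str, Series]],
    specials: Dict[str, Series],
    threshold: float = 0.8,
    quota: int = 1,
) -> Iterator[Tuple[str, str]]:
    """Yield (candidate_id, special_id) pairs.

    A special series is struck out once it has received `quota` matches,
    so later candidates are no longer compared against it.
    """
    remaining = {special_id: quota for special_id in specials}
    for series_id, values in records:
        # Iterate over a copy because entries are deleted on the fly.
        for special_id in list(remaining):
            if similarity(values, specials[special_id]) >= threshold:
                yield series_id, special_id
                remaining[special_id] -= 1
                if remaining[special_id] == 0:
                    del remaining[special_id]  # strike out
                break  # each candidate is handed out at most once


def run(stdin: TextIO, stdout: TextIO, specials: Dict[str, Series]) -> None:
    """Streaming loop: read records, write matches as tab-separated lines."""
    records = (parse_line(line) for line in stdin if line.strip())
    for series_id, special_id in draw_samples(records, specials):
        stdout.write(f"{series_id}\t{special_id}\n")
```

In a streaming script, `run(sys.stdin, sys.stdout, specials)` would be
called at the bottom, with `specials` loaded from wherever the ~1000
special time series are stored. Note that the struck-out state lives
inside one process, so every script instance Pig starts keeps its own
independent list.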

This routine runs, but it takes around 6 hours.

One of the problems I'm facing is that Pig starts more than 160
instances of the script, although 10 would be sufficient. Is there
some way to define the number of script instances Pig starts in a
`STREAM THROUGH` step? I tried to set default_parallel to 10, but it
doesn't seem to have any effect.

I'm also open to any other ideas on how to accomplish the task.

Regards,
Thomas Bach.