Limit number of Streaming Programs
Hi,

I have around 4 million time series. About 1000 of them had a special
occurrence at some point. Now I want to draw 10 samples for each
special time series, based on a similarity comparison.

What I have currently implemented is a Python script that consumes
the time series one by one and compares each against all 1000 special
time series. If the similarity to one of them is sufficient, I pass
the time series back to Pig and strike out the corresponding special
time series, so that subsequent time series are not compared against
it.
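
Roughly, the logic described above can be sketched as follows. This is
an illustrative sketch, not the actual script: the `similarity`
function, the threshold value, the tab/comma-separated input format and
the path of the special time series file are all placeholders assumed
for the example.

```python
import sys
from typing import Callable, Dict, Iterable, Iterator, Sequence, Tuple

Series = Sequence[float]


def similarity(a: Series, b: Series) -> float:
    """Placeholder measure (assumption): 1 / (1 + mean absolute difference)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    mad = sum(abs(x - y) for x, y in zip(a, b)) / n
    return 1.0 / (1.0 + mad)


def select_samples(
    candidates: Iterable[Tuple[str, Series]],
    specials: Dict[str, Series],
    threshold: float,
    sim: Callable[[Series, Series], float] = similarity,
) -> Iterator[Tuple[str, str]]:
    """Yield (candidate_id, special_id) for the first sufficiently similar pair.

    A matched special series is struck out, so later candidates are no
    longer compared against it.
    """
    remaining = dict(specials)  # copy, so the caller's dict is untouched
    for cand_id, cand in candidates:
        if not remaining:
            break  # every special series already has its sample
        match = next(
            (sid for sid, s in remaining.items() if sim(cand, s) >= threshold),
            None,
        )
        if match is not None:
            del remaining[match]
            yield cand_id, match


def parse_line(line: str) -> Tuple[str, Series]:
    """Assumed record format: '<id>TAB<v1>,<v2>,...'."""
    series_id, values = line.rstrip("\n").split("\t", 1)
    return series_id, [float(v) for v in values.split(",")]


if __name__ == "__main__":
    # Hypothetical invocation: sys.argv[1] is a file with the special series.
    with open(sys.argv[1]) as fh:
        special_series = dict(parse_line(l) for l in fh if l.strip())
    stream = (parse_line(l) for l in sys.stdin if l.strip())
    for cand_id, spec_id in select_samples(stream, special_series, 0.8):
        print(f"{cand_id}\t{spec_id}")
```

Each running instance of such a script keeps its own list of struck-out
special time series, so every additional instance Pig starts can
contribute another sample per special time series.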

This routine works, but it takes around 6 hours to run.

One of the problems I'm facing is that Pig starts more than 160
instances of the script, although 10 would be sufficient. Is there a
way to set the number of script instances Pig starts in a
`STREAM THROUGH` step? I tried setting default_parallel to 10, but it
doesn't seem to have any effect.

I'm also open to any other ideas on how to accomplish the task.

Regards,
Thomas Bach.