Sorry, meant with MR. It may be more helpful to try to fix the issue there, then see whether it carries over to Spark or not, since we are not sure we expect that to work at all.

From: Ben Juhn [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 18, 2016 2:05 PM
To: [EMAIL PROTECTED]
Subject: Re: Processing many map only collections in single pipeline with spark

It’s doing the same thing.  One job shows up in the Spark UI at a time.

Thanks,
Ben
On Jul 16, 2016, at 7:29 PM, David Ortiz <[EMAIL PROTECTED]> wrote:

Hmm.  Just out of curiosity, what if you do Pipeline.read in place of readTextFile?
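Something along these lines (just a sketch; From.textFile is the standard text source factory, so this should be equivalent to readTextFile but goes through the generic read path):

// read via an explicit Source instead of the readTextFile convenience method
PCollection<String> col = pipeline.read(From.textFile(path));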

On Sat, Jul 16, 2016, 10:08 PM Ben Juhn <[EMAIL PROTECTED]> wrote:
Nope, it queues up the jobs in series there too.

On Sat, Jul 16, 2016, 5:36 PM David Ortiz <[EMAIL PROTECTED]> wrote:
Just out of curiosity, if you use MRPipeline does it run in parallel?  If so, the issue may be in Spark, since I believe Crunch leaves it to Spark to determine the best method of execution.
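Something like this (a sketch; "Driver" is just a stand-in for your job's main class):

// swap the Spark pipeline for the classic MapReduce implementation
Pipeline pipeline = new MRPipeline(Driver.class, new Configuration());
// ... then run the same readTextFile/parallelDo/write loop as before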

On Sat, Jul 16, 2016, 4:29 PM Ben Juhn <[EMAIL PROTECTED]> wrote:
Hey David,

I have 100 active executors, and each job typically only uses a few.  It’s running on YARN.

Thanks,
Ben

On Jul 16, 2016, at 12:53 PM, David Ortiz <[EMAIL PROTECTED]> wrote:

What are the cluster resources available vs what a single map uses?

On Sat, Jul 16, 2016, 3:04 PM Ben Juhn <[EMAIL PROTECTED]> wrote:
I enabled FAIR scheduling hoping that would help, but only one job shows up at a time.
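For reference, I enabled it with the standard Spark setting (a sketch; sparkConf stands in for the SparkConf the pipeline is built from):

// tell Spark to use the FAIR scheduler instead of the default FIFO
sparkConf.set("spark.scheduler.mode", "FAIR");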

Thanks,
Ben

On Jul 15, 2016, at 8:17 PM, Ben Juhn <[EMAIL PROTECTED]> wrote:

Each input is of a different format, and the DoFn implementation handles them depending on instantiation parameters.
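Roughly like this (a sketch; the real format-specific parsing is elided):

// one DoFn class, parameterized by input path so it can pick the right format
public class MyDoFn extends DoFn<String, String> {
    private final String path;

    public MyDoFn(String path) {
        this.path = path;
    }

    @Override
    public void process(String input, Emitter<String> emitter) {
        // format-specific handling chosen from 'path' would go here
        emitter.emit(input);
    }
}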

Thanks,
Ben

On Jul 15, 2016, at 7:09 PM, Stephen Durfey <[EMAIL PROTECTED]> wrote:

Instead of using readTextFile on the pipeline, try using the read method with a TextFileSource, which can accept a collection of paths.

https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/TextFileSource.java
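Something like this (a sketch; it assumes the List<Path> constructor in the class linked above):

// build one source over all input paths so they land in a single read
List<Path> inputs = new ArrayList<>();
for (String path : paths) {
    inputs.add(new Path(path));
}
PCollection<String> all = pipeline.read(new TextFileSource<>(inputs, Writables.strings()));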

On Fri, Jul 15, 2016 at 8:53 PM -0500, "Ben Juhn" <[EMAIL PROTECTED]> wrote:
Hello,

I have a job configured the following way:

for (String path : paths) {
    PCollection<String> col = pipeline.readTextFile(path);
    col.parallelDo(new MyDoFn(path), Writables.strings())
       .write(To.textFile("out/" + path), Target.WriteMode.APPEND);
}
pipeline.done();

It results in one Spark job for each path, and the jobs run in sequence even though there are no dependencies between them.  Is it possible to have the jobs run in parallel?

Thanks,

Ben