Hi,

I'm wondering how Spark assigns the "index" of a task.
I'm asking because we have a job that consistently fails at
task index = 421.

When we increase the number of partitions, it then fails at index = 4421.
Increase it a bit more and the failure moves to index = 24421.

Our job is as simple as "(1) read json -> (2) group by session identifier ->
(3) write parquet files" and it always fails somewhere in step (3) with a
CommitDeniedException. We've identified that part of the trouble comes from
uneven data distribution (skew) right after step (2), and we're now trying to
go further in our understanding of how Spark behaves. A sketch of the
pipeline follows below.
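
For context, the job looks roughly like this (a minimal sketch with the Spark 1.5 DataFrame API; the column names, paths and aggregation are illustrative, not our actual code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object SessionJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("session-job"))
    val sqlContext = new SQLContext(sc)

    // (1) read json
    val events = sqlContext.read.json("hdfs:///path/to/events")

    // (2) group by session identifier; a heavily skewed session key here
    // produces one very large partition downstream
    val sessions = events
      .groupBy("sessionId")
      .agg(count("*").as("eventCount"))

    // (3) write parquet files -- this is where the CommitDeniedException
    // shows up, always on one particular task index
    sessions.write.parquet("hdfs:///path/to/output")
  }
}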

We're using Spark 1.5.2, Scala 2.11, on top of Hadoop 2.6.0.

*Adrien Mogenet*
Head of Backend/Infrastructure
[EMAIL PROTECTED]
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris
