It may be a stupid question, but my application could do without sorting
by key. If only reducers could be told to start work on the first map
outputs they see, my processing would begin to show results much earlier,
before all the mappers are done. Of course, all mappers will eventually
have to finish, so I would not shorten the total job duration; I would
only get the first results sooner.
Then, of course, I could obtain some intermediate statistics with counters
or with an additional NoSQL database.
I am also concerned about the millions of map outputs that my mappers are
emitting - is that OK? Am I putting too much of a burden on the shuffle stage?