Even after reading O'Reilly's book on Hadoop, I don't feel like I have a clear
picture of how map tasks get assigned.
They depend on splits, right?
But I have three jobs running, and their splits come from different sources:
HDFS, S3, and slow HTTP endpoints.
So I'm concerned about how the map tasks will be distributed while the data is
being acquired.
Can I do anything to keep the cluster from sitting effectively idle on slow
HTTP downloads, when the boxes could be doing HTTP downloads for one job and
reading large files off HDFS for another at the same time?
I'm imagining a scenario where the only running map tasks are all blocked on
splits that require HTTP downloads, while the HDFS-backed splits queue up
behind them, even though they'd run more efficiently in parallel.