I find that in most cases where I solve Hadoop problems, the solution
consists of several jobs chained together. When the problem is solved, the
solution is almost never wanted in the form of a collection of files named
things like part-r-00000. It is usually the case that even the boundaries
of the files have little to do with Hadoop. A good solution seems to be to
run a last Hadoop job to convert the data into a file that others can use.
I am currently working on a problem which can be imagined like this: I
have a large number of 'customers'. When the job is done, the next stage
wants a series of files containing the customers living in each county, one
file per county, in, say, CSV format.
If we use the county name as the key, one reducer will receive all of the
customers in that county. The reducer opens an HDFS file named for the
county with the task attempt number and .tmp appended. When the key is
finished, the file is renamed to the county name with .csv appended.
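
To make the write-then-rename idea concrete, here is a minimal sketch of the
pattern. It is only an illustration: it runs against the local file system
through java.nio.file rather than HDFS, and the names CountyFileSketch,
writeCounty and attemptId are hypothetical. The behaviour shown when the
destination already exists is what java.nio does, and HDFS may well behave
differently, which is exactly what the questions in this post are about.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical class: a local-file-system stand-in for the reducer logic.
public class CountyFileSketch {

    /**
     * Writes the rows for one county to "<county>.<attemptId>.tmp", then
     * renames that file to "<county>.csv". Returns true if this attempt
     * published the final file, false if the final file already existed.
     */
    public static boolean writeCounty(Path dir, String county, String attemptId,
                                      List<String> rows) throws IOException {
        Path tmp = dir.resolve(county + "." + attemptId + ".tmp");
        Path dest = dir.resolve(county + ".csv");
        try {
            try (BufferedWriter out =
                     Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                for (String row : rows) {
                    out.write(row);
                    out.newLine();
                }
            }
            // Without REPLACE_EXISTING, java.nio refuses to overwrite dest.
            // This is local-file-system behaviour, not a statement about HDFS.
            Files.move(tmp, dest);
            return true;
        } catch (FileAlreadyExistsException e) {
            // Another attempt has already published this county.
            return false;
        } finally {
            // Remove the temporary file if it is still there (losing attempt
            // or failed write); a no-op after a successful move.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("counties");
        List<String> rows = Arrays.asList("1,Alice", "2,Bob");
        // Two attempts at the same county: the first publishes, the second loses.
        System.out.println(writeCounty(dir, "King", "attempt_0", rows));
        System.out.println(writeCounty(dir, "King", "attempt_1", rows));
    }
}
```

In this local sketch the losing attempt cleans up after itself in the finally
block. That does not help an attempt that is killed outright, which is the
second concern below.
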
1) The rename is a small concurrency sin, since multiple attempts may
attempt the same rename at the same time.
a) It is unclear whether rename in an HDFS file system succeeds if the
destination path exists - does it?
b) Does failure throw an exception or simply return false?
2) When one attempt succeeds, the others will be killed. These killed tasks
may have open temporary files that should be deleted. Is there code which
will be called as a task is killed, say is cleanup called, or some
killcleanup, that can delete temporary files?
Is there a better way assuming files must be created and reference keys and