I have been thinking recently about various functions I plan to implement
in MapReduce. One common theme is that I have many reducers to increase
the parallelism and so the performance. However this means that I potentially
end up with many separate files in HDFS, one from each reducer. This isn't necessarily a problem in
and of itself. However it's a little awkward having lots of little files containing
the contents of one logical file. And also, presumably, it's extra load on the
name-server. (Of course I could have one output file by having a single reducer
but this seems to defeat the purpose of having a parallel solution.)
So this led me to wondering what would be the best way to concatenate all
these HDFS files into one file on HDFS. There doesn't seem to be a FSShell command
to do this. I could write something in Java just using the FSDataStream classes
to open all the input files and write to the single output file.
But this seemed a little primitive and with little scope for parallelism.
Also it would seem sensible to have some
way of doing this which would avoid the need to hold two copies of the data
in HDFS. Or something which merged at the block level and so never needed to
hold more than one or two blocks as overhead.
I see that there's HDFS-222 but that seems to be targeted at the case
where all of the files, bar the last one, consist of an integral number
of whole blocks. Which doesn't apply in my case.
Is there already something to do this or have I missed something?
Trillium Software UK Limited