Be careful putting them in HDFS. That approach does not scale very well, since the number of file opens will be on the order of (number of mappers) × (number of reducers). With a lot of mappers and reducers you can quickly mount a denial of service on the NameNode.
On 5/21/12 4:02 AM, "Harsh J" <[EMAIL PROTECTED]> wrote:
I guess you could write these archives onto HDFS and have your
reducers read them from a known location there, but this method may be
a bit ugly. See http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
for the proper way to write files from tasks onto a DFS, or look at
the MultipleOutputs API class.
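To illustrate the MultipleOutputs route mentioned above, here is a minimal sketch of a mapper that writes side files alongside its normal output. The named output "archive", the class name, and the key/value types are my own illustrative choices, not anything from the original mail; the driver must also register the named output with MultipleOutputs.addNamedOutput(...) for this to work:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Sketch only: a mapper that emits its regular KV output through the
// Context, and writes additional side records to a named output
// ("archive") that lands in separate files under the job's output dir.
public class ArchiveMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Regular map output, consumed by the reducer as usual.
        context.write(new Text("data"), value);

        // Side output: goes to files named archives/part-* instead of
        // being shuffled to the reducer.
        mos.write("archive", new Text("meta"), value, "archives/part");
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Must close, or the side files may be left incomplete.
        mos.close();
    }
}
```

In the driver you would pair this with something like MultipleOutputs.addNamedOutput(job, "archive", TextOutputFormat.class, Text.class, Text.class). This keeps the task-side file writing inside the framework's output-committer machinery, which is what the FAQ entry above is really about.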
Depending on how large these files are, you can also perhaps ship them
in via the KV pairs themselves. A custom key or sort comparator can
further ensure that they are delivered in the first iteration of the
reducer - useful if the file is required before regular reduce() ops can begin.
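The ordering trick above can be sketched without any Hadoop machinery: tag each key with a prefix byte so that the file-payload record sorts ahead of every data record, which a sort comparator (here just natural string ordering) then delivers first. The "0|"/"1|" tags and helper names are illustrative assumptions, not part of any Hadoop API:

```java
import java.util.Arrays;

// Sketch: keys tagged "0|" (the shipped file payload) sort before keys
// tagged "1|" (regular data), so with a single reducer the file record
// arrives in the very first reduce() iteration.
public class FirstRecordOrdering {

    // Hypothetical helpers that build tagged keys.
    static String fileKey(String name) { return "0|" + name; }
    static String dataKey(String name) { return "1|" + name; }

    public static void main(String[] args) {
        String[] keys = {
            dataKey("record-a"),
            fileKey("archive.tar"),
            dataKey("record-b"),
        };
        // Natural string order is what a default Text sort would do;
        // the "0|" tag guarantees the file key wins.
        Arrays.sort(keys);
        System.out.println(keys[0]); // prints "0|archive.tar"
    }
}
```

In a real job you would express the same ordering as a RawComparator set via Job.setSortComparatorClass(...), but the key-tagging idea is identical.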
On Mon, May 21, 2012 at 1:42 PM, biro lehel <[EMAIL PROTECTED]> wrote:
> Dear all,
> In my Mapper, I run a script that processes my set of input text files and creates other text files from them (this is done locally, on the FS of my nodes), so each map task produces an archive as its result. My issue is that I'm looking for a way for the Reducer to "take" these archives as some kind of input. I understand that Mapper-Reducer communication happens through the key-value pairs in the Context, but what I need is to transfer these archive files to the respective Reducer (I will probably have a single Reducer, so all the files should be transferred/copied there somehow).
> Is this possible? Is there a way to transfer files from Mapper to Reducer? If not, what is the best approach in scenarios like mine? Any suggestions would be greatly appreciated.
> Thank you in advance,