

Task Side Effect files and copying(getWorkOutputPath)
Hello,
I would like to produce side-effect files that will later be copied
to the output folder.
I am using FileOutputFormat, and in the Map's close() method I copy
files (from the local tmp/ folder) to
FileOutputFormat.getWorkOutputPath(job):

void close() .... {
    if (shouldcopy) {
        // Collect every file in the local temp directory
        ArrayList<Path> lop = new ArrayList<Path>();
        for (String ff : tempdir.list()) {
            lop.add(new Path(temppfx + ff));
        }
        // Move them all to the task's work output path
        dstFS.moveFromLocalFile(lop.toArray(new Path[]{}), dstPath);
    }
}

However, this throws an error java.io.IOException:
`hdfs://X:54310/tmp/testseq/_temporary/_attempt_200903160945_0010_m_000000_0':
specified destination directory doest not exist

I thought this was the right place to drop side-effect files. Prior
to this I was copying to the output folder, but many files were not
copied, or perhaps all were copied and many were then deleted during
the reduce output stage; I am not sure. (When I used NullOutputFormat,
all the files were present in the output folder.) So I resorted to
getWorkOutputPath, which threw the above exception.
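
The exception suggests that the attempt directory under _temporary did
not exist yet when close() ran. Below is a minimal sketch of the
create-then-move pattern, assuming that is the cause. It uses only
java.nio.file on the local filesystem, not the Hadoop API, and the names
SideEffectMover and moveAll are hypothetical. In Hadoop the analogous
step would presumably be calling FileSystem.mkdirs(dstPath) before
moveFromLocalFile; whether that is the right fix on 0.19 is an
assumption, not something confirmed here.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SideEffectMover {

    /**
     * Moves every regular file in localDir into destDir.
     * destDir (and any missing parents) is created first, so the move
     * cannot fail because the destination directory is absent.
     * Returns the new locations of the moved files.
     */
    public static List<Path> moveAll(Path localDir, Path destDir) throws IOException {
        // Create the destination up front; this is a no-op if it already exists.
        Files.createDirectories(destDir);

        // Snapshot the listing before moving, so the directory is not
        // modified while it is being iterated.
        List<Path> sources = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(localDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    sources.add(entry);
                }
            }
        }

        List<Path> moved = new ArrayList<>();
        for (Path src : sources) {
            moved.add(Files.move(src, destDir.resolve(src.getFileName())));
        }
        return moved;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical demo data: two uniquely named side-effect files.
        Path local = Files.createTempDirectory("sidefx-local");
        Files.write(local.resolve("part-a.dat"), "a".getBytes());
        Files.write(local.resolve("part-b.dat"), "b".getBytes());

        // A destination that does not exist yet, mimicking a missing attempt directory.
        Path dest = Files.createTempDirectory("sidefx-out")
                .resolve("_temporary").resolve("_attempt_demo");

        List<Path> moved = moveAll(local, dest);
        System.out.println("moved " + moved.size() + " files to " + dest);
    }
}
```

Creating the directory inside the same method as the move keeps the two
steps together, so a caller cannot do one without the other.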

So if I'm using FileOutputFormat, and my maps and/or reduces produce
side-effect files on the local FS:
1) When should I copy them to the DFS (e.g. in the close() method, or
one at a time in the map/reduce method)?
2) Where should I copy them to?

I am using Hadoop 0.19 and have set jobConf.setNumTasksToExecutePerJvm(-1).
Also, each side-effect file produced has a unique name, i.e. there is
no overwriting.

Thank you
Saptarshi Guha