MapReduce >> mail # user >> IOException when using MultipleSequenceFileOutputFormat


Jason Yang 2012-09-17, 13:50
Harsh J 2012-09-18, 02:38
Jason Yang 2012-09-18, 03:44
Harsh J 2012-09-18, 03:59
Jason Yang 2012-09-18, 06:07
Re: IOException when using MultipleSequenceFileOutputFormat
>> If I use the SequenceFileOutputFormat instead of
MultipleSequenceFileOutputFormat, this program works fine (at least
there is no error in the log). <<

I might suggest an alternative fix: maybe your ulimit is too low on your
pseudo-distributed machine? Because you are partitioning clustering output
into separate files, you may end up with a lot of very small files, and
possibly many more of them than you would normally write from a single
node, as Harsh suggests.
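
One quick check (my assumption, not something visible in your log so far): the
per-process open-file limit for the user running the pseudo-distributed
daemons. For example:

```shell
# Show the current per-process open-file limit; a job that writes one
# sequence file per cluster can blow past a low default such as 1024.
ulimit -n
# To raise it for a test run (8192 is just an illustrative value):
#   ulimit -n 8192
# or configure "nofile" limits in /etc/security/limits.conf and re-login.
```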

On Mon, Sep 17, 2012 at 10:38 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> Hi Jason,
>
> How many unique keys are you going to be generating from this program,
> roughly?
>
> By default, the max load of a DN is about 4k threads, and if you're
> trying to push beyond that value then the NN will no longer select the
> DN, as it would consider it already overloaded. In fully distributed
> mode you may not see this issue, as there are several DNs and TTs to
> distribute the write load across.
>
> Try with a smaller input sample if there's a whole lot of keys you'll
> be creating files for, and see if that works instead (such that
> there's fewer files and you do not hit the xceiver/load limits).
>
> On Mon, Sep 17, 2012 at 7:20 PM, Jason Yang <[EMAIL PROTECTED]>
> wrote:
> > Hi, all
> >
> > I have written a simple MR program which partitions a file into multiple
> > files based on the clustering result of the points in this file; here is
> > my code:
> > ---
> > private int run() throws IOException
> > {
> >     String scheme = getConf().get(CommonUtility.ATTR_SCHEME);
> >     String ecgDir = getConf().get(CommonUtility.ATTR_ECG_DATA_DIR);
> >     String outputDir = getConf().get(CommonUtility.ATTR_OUTPUT_DIR);
> >
> >     // create JobConf
> >     JobConf jobConf = new JobConf(getConf(), this.getClass());
> >
> >     // set paths for input and output
> >     Path inPath = new Path(scheme + ecgDir);
> >     Path outPath = new Path(scheme + outputDir +
> >             CommonUtility.OUTPUT_LOCAL_CLUSTERING);
> >     FileInputFormat.addInputPath(jobConf, inPath);
> >     FileOutputFormat.setOutputPath(jobConf, outPath);
> >
> >     // clear output if it already exists
> >     CommonUtility.deleteHDFSFile(outPath.toString());
> >
> >     // set formats for input and output
> >     jobConf.setInputFormat(WholeFileInputFormat.class);
> >     jobConf.setOutputFormat(LocalClusterMSFOutputFormat.class);
> >
> >     // set classes of output key and value
> >     jobConf.setOutputKeyClass(Text.class);
> >     jobConf.setOutputValueClass(RRIntervalWritable.class);
> >
> >     // set mapper and reducer
> >     jobConf.setMapperClass(LocalClusteringMapper.class);
> >     jobConf.setReducerClass(IdentityReducer.class);
> >
> >     // run the job
> >     JobClient.runJob(jobConf);
> >     return 0;
> > }
> >
> > ...
> >
> > public class LocalClusteringMapper extends MapReduceBase implements
> >         Mapper<NullWritable, BytesWritable, Text, RRIntervalWritable>
> > {
> >     @Override
> >     public void map(NullWritable key, BytesWritable value,
> >             OutputCollector<Text, RRIntervalWritable> output,
> >             Reporter reporter) throws IOException
> >     {
> >         // read and cluster
> >         ...
> >
> >         // output
> >         Iterator<RRIntervalWritable> it = rrArray.iterator();
> >         while (it.hasNext())
> >         {
> >             RRIntervalWritable rr = it.next();
> >             Text outputKey = new Text(rr.clusterResult);
> >             output.collect(outputKey, rr);
> >         }
> >     }
> >
> > ...
> >
> > public class LocalClusterMSFOutputFormat extends
> >         MultipleSequenceFileOutputFormat<Text, RRIntervalWritable>
> > {
> >     protected String generateFileNameForKeyValue(Text key,
> >             RRIntervalWritable value, String name)
> >     {
> >         return value.clusterResult.toString();
> >     }
> > }
> > ---
> >
> > But this program always gets an IOException when running in a
> > pseudo-distributed cluster; the log has been attached at the end of
> > this post.
> >
> > There's something weird:
> > 1. If I use the SequenceFileOutputFormat instead of
> > MultipleSequenceFileOutputFormat, this program works fine (at least
> > there is no error in the log).
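
If the number of distinct cluster ids really is large, one way to follow
Harsh's advice without shrinking the input is to cap how many distinct file
names the output format can generate. A sketch (my code, not from the thread;
shown over plain Strings so it runs stand-alone, but the same logic would go
into generateFileNameForKeyValue of your MultipleSequenceFileOutputFormat
subclass):

```java
// Hash each cluster id into a fixed number of buckets so the job opens
// at most NUM_BUCKETS sequence-file writers instead of one per cluster.
public class BucketedFileName {
    static final int NUM_BUCKETS = 64; // assumed ceiling; tune to your DN limits

    static String fileNameFor(String clusterResult) {
        // floorMod keeps the bucket non-negative even for negative hash codes
        int bucket = Math.floorMod(clusterResult.hashCode(), NUM_BUCKETS);
        return "cluster-bucket-" + bucket;
    }

    public static void main(String[] args) {
        // Thousands of cluster ids collapse into at most NUM_BUCKETS names.
        java.util.Set<String> names = new java.util.HashSet<String>();
        for (int i = 0; i < 5000; i++) {
            names.add(fileNameFor("cluster-" + i));
        }
        System.out.println(names.size() <= NUM_BUCKETS); // prints true
    }
}
```

Records from different clusters then share a file, but each record still
carries its cluster id in the key, so the grouping is recoverable when the
sequence files are read back.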

Jay Vyas
MMSB/UCHC
Hien Luu 2012-09-17, 16:35
Jason Yang 2012-09-17, 17:00
Jason Yang 2012-09-17, 16:11