MapReduce user mailing list: IOException when using MultipleSequenceFileOutputFormat


Jason Yang 2012-09-17, 13:50
Harsh J 2012-09-18, 02:38
Jason Yang 2012-09-18, 03:44
Harsh J 2012-09-18, 03:59
Jason Yang 2012-09-18, 06:07

Re: IOException when using MultipleSequenceFileOutputFormat
>> If I use SequenceFileOutputFormat instead of
MultipleSequenceFileOutputFormat, the program works fine (at least
there is no error in the log). <<

I might suggest another alternative fix: maybe your ulimit is too low in
your pseudo-distributed OS? The fact that you are using a clustering output
means you will have some funny files, maybe a lot of very small ones, and
possibly lots of them, more than you would normally distribute to a single
node, as Harsh suggests.
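
For context on why both the ulimit and the DN-load explanations come back to
file counts: MultipleOutputFormat caches one RecordWriter per distinct file
name returned by generateFileNameForKeyValue(), and every cached writer stays
open until the task closes. The sketch below is a simplified illustration of
that caching behaviour, not the actual Hadoop source; the class and method
shapes are made up for the example.

---
// Simplified illustration (not the real Hadoop code): one writer is cached
// per distinct generated file name, so thousands of distinct cluster ids
// mean thousands of concurrently open HDFS output streams.
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

abstract class CachingMultiWriterSketch<K, V> {

    interface Writer<K, V> {
        void write(K key, V value) throws IOException;
        void close() throws IOException;
    }

    // one open writer per distinct generated file name
    private final Map<String, Writer<K, V>> writers =
            new TreeMap<String, Writer<K, V>>();

    protected abstract String generateFileNameForKeyValue(K key, V value);

    protected abstract Writer<K, V> openWriter(String name) throws IOException;

    public void write(K key, V value) throws IOException {
        String name = generateFileNameForKeyValue(key, value);
        Writer<K, V> w = writers.get(name);
        if (w == null) {           // first record for this file name
            w = openWriter(name);  // opens a new HDFS output stream...
            writers.put(name, w);  // ...which stays open until close()
        }
        w.write(key, value);
    }

    public void close() throws IOException {
        for (Writer<K, V> w : writers.values()) {
            w.close();             // only now are the streams released
        }
    }
}
---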

On Mon, Sep 17, 2012 at 10:38 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> Hi Jason,
>
> How many unique keys are you going to be generating from this program,
> roughly?
>
> By default, the max-load of a DN is about 4k threads, and if you're
> trying to push beyond that value then the NN will no longer select the
> DN, as it would consider it already overloaded. In a fully distributed
> mode, you may not see this issue as there are several DNs and TTs to
> distribute the write load across.
>
> Try with a smaller input sample if there are a whole lot of keys you'll
> be creating files for, and see if that works instead (such that
> there are fewer files and you do not hit the xceiver/load limits).
>
> On Mon, Sep 17, 2012 at 7:20 PM, Jason Yang <[EMAIL PROTECTED]> wrote:
> > Hi, all
> >
> > I have written a simple MR program which partitions a file into multiple
> > files based on the clustering result of the points in this file; here is
> > my code:
> > ---
> > private int run() throws IOException
> > {
> >     String scheme = getConf().get(CommonUtility.ATTR_SCHEME);
> >     String ecgDir = getConf().get(CommonUtility.ATTR_ECG_DATA_DIR);
> >     String outputDir = getConf().get(CommonUtility.ATTR_OUTPUT_DIR);
> >
> >     // create JobConf
> >     JobConf jobConf = new JobConf(getConf(), this.getClass());
> >
> >     // set path for input and output
> >     Path inPath = new Path(scheme + ecgDir);
> >     Path outPath = new Path(scheme + outputDir +
> >             CommonUtility.OUTPUT_LOCAL_CLUSTERING);
> >     FileInputFormat.addInputPath(jobConf, inPath);
> >     FileOutputFormat.setOutputPath(jobConf, outPath);
> >
> >     // clear output if it already existed
> >     CommonUtility.deleteHDFSFile(outPath.toString());
> >
> >     // set format for input and output
> >     jobConf.setInputFormat(WholeFileInputFormat.class);
> >     jobConf.setOutputFormat(LocalClusterMSFOutputFormat.class);
> >
> >     // set class of output key and value
> >     jobConf.setOutputKeyClass(Text.class);
> >     jobConf.setOutputValueClass(RRIntervalWritable.class);
> >
> >     // set mapper and reducer
> >     jobConf.setMapperClass(LocalClusteringMapper.class);
> >     jobConf.setReducerClass(IdentityReducer.class);
> >
> >     // run the job
> >     JobClient.runJob(jobConf);
> >     return 0;
> > }
> >
> > ...
> >
> > public class LocalClusteringMapper extends MapReduceBase implements
> >         Mapper<NullWritable, BytesWritable, Text, RRIntervalWritable>
> > {
> >     @Override
> >     public void map(NullWritable key, BytesWritable value,
> >             OutputCollector<Text, RRIntervalWritable> output,
> >             Reporter reporter) throws IOException
> >     {
> >         // read and cluster
> >         ...
> >
> >         // output
> >         Iterator<RRIntervalWritable> it = rrArray.iterator();
> >         while (it.hasNext())
> >         {
> >             RRIntervalWritable rr = it.next();
> >
> >             Text outputKey = new Text(rr.clusterResult);
> >
> >             output.collect(outputKey, rr);
> >         }
> >     }
> >
> > ...
> >
> > public class LocalClusterMSFOutputFormat extends
> >         MultipleSequenceFileOutputFormat<Text, RRIntervalWritable>
> > {
> >     protected String generateFileNameForKeyValue(Text key,
> >             RRIntervalWritable value, String name)
> >     {
> >         return value.clusterResult.toString();
> >     }
> > }
> > ---
> >
> > But this program always gets an IOException when running in a
> > pseudo-distributed cluster; the log has been attached at the end of
> > this post.
> >
> > There's something weird:
> > 1. If I use SequenceFileOutputFormat instead of
> > MultipleSequenceFileOutputFormat, the program works fine (at least
> > there is no error in the log).

Jay Vyas
MMSB/UCHC
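
The xceiver/load limit Harsh mentions is, on 1.x-era HDFS, a DataNode-side
cap commonly raised via dfs.datanode.max.xcievers (the misspelling in the
property name is historical) in hdfs-site.xml, followed by a DataNode
restart. A sketch, with an illustrative value rather than a recommendation:

---
<!-- hdfs-site.xml on each DataNode (Hadoop 1.x era). Raising this cap
     trades DataNode threads/memory for more concurrent block streams. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>8192</value>
</property>
---
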
Hien Luu 2012-09-17, 16:35
Jason Yang 2012-09-17, 17:00
Jason Yang 2012-09-17, 16:11
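
If the number of clusters is inherently large, one way to act on Harsh's
"fewer files" advice without shrinking the input is to cap the number of
generated files by bucketing cluster ids, at the cost of mixing several
clusters per file (the cluster id stays in the output key, so the files
remain separable downstream). A hypothetical sketch; BucketedMSFOutputFormat
and NUM_BUCKETS are illustrative names, not from the original code:

---
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleSequenceFileOutputFormat;

// Hypothetical variant of LocalClusterMSFOutputFormat: hash each cluster id
// into a fixed number of buckets, bounding the number of simultaneously
// open output files no matter how many clusters the data produces.
public class BucketedMSFOutputFormat
        extends MultipleSequenceFileOutputFormat<Text, RRIntervalWritable> {

    private static final int NUM_BUCKETS = 64; // illustrative cap

    @Override
    protected String generateFileNameForKeyValue(Text key,
            RRIntervalWritable value, String name) {
        // non-negative hash of the cluster id, mapped onto a bucket
        int bucket = (value.clusterResult.toString().hashCode()
                & Integer.MAX_VALUE) % NUM_BUCKETS;
        return "cluster-bucket-" + bucket;
    }
}
---

Wiring it in would just be jobConf.setOutputFormat(BucketedMSFOutputFormat.class)
in the driver shown above.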