Huazhong Ning 2009-12-28, 18:27
Hi all,
I need your help with multiple file output. I have many big files, and I would like the processing result of each file to be written to a separate file. I know that in the old Hadoop API the class MultipleOutputFormat serves this purpose, but I cannot find an equivalent class in the new API. Does anybody know how to solve this problem with the new API? Thanks a lot.
Ning, Huazhong
-
Re: Multiple file output
Farhan Husain 2010-01-05, 00:13
Hello,
You can extend FileOutputFormat class. Here is an example:
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

public class MultipleTextOutputFormatByPredicates<K, V> extends FileOutputFormat<K, V> {

    protected static class MultipleOutputByPredicatesLineRecordWriter<K, V>
            extends RecordWriter<K, V> {

        private static final String utf8 = "UTF-8";
        private static final byte[] newline;
        static {
            try {
                newline = "\n".getBytes(utf8);
            } catch (UnsupportedEncodingException uee) {
                throw new IllegalArgumentException("can't find " + utf8 + " encoding");
            }
        }

        protected TaskAttemptContext job;
        // NOTE: the codec is stored, but the streams opened in write() are not wrapped with it.
        protected CompressionCodec codec;
        protected String extension = "";
        // One open stream per output file, keyed by the sanitized record key.
        protected Map<String, DataOutputStream> outMap;
        private final byte[] keyValueSeparator;

        public MultipleOutputByPredicatesLineRecordWriter(CompressionCodec codec,
                String keyValueSeparator, TaskAttemptContext job) {
            this.job = job;
            this.codec = codec;
            if (null != codec)
                this.extension = codec.getDefaultExtension();
            try {
                this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
            } catch (UnsupportedEncodingException uee) {
                throw new IllegalArgumentException("can't find " + utf8 + " encoding");
            }
            outMap = new HashMap<String, DataOutputStream>();
        }

        public MultipleOutputByPredicatesLineRecordWriter(CompressionCodec codec,
                TaskAttemptContext job) {
            this(codec, "\t", job);
        }

        /**
         * Write the object to the byte stream, handling Text as a special case.
         *
         * @param o
         *            the object to print
         * @param out
         *            the stream to write to
         * @throws IOException
         *             if the write throws, we pass it on
         */
        private void writeObject(Object o, DataOutputStream out) throws IOException {
            if (o instanceof Text) {
                Text to = (Text) o;
                out.write(to.getBytes(), 0, to.getLength());
            } else {
                out.write(o.toString().getBytes(utf8));
            }
        }

        @Override
        public synchronized void write(K key, V value) throws IOException {
            boolean nullKey = key == null || key instanceof NullWritable;
            boolean nullValue = value == null || value instanceof NullWritable;
            if (nullKey || nullValue) {
                return;
            }
            // The key picks the output file; ':' is replaced so the name is a valid path.
            String sPredicate = key.toString().replace(':', '_');
            DataOutputStream out = outMap.get(sPredicate);
            if (null == out) {
                // Open the file lazily, directly under the job output directory.
                Path file = new Path(job.getConfiguration().get("mapred.output.dir"), sPredicate);
                FileSystem fs = file.getFileSystem(job.getConfiguration());
                // overwrite = false: creation fails if the file already exists.
                FSDataOutputStream fileOut = fs.create(file, false);
                outMap.put(sPredicate, fileOut);
                out = fileOut;
            }
            // The key itself is not written; each line is the separator followed by the value.
            out.write(keyValueSeparator);
            writeObject(value, out);
            out.write(newline);
        }

        @Override
        public synchronized void close(TaskAttemptContext context) throws IOException {
            for (DataOutputStream out : outMap.values()) {
                out.close();
            }
        }
    }

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        Configuration conf = job.getConfiguration();
        boolean isCompressed = getCompressOutput(job);
        String keyValueSeparator = conf.get("mapred.textoutputformat.separator", "\t");
        CompressionCodec codec = null;
        if (isCompressed) {
            Class<? extends CompressionCodec> codecClass =
                    getOutputCompressorClass(job, GzipCodec.class);
            codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        }
        return new MultipleOutputByPredicatesLineRecordWriter<K, V>(codec, keyValueSeparator, job);
    }
}

Thanks,
Farhan
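The core idea in the record writer above is a map from key to a lazily opened output stream. As a rough illustration only, here is a minimal sketch of that pattern using plain java.nio instead of Hadoop. The class name PerKeyFileWriter and its methods are hypothetical and are not part of any Hadoop API; the sketch assumes a local directory, string keys and values, and it writes only the value on each line.

```java
import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not a Hadoop class): routes each record to a file named after its key. */
public class PerKeyFileWriter implements Closeable {
    private final Path outputDir;
    // One open writer per output file, created the first time a key is seen.
    private final Map<String, BufferedWriter> writers = new HashMap<>();

    public PerKeyFileWriter(Path outputDir) throws IOException {
        this.outputDir = Files.createDirectories(outputDir);
    }

    public synchronized void write(String key, String value) throws IOException {
        if (key == null || value == null) {
            return; // skip incomplete records, as the writer above does
        }
        String fileName = key.replace(':', '_'); // same sanitizing rule as above
        BufferedWriter out = writers.get(fileName);
        if (out == null) {
            // CREATE_NEW fails if the file already exists, like fs.create(file, false).
            out = Files.newBufferedWriter(outputDir.resolve(fileName), StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            writers.put(fileName, out);
        }
        out.write(value);
        out.write('\n');
    }

    @Override
    public synchronized void close() throws IOException {
        IOException first = null;
        for (BufferedWriter w : writers.values()) {
            try {
                w.close();
            } catch (IOException e) {
                if (first == null) first = e; // remember the first failure, keep closing the rest
            }
        }
        writers.clear();
        if (first != null) throw first;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("per-key-demo");
        try (PerKeyFileWriter writer = new PerKeyFileWriter(dir)) {
            writer.write("rdf:type", "alice");
            writer.write("rdf:label", "bob");
            writer.write("rdf:type", "carol");
        }
        System.out.println(Files.readAllLines(dir.resolve("rdf_type"))); // prints [alice, carol]
    }
}
```

Unlike this local sketch, a real job runs many task attempts at once, so writing straight into the job output directory under a name derived only from the key can collide when two tasks see the same key.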
-
Re: Multiple file output
松柳 2010-01-05, 12:26
I'm afraid you have to write it yourself, since there is no equivalent class in the new API.
-
Re: Multiple file output
Amareshwari Sri Ramadasu 2010-01-06, 04:59
In branch 0.21, you can get the functionality of both org.apache.hadoop.mapred.lib.MultipleOutputs and org.apache.hadoop.mapred.lib.MultipleOutputFormat in org.apache.hadoop.mapreduce.lib.output.MultipleOutputs. Please see MAPREDUCE-370 for more details.
Thanks Amareshwari
-
Re: Multiple file output
Vijay 2010-01-06, 06:22
org.apache.hadoop.mapreduce.lib.output.MultipleOutputs is not part of the released version of 0.20.1, right? Is it expected to be part of 0.20.2 or later?
-
Re: Multiple file output
Amareshwari Sri Ramadasu 2010-01-06, 08:20
No. It is available from branch 0.21 onwards. For 0.20.x, people can use only the old API, even though JobConf is deprecated.
-Amareshwari.
-
Re: Multiple file output
Aaron Kimball 2010-01-07, 19:11
Note that org.apache.hadoop.mapreduce.lib.output.MultipleOutputs is scheduled for the next CDH 0.20 release -- ready "soon."
- Aaron
-
Re: Multiple file output
Vijay 2010-01-07, 23:13
That's great news! Thanks guys!