-
Sorting text data
sangroya 2012-01-30, 15:11

Hello,

I have a large text file (about 1 GB) that I want to sort. So far, the only Hadoop examples I know of take a sequence file as input to the sort program.

Does anyone know of an implementation that uses text data as input?

Thanks,
Amit

-----
Sangroya
--
View this message in context: http://lucene.472066.n3.nabble.com/Sorting-text-data-tp3700231p3700231.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
-
Re: Sorting text data
John Conwell 2012-01-30, 16:40

If you use TextInputFormat as your MapReduce job's input format, Hadoop doesn't need your input data to be in a sequence file. It will read your text file and call the mapper once for each line (\n delimited), where the key is the byte offset of that line from the beginning of the file and the value is the text of that line.

In the mapper, if you set the output key to the mapper's input value (the text you want sorted), then Hadoop will automatically sort the text as it works out which mapper output key/value pairs to send to which reducers. You can then write the reducer input straight to the reducer output without any data manipulation. Make sure your job's output format is set to TextOutputFormat.

--
Thanks,
John C
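
The flow described above can be modelled in plain Java without any Hadoop dependency. This is only a sketch of the idea, not Hadoop code: the class `TextSortSketch` and its methods are hypothetical names, a `TreeMap` stands in for the shuffle, and `String` ordering stands in for the byte-wise ordering of Hadoop's `Text` (the two agree for ASCII input).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TextSortSketch {

    // Stand-in for a line-oriented input format: one record per line,
    // keyed by the byte offset at which the line starts.
    static TreeMap<Long, String> readRecords(String fileContents) {
        TreeMap<Long, String> records = new TreeMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            records.put(offset, line);
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return records;
    }

    static List<String> sortLines(String fileContents) {
        // Map: emit the line as the key; the offset is kept only as a
        // placeholder value. Shuffle: the TreeMap groups and orders the keys.
        TreeMap<String, List<Long>> shuffled = new TreeMap<>();
        for (Map.Entry<Long, String> record : readRecords(fileContents).entrySet()) {
            shuffled.computeIfAbsent(record.getValue(), k -> new ArrayList<>())
                    .add(record.getKey());
        }
        // Reduce: pass-through, writing the key once per grouped value so
        // that duplicate lines survive.
        List<String> sorted = new ArrayList<>();
        for (Map.Entry<String, List<Long>> group : shuffled.entrySet()) {
            for (int i = 0; i < group.getValue().size(); i++) {
                sorted.add(group.getKey());
            }
        }
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortLines("pear\napple\nfig\napple"));
    }
}
```

Note that the pass-through reducer has to write the key once per value; writing it once per group would silently drop repeated lines.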
-
Re: Sorting text data
sangroya 2012-02-08, 13:59

Hi,

I tried to run the sort example by specifying the input format, but I got the following error while running it:

    bin/hadoop jar hadoop-0.20.2-examples.jar sort -inFormat org.apache.hadoop.mapred.TextInputFormat /user/sangroya/test1 outtest16

    Running on 1 nodes to sort from hdfs://localhost:54310/user/sangroya/test1 into hdfs://localhost:54310/user/sangroya/outtest16 with 1 reduces.
    Job started: Wed Feb 08 14:53:14 CET 2012
    12/02/08 14:53:14 INFO mapred.FileInputFormat: Total input paths to process : 1
    12/02/08 14:53:14 INFO mapred.JobClient: Running job: job_201202021340_0030
    12/02/08 14:53:15 INFO mapred.JobClient: map 0% reduce 0%
    12/02/08 14:53:27 INFO mapred.JobClient: Task Id : attempt_201202021340_0030_m_000000_0, Status : FAILED
    java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.BytesWritable, recieved org.apache.hadoop.io.LongWritable
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:845)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
        at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:40)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Can you please suggest what the issue is? I also tried the following, specifying everything:

    bin/hadoop jar hadoop-0.20.2-examples.jar sort -inFormat org.apache.hadoop.mapred.TextInputFormat -outFormat org.apache.hadoop.mapred.TextOutputFormat -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text /user/sangroya/test1/ outtest11

But there still seems to be a type mismatch:

    Running on 1 nodes to sort from hdfs://localhost:54310/user/sangroya/test1 into hdfs://localhost:54310/user/sangroya/outtest88 with 1 reduces.
    Job started: Wed Feb 08 14:57:19 CET 2012
    12/02/08 14:57:19 INFO mapred.FileInputFormat: Total input paths to process : 1
    12/02/08 14:57:19 INFO mapred.JobClient: Running job: job_201202021340_0031
    12/02/08 14:57:20 INFO mapred.JobClient: map 0% reduce 0%
    12/02/08 14:57:33 INFO mapred.JobClient: Task Id : attempt_201202021340_0031_m_000000_0, Status : FAILED
    java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:845)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
        at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:40)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

My input data is a text file. Please help me out!

Thanks,
Amit
-
Re: Sorting text data
bejoy.hadoop@... 2012-02-08, 15:33

Hi Sangroya,

Your map method is emitting key/value pairs whose types differ from the types specified in your driver class. TextInputFormat produces LongWritable keys and Text values, and I believe that is what causes the error. As per the code, the key expected from the mapper is a BytesWritable, but since you have specified TextInputFormat, the mapper is emitting LongWritable keys.

Using the correct InputFormat should resolve your issue. Since you are using an IdentityMapper, you could also try specifying the map output key and value types along with the InputFormat.

    -inFormat org.apache.hadoop.mapred.TextInputFormat

    java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.BytesWritable, recieved org.apache.hadoop.io.LongWritable
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:845)

Regards,
Bejoy K S

From handheld, please excuse typos.
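
The failure in both runs can be pictured as a class check on every key the mapper emits. The following is a simplified, self-contained model of that check, not the actual MapTask source: `KeyTypeCheckSketch`, `collect`, and `tryCollect` are hypothetical names, and `Long` and `String` stand in for LongWritable and Text. The message text, including the spelling "recieved", mirrors the log quoted in this thread.

```java
import java.io.IOException;

class KeyTypeCheckSketch {

    // Rejects any key whose runtime class differs from the class the job
    // was configured to expect for map output keys.
    static void collect(Object key, Class<?> expectedKeyClass) throws IOException {
        if (key.getClass() != expectedKeyClass) {
            throw new IOException("Type mismatch in key from map: expected "
                    + expectedKeyClass.getName() + ", recieved "
                    + key.getClass().getName());
        }
    }

    // Convenience wrapper: returns "ok" or the error message.
    static String tryCollect(Object key, Class<?> expectedKeyClass) {
        try {
            collect(key, expectedKeyClass);
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        long offset = 0L;            // what an identity mapper forwards as the key
        String line = "some text";   // what a line-as-key mapper would forward

        System.out.println(tryCollect(offset, String.class)); // mismatch message
        System.out.println(tryCollect(line, String.class));   // ok
    }
}
```

In this model, changing only the expected class (as `-outKey` does in the second run) changes the wording of the error but not the outcome, because the identity mapper still forwards the offset as the key.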
-
Re: Sorting text data
Owen O'Malley 2012-02-08, 17:05
On Wed, Feb 8, 2012 at 5:59 AM, sangroya <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I tried to run the sort example by specifying the input format. But I got
> the following error, while running it.

You actually need a different mapper to make the whole thing work. I made a patch for Sort.java that should do the trick.

https://gist.github.com/1770850

Just run the sort with -text and it will set the input format, output format, key type, and value type, and also set the mapper that I added so that you move the line to the key instead of the value.

-- Owen
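
Why a different mapper is needed can be illustrated with a small plain-Java comparison. This is a hypothetical sketch, not the contents of the patch: `MapperChoiceSketch` and its method names are invented for illustration, and offsets are computed assuming one byte per character.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

class MapperChoiceSketch {

    // Identity mapper: the key stays the line's starting offset, so
    // sorting by key simply reproduces the input order.
    static List<String> sortWithOffsetAsKey(List<String> lines) {
        TreeMap<Long, String> byOffset = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            byOffset.put(offset, line);
            offset += line.length() + 1; // assumes one byte per character, +1 for '\n'
        }
        return new ArrayList<>(byOffset.values());
    }

    // Line-as-key mapper: the line itself becomes the key, so sorting by
    // key orders the text.
    static List<String> sortWithLineAsKey(List<String> lines) {
        List<String> keys = new ArrayList<>(lines);
        Collections.sort(keys);
        return keys;
    }

    public static void main(String[] args) {
        List<String> input = List.of("pear", "apple", "fig");
        System.out.println(sortWithOffsetAsKey(input));
        System.out.println(sortWithLineAsKey(input));
    }
}
```

Even with every type configured consistently, keeping the offset as the key would only sort the lines by position, which is the order they already have.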