-
Reading multiple lines from a microsoft doc in hadoop
Siddharth Tiwari 2012-08-24, 05:52
Hi,

I have Word files in .doc and .docx format. They contain entries that are separated by an empty line. Is it possible to read one such entry (all the lines up to the next empty line) at a time? Also, which InputFormat should I use to read .doc and .docx files? Please help.

*------------------------*
Cheers !!!
Siddharth Tiwari
Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God."
"Maybe other people will try to limit me but I don't limit myself"
-
Re: Reading multiple lines from a microsoft doc in hadoop
Håvard Wahl Kongsgård 2012-08-24, 06:07
It's much easier if you convert the documents to text first. Use http://tika.apache.org/ or some other doc parser.

-Håvard

--
Håvard Wahl Kongsgård
Faculty of Medicine & Department of Mathematical Sciences
NTNU
http://havard.security-review.net/
-
Re: Reading multiple lines from a microsoft doc in hadoop
Biju Balakrishnan 2012-08-24, 06:09
Siddharth,

As far as I know, none of the input formats supports doc and docx (to be noted: as far as I know). You might need to write a custom input format to support doc[x] files. It is better to convert them to text files before processing them with Hadoop.

--
Biju
-
Re: Reading multiple lines from a microsoft doc in hadoop
Bejoy KS 2012-08-24, 06:09
Hi Siddharth

I believe doc and docx have custom formatting other than text. In that case you may have to build your own input format, and also your own record reader if you want to have the record delimiter as an empty line.

Regards
Bejoy KS

Sent from handheld, please excuse typos.
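
The record-reader idea above (an empty line acts as the record delimiter) can be sketched in plain Java without any Hadoop classes. This is only an illustration of the grouping logic, assuming the documents have already been converted to text; the class and method names are hypothetical, and a real Hadoop RecordReader would also have to handle input splits and byte offsets, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: groups text lines into
// paragraph records, using blank lines as the record delimiter.
final class ParagraphRecords {

    // Returns one record per run of consecutive non-blank lines.
    static List<String> fromText(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Accept both Unix (\n) and Windows (\r\n) line endings.
        for (String line : text.split("\\r?\\n", -1)) {
            if (line.trim().isEmpty()) {
                // A blank (or whitespace-only) line closes the record in progress.
                if (current.length() > 0) {
                    records.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append('\n');
                }
                current.append(line);
            }
        }
        // The last record may not be followed by a blank line.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "entry one, line 1\nentry one, line 2\n\nentry two\n";
        for (String record : fromText(text)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

In a custom record reader the same loop would presumably run inside the method that advances to the next record, emitting one accumulated paragraph per call instead of building a list.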
-
Re: Reading multiple lines from a microsoft doc in hadoop
Bertrand Dechoux 2012-08-24, 06:10
And that would help you with performance too.

Were you originally planning to have one file per Word document? What is the average size of your Word documents? It shouldn't be much. I am afraid your map startup time won't be negligible in that case.

Regards

Bertrand

--
Bertrand Dechoux
-
RE: Reading multiple lines from a microsoft doc in hadoop
Siddharth Tiwari 2012-08-24, 07:30
Hi,

Thank you for the suggestion. Actually I was using POI to extract the text, but since I now have so many documents I thought I would use Hadoop directly to do the parsing as well. The average size of each document is around 120 KB. Also, I want to read multiple lines from the text until I find a blank line. I do not have any idea about how to design a custom input format and record reader. Please help with some tutorial, code or resource on it. I am struggling with this issue and will be highly grateful. Thank you so much once again.
-
RE: Reading multiple lines from a microsoft doc in hadoop
Siddharth Tiwari 2012-08-24, 16:22
Hi Team,

Thanks a lot for so many good suggestions. I wrote a custom input format for reading one paragraph at a time, but when I use it I get single lines read. Can you please suggest what changes I must make to read one paragraph at a time, separated by empty lines? Below is the code I wrote:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
    import org.apache.hadoop.util.LineReader;

    /**
     * @author 460615
     */
    // FileInputFormat is the base class for all file-based InputFormats
    public class ParaInputFormat extends FileInputFormat<LongWritable,Text> {

        private String nullRegex = "^\\s*$" ;
        public String StrLine = null;

        /*public RecordReader<LongWritable, Text> getRecordReader (InputSplit genericSplit, JobConf job, Reporter reporter) throws IOException {
            reporter.setStatus(genericSplit.toString());
            return new ParaInputFormat(job, (FileSplit)genericSplit);
        }*/

        public RecordReader<LongWritable, Text> createRecordReader(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
            context.setStatus(genericSplit.toString());
            return new LineRecordReader();
        }

        public InputSplit[] getSplits(JobContext job, Configuration conf) throws IOException {
            ArrayList<FileSplit> splits = new ArrayList<FileSplit>();
            for (FileStatus status : listStatus(job)) {
                Path fileName = status.getPath();
                if (status.isDir()) {
                    throw new IOException("Not a file: " + fileName);
                }
                FileSystem fs = fileName.getFileSystem(conf);
                LineReader lr = null;
                try {
                    FSDataInputStream in = fs.open(fileName);
                    lr = new LineReader(in, conf);
                    // String regexMatch =in.readLine();
                    Text line = new Text();
                    long begin = 0;
                    long length = 0;
                    int num = -1;
                    String boolTest = null;
                    boolean match = false;
                    Pattern p = Pattern.compile(nullRegex);
                    // Matcher matcher = new p.matcher();
                    while ((boolTest = in.readLine()) != null && (num = lr.readLine(line)) > 0 && ! ( in.readLine().isEmpty())) {
                        // numLines++;
                        length += num;
                        splits.add(new FileSplit(fileName, begin, length, new String[]{}));
                    }
                    begin = length;
                } finally {
                    if (lr != null) {
                        lr.close();
                    }
                }
            }
            return splits.toArray(new FileSplit[splits.size()]);
        }
    }

> Date: Fri, 24 Aug 2012 09:54:10 +0200
> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
>
> Hi, maybe you should check out the old nutch project
> http://nutch.apache.org/ (hadoop was developed for nutch).
> It's a web crawler and indexer, but the mailing lists hold much info on
> doc/pdf parsing which also relates to hadoop.
>
> Have never parsed many docx or doc files, but it should be
> straightforward. But generally for text analysis preprocessing is the
> KEY! For example, replacing dual lines (\r\n\r\n or \n\n) with #### is a
> simple trick.
>
> -Håvard
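
The preprocessing trick quoted above (replace the blank-line separator with a marker such as ####) could look roughly like the following plain-Java sketch. It assumes the documents are already plain text; the class name, method names, and the choice to treat whitespace-only lines as blank are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical preprocessing helper: marks paragraph boundaries with "####".
final class ParagraphMarker {

    static final String MARKER = "####";

    // Replaces every run of blank-line separators with MARKER.
    static String markParagraphs(String text) {
        // Normalise Windows line endings so one pattern covers both cases.
        String unix = text.replace("\r\n", "\n");
        // A separator is a newline followed by one or more empty
        // (or space/tab-only) lines.
        return unix.replaceAll("\\n([ \\t]*\\n)+", MARKER);
    }

    // Splits marked text back into entries, one per paragraph.
    static String[] entries(String marked) {
        return marked.split(Pattern.quote(MARKER));
    }

    public static void main(String[] args) {
        String marked = markParagraphs("first line\nsecond line\n\nnext entry");
        for (String entry : entries(marked)) {
            System.out.println("[" + entry + "]");
        }
    }
}
```

The marker only works if #### never occurs in the documents themselves; a different, rarer marker may be the safer choice for real data.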
-
RE: Reading multiple lines from a microsoft doc in hadoop
Siddharth Tiwari 2012-08-24, 20:23
Hi, can anyone please help? Thank you in advance.
-
RE: Reading multiple lines from a microsoft doc in hadoop
Siddharth Tiwari 2012-08-25, 05:35
Any help on the below would be really appreciated. I am stuck with it.
-
RE: Reading multiple lines from a microsoft doc in hadoop
Siddharth Tiwari 2012-08-25, 12:07
Can anybody enlighten me on what could be wrong?
-
Re: Reading multiple lines from a microsoft doc in hadoop
Harsh J 2012-08-25, 18:17
Hi Siddharth,

First of all, please understand the medium: mailing lists aren't immediate or interactive help mediums, so please be patient with those who help you on their own time. Secondly, take a read of http://www.catb.org/~esr/faqs/smart-questions.html to understand why certain etiquette is beneficial to both ends.

Your requirement here seems to be that you want to read all text in a file, in records separated by two newlines. Depending on the version of Hadoop you use, I think you can probably set "textinputformat.record.delimiter" to "\n\n" or "\r\n\r\n" to have this working with the TextInputFormat itself.

--
Harsh J
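
To see what record boundaries the two delimiter values mentioned above would produce, here is a small plain-Java sketch that splits text on a literal delimiter. It does not use Hadoop and is not how TextInputFormat is implemented; the class and method names are hypothetical. In an actual job the property would presumably be set on the job's Configuration, as noted in the comment, and whether it is honoured depends on the Hadoop version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of delimiter-based records; no Hadoop involved.
// In a real job the setting would presumably be applied with something like
//   conf.set("textinputformat.record.delimiter", "\n\n");
// (assumption: the Hadoop version in use supports this property).
final class DelimitedRecords {

    // Splits text into records on every occurrence of the literal delimiter.
    static List<String> split(String text, String delimiter) {
        List<String> records = new ArrayList<>();
        int start = 0;
        int at;
        while ((at = text.indexOf(delimiter, start)) >= 0) {
            records.add(text.substring(start, at));
            start = at + delimiter.length();
        }
        // Keep whatever follows the last delimiter as the final record.
        if (start < text.length()) {
            records.add(text.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        String unixText = "entry one\nstill entry one\n\nentry two";
        System.out.println(split(unixText, "\n\n").size() + " records");
    }
}
```

The delimiter has to match the line endings actually present in the files: text with Windows line endings contains no "\n\n" sequence, so it would come back as a single record unless "\r\n\r\n" is used.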