Re: Reading multiple lines from a microsoft doc in hadoop
Mohammad Tariq 2012-08-24, 07:40
Hello Siddharth,
You can tweak "NLineInputFormat" to suit your requirement and use it. Unlike "TextInputFormat", it lets us read a specified number of lines at a time. Here is a good post by Boris and Michael on custom record readers. Also, I would suggest combining similar files into one bigger file if feasible, as your files are very small.

Regards,
Mohammad Tariq

On Fri, Aug 24, 2012 at 1:00 PM, Siddharth Tiwari <[EMAIL PROTECTED]> wrote:
> Hi,
> Thank you for the suggestion. Actually I was using POI to extract text,
> but since I now have so many documents I thought I would use Hadoop
> directly to parse them as well. The average size of each document is
> around 120 KB. Also, I want to read multiple lines from the text until I
> find a blank line. I do not have any idea about how to design a custom
> input format and record reader. Please help with some tutorial, code or
> resource around it. I am struggling with the issue. I will be highly
> grateful. Thank you so much once again.
>
> > Date: Fri, 24 Aug 2012 08:07:39 +0200
> > Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> > From: [EMAIL PROTECTED]
> > To: [EMAIL PROTECTED]
> >
> > It's much easier if you convert the documents to text first
> >
> > use
> > http://tika.apache.org/
> >
> > or some other doc parser
> >
> > -Håvard
> >
> > On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> > <[EMAIL PROTECTED]> wrote:
> > > hi,
> > > I have doc files in MS Word doc and docx format. These have entries
> > > which are separated by an empty line. Is it possible for me to read
> > > these entries, separated by empty lines, one at a time? Also, which
> > > InputFormat shall I use to read doc and docx? Please help.
> > >
> > > *------------------------*
> > > Cheers !!!
> > > Siddharth Tiwari
> > > Have a refreshing day !!!
> > > "Every duty is holy, and devotion to duty is the highest form of
> > > worship of God."
> > > "Maybe other people will try to limit me but I don't limit myself"
> >
> > --
> > Håvard Wahl Kongsgård
> > Faculty of Medicine &
> > Department of Mathematical Sciences
> > NTNU
> >
> > http://havard.security-review.net/
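
For anyone landing on this thread later: the "read lines until a blank line" part of the question can be sketched in plain Java, with no Hadoop dependency. This is only an illustration under stated assumptions. The class name BlankLineRecords and the method group are hypothetical, not part of Hadoop, and the input is assumed to be text already extracted from the doc/docx files (for example with Tika or POI, as suggested in the thread). In a real job the same loop would presumably live inside a custom RecordReader that pulls lines from Hadoop's line reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Hadoop class: groups lines into records
// that are separated by one or more blank lines.
class BlankLineRecords {

    // Returns one string per record; lines inside a record are joined with '\n'.
    static List<String> group(Iterable<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // A blank line closes the current record, if there is one.
                if (current.length() > 0) {
                    records.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append('\n');
                }
                current.append(line);
            }
        }
        // Emit the last record when the input does not end with a blank line.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for text extracted from a document.
        List<String> lines = Arrays.asList(
                "entry one, line 1", "entry one, line 2", "", "", "entry two");
        for (String record : group(lines)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

Consecutive blank lines are treated as a single separator, so empty records are never emitted. Whether that is the right behaviour depends on how the entries in the documents are actually laid out.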