You can tweak "NLineInputFormat" to fit your requirement and use
it. It hands each mapper a specified number of lines per split,
unlike "TextInputFormat". Here is a good post by Boris and Michael on
custom record readers. I would also suggest
combining similar files into one bigger file if feasible, as your
files are very small.
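A minimal sketch of the blank-line grouping such a record reader would need, in plain Java with no Hadoop dependencies, assuming the documents have already been converted to plain text. The class and method names (BlankLineRecords, nextRecord, splitRecords) are hypothetical and only for illustration; in a real custom RecordReader the loop in nextRecord would roughly sit inside nextKeyValue() and read from the split's input stream instead of a BufferedReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: groups lines into records
// that are separated by one or more blank lines.
public class BlankLineRecords {

    // Returns the next record (its lines joined with '\n'),
    // or null once the input is exhausted.
    public static String nextRecord(BufferedReader in) throws IOException {
        StringBuilder record = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                if (record.length() > 0) {
                    return record.toString(); // a blank line closes the record
                }
                continue; // skip leading or repeated blank lines
            }
            if (record.length() > 0) {
                record.append('\n');
            }
            record.append(line);
        }
        // End of input: emit the last record if it was not blank-terminated.
        return record.length() > 0 ? record.toString() : null;
    }

    // Convenience wrapper: splits a whole text into records.
    public static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String record;
            while ((record = nextRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "entry one, line 1\nentry one, line 2\n\n\nentry two\n";
        for (String record : splitRecords(text)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

Note that this sketch ignores split boundaries. With files of around 120 KB each, one option is to mark the files as non-splittable so that a record is never cut in half between two mappers.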
On Fri, Aug 24, 2012 at 1:00 PM, Siddharth Tiwari <[EMAIL PROTECTED]> wrote:
> Thank you for the suggestion. Actually I was using POI to extract text,
> but since I now have so many documents I thought I would use Hadoop
> directly to parse them as well. The average size of each document is around 120 KB.
> Also, I want to read multiple lines from the text until I find a blank line.
> I do not have any idea ankit how to design a custom input format and record
> reader. Please help with some tutorial, code or resource around
> it. I am struggling with the issue. I will be highly grateful. Thank you so
> much once again.
> > Date: Fri, 24 Aug 2012 08:07:39 +0200
> > Subject: Re: Reading multiple lines from a microsoft doc in hadoop
> > From: [EMAIL PROTECTED]
> > To: [EMAIL PROTECTED]
> > It's much easier if you convert the documents to text first. Use
> > http://tika.apache.org/
> > or some other doc parser.
> > -Håvard
> > On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> > <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > > I have files in MS Word doc and docx format. These have entries which are
> > > separated by an empty line. Is it possible for me to read these
> > > entries, separated by empty lines, one at a time? Also, which
> > > shall I use to read doc and docx? Please help.
> > >
> > > *------------------------*
> > > Cheers !!!
> > > Siddharth Tiwari
> > > Have a refreshing day !!!
> > > "Every duty is holy, and devotion to duty is the highest form of
> > > worship of God."
> > > "Maybe other people will try to limit me but I don't limit myself"
> > --
> > Håvard Wahl Kongsgård
> > Faculty of Medicine &
> > Department of Mathematical Sciences
> > NTNU
> > http://havard.security-review.net/