Hi, maybe you should check out the old Nutch project,
http://nutch.apache.org/ (Hadoop was originally developed for Nutch).
It's a web crawler and indexer, but the mailing lists hold a lot of info
on doc/pdf parsing, which also relates to Hadoop.
I have never parsed many docx or doc files, but it should be
straightforward. Generally, though, for text analysis, preprocessing is the
KEY! For example, replacing double line breaks (\r\n\r\n or \n\n) with #### is a
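
A minimal sketch of that kind of preprocessing, assuming the doc/docx files have already been converted to plain text (for example with POI, as in the quoted message below). It splits the text into entries wherever one or more blank lines occur, which is the same idea as inserting a marker such as #### and splitting on it. The class and method names are made up for illustration and are not part of Hadoop or POI.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Hadoop or POI class.
final class BlankLineRecords {

    // Splits already-extracted plain text into entries separated by blank lines.
    static List<String> split(String text) {
        // Normalize Windows (\r\n) and bare \r line endings to \n first.
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n');

        List<String> records = new ArrayList<>();
        // A separator is a newline followed by one or more lines that
        // contain only spaces or tabs (i.e. blank lines).
        for (String chunk : normalized.split("\\n(?:[ \\t]*\\n)+")) {
            String record = chunk.trim();
            if (!record.isEmpty()) {
                records.add(record); // lines inside an entry stay joined by \n
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String sample = "entry one, line 1\r\nentry one, line 2\r\n\r\nentry two\n";
        for (String record : split(sample)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

Inside a Hadoop job the same splitting could be done in the mapper once each file has been turned into text. Some Hadoop releases also let TextInputFormat use a custom record delimiter through the textinputformat.record.delimiter property; whether that is available depends on the version in use, so treat it as something to check rather than a given.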
On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
<[EMAIL PROTECTED]> wrote:
> Thank you for the suggestion. Actually I was using POI to extract text, but
> since I now have so many documents I thought I would use Hadoop directly
> to parse them as well. The average size of each document is around 120 KB. Also, I
> want to read multiple lines from the text until I find a blank line. I do
> not have any idea ankit how to design a custom input format and record reader.
> Please help with a tutorial, code, or resource on this. I am
> struggling with the issue. I will be highly grateful. Thank you so much once
>> Date: Fri, 24 Aug 2012 08:07:39 +0200
>> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> From: [EMAIL PROTECTED]
>> To: [EMAIL PROTECTED]
>> It's much easier if you convert the documents to text first
>> or some other doc parser
>> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
>> <[EMAIL PROTECTED]> wrote:
>> > hi,
>> > I have doc files in msword doc and docx format. These have entries which
>> > are separated by an empty line. Is it possible for me to read
>> > these entries, separated by empty lines, one at a time? Also, which
>> > InputFormat shall I use to read doc and docx files? Please help.
>> > *------------------------*
>> > Cheers !!!
>> > Siddharth Tiwari
>> > Have a refreshing day !!!
>> > "Every duty is holy, and devotion to duty is the highest form of worship
>> > of God."
>> > "Maybe other people will try to limit me but I don't limit myself"
>> Håvard Wahl Kongsgård
>> Faculty of Medicine &
>> Department of Mathematical Sciences
Håvard Wahl Kongsgård
Faculty of Medicine &
Department of Mathematical Sciences