Hadoop, mail # user - Re: Reading multiple lines from a microsoft doc in hadoop


Re: Reading multiple lines from a microsoft doc in hadoop
Håvard Wahl Kongsgård 2012-08-24, 07:54
Hi, maybe you should check out the old Nutch project
http://nutch.apache.org/ (Hadoop was developed for Nutch).
It's a web crawler and indexer, but the mailing lists hold a lot of info on
doc/pdf parsing, which also relates to Hadoop.

I have never parsed many docx or doc files, but it should be
straightforward. Generally, though, for text analysis preprocessing is the
KEY! For example, replacing double line breaks (\r\n\r\n or \n\n) with #### is a
simple trick.
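
A minimal sketch of that preprocessing trick, assuming Python and assuming the
text has already been extracted from the doc/docx files (for example with Tika
or POI). The function names and the sample string are hypothetical; only the
#### marker comes from the suggestion above. It marks every blank-line gap,
then flattens each entry onto one line so that a plain line-oriented input
format sees one entry per record.

```python
import re

# Marker taken from the suggestion above. Assumption: "####" never occurs
# inside the real document text; pick a different marker if it does.
RECORD_MARKER = "####"

# Two or more consecutive line breaks (\n or \r\n), allowing spaces or tabs
# on the otherwise empty lines in between.
_BLANK_LINES = re.compile(r"(?:\r?\n[ \t]*){2,}")


def mark_records(text, marker=RECORD_MARKER):
    """Replace each run of blank lines with the marker."""
    return _BLANK_LINES.sub(marker, text.strip())


def flatten_record(record):
    """Collapse the line breaks left inside one entry into single spaces."""
    return " ".join(record.split())


def to_lines(text, marker=RECORD_MARKER):
    """Return one flattened entry per list item, skipping empty entries."""
    marked = mark_records(text, marker)
    return [flatten_record(r) for r in marked.split(marker) if r.strip()]


if __name__ == "__main__":
    sample = "entry one, line 1\r\nentry one, line 2\r\n\r\nentry two\n\nentry three\n"
    for line in to_lines(sample):
        print(line)
```

Writing the result of to_lines() out one entry per line means each multi-line
entry reaches a mapper as a single record, without a custom record reader.
Some Hadoop releases also let TextInputFormat split on a custom delimiter
through the textinputformat.record.delimiter property; whether that is
available depends on the version in use.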
-Håvard

On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
<[EMAIL PROTECTED]> wrote:
> Hi,
> Thank you for the suggestion. Actually I was using POI to extract text, but
> since I now have so many documents I thought I would use Hadoop directly
> to parse them as well. The average size of each document is around 120 KB. Also, I
> want to read multiple lines from the text until I find a blank line. I do
> not have any idea about how to design a custom input format and record reader.
> Please help with a tutorial, code or resource around it. I am
> struggling with the issue. I will be highly grateful. Thank you so much once
> again
>
>> Date: Fri, 24 Aug 2012 08:07:39 +0200
>> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> From: [EMAIL PROTECTED]
>> To: [EMAIL PROTECTED]
>
>>
>> It's much easier if you convert the documents to text first
>>
>> use
>> http://tika.apache.org/
>>
>> or some other doc parser
>>
>>
>> -Håvard
>>
>> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
>> <[EMAIL PROTECTED]> wrote:
>> > hi,
>> > I have files in MS Word doc and docx format. These have entries which
>> > are
>> > separated by an empty line. Is it possible for me to read
>> > all the lines of one entry (up to the next empty line) at a time? Also, which
>> > input format should I use to read doc and docx files? Please help
>> >
>> > *------------------------*
>> > Cheers !!!
>> > Siddharth Tiwari
>> > Have a refreshing day !!!
>> > "Every duty is holy, and devotion to duty is the highest form of worship
>> > of
>> > God."
>> > "Maybe other people will try to limit me but I don't limit myself"
>>
>>
>>
>> --
>> Håvard Wahl Kongsgård
>> Faculty of Medicine &
>> Department of Mathematical Sciences
>> NTNU
>>
>> http://havard.security-review.net/

--
Håvard Wahl Kongsgård
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/