Re: Reading multiple lines from a microsoft doc in hadoop
Hi, maybe you should check out the old Nutch project,
http://nutch.apache.org/ (Hadoop was developed for Nutch).
It's a web crawler and indexer, but its mailing lists hold a lot of info
on doc/PDF parsing, which also relates to Hadoop.

I have never parsed many docx or doc files, but it should be
straightforward. Generally, for text analysis, preprocessing is the
KEY! For example, replacing double line breaks (\r\n\r\n or \n\n) with
#### is a simple trick.
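
A minimal sketch of that trick in plain Java, assuming the documents have
already been converted to text. The class and method names (BlankLineMarker,
mark) are made up for illustration, and the regex also treats lines that
contain only spaces or tabs as blank, which is my assumption about the data:

```java
import java.util.regex.Pattern;

final class BlankLineMarker {
    // A line break followed by one or more blank lines.
    // Accepts \n or \r\n endings; blank lines may hold spaces or tabs.
    private static final Pattern BLANK_LINES =
            Pattern.compile("\\r?\\n(?:[ \\t]*\\r?\\n)+");

    // Marker from the mail above; pick something that cannot occur in the text.
    static final String MARKER = "####";

    /** Replaces every run of blank lines with the marker. */
    static String mark(String text) {
        return BLANK_LINES.matcher(text).replaceAll(MARKER);
    }

    public static void main(String[] args) {
        String text = "entry one\r\nline two\r\n\r\nentry two\n\nentry three";
        // Single line breaks inside an entry are left untouched.
        System.out.println(mark(text));
    }
}
```

After this step each entry is delimited by ####, so a later stage can split
on the marker instead of having to detect blank lines itself.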
-Håvard

On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari
<[EMAIL PROTECTED]> wrote:
> Hi,
> Thank you for the suggestion. Actually I was using POI to extract text, but
> since I now have so many documents I thought I would use Hadoop directly
> to parse them as well. The average size of each document is around 120 KB.
> Also, I want to read multiple lines from the text until I find a blank line.
> I do not have any idea about how to design a custom input format and record
> reader. Please help with some tutorial, code or resource around it. I am
> struggling with the issue. I will be highly grateful. Thank you so much once
> again
>
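
On reading lines until a blank line: a minimal plain-Java sketch of the core
loop, assuming the text has already been extracted from the doc/docx files.
BlankLineRecords is a made-up name, not a Hadoop class. In a real job I would
expect the same loop to sit inside nextKeyValue() of a custom
org.apache.hadoop.mapreduce.RecordReader, which is not shown here:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class BlankLineRecords {
    private final BufferedReader in;

    BlankLineRecords(BufferedReader in) {
        this.in = in;
    }

    /** Returns the next entry (its lines joined by '\n'), or null at end of input. */
    String nextRecord() {
        StringBuilder record = new StringBuilder();
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    if (record.length() > 0) {
                        return record.toString(); // a blank line closes the entry
                    }
                    continue; // skip leading or repeated blank lines
                }
                if (record.length() > 0) {
                    record.append('\n');
                }
                record.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The last entry may end at end of input without a trailing blank line.
        return record.length() > 0 ? record.toString() : null;
    }

    public static void main(String[] args) {
        String text = "entry one\nstill entry one\n\n\nentry two\n";
        BlankLineRecords records =
                new BlankLineRecords(new BufferedReader(new StringReader(text)));
        for (String r = records.nextRecord(); r != null; r = records.nextRecord()) {
            System.out.println("[" + r + "]");
        }
    }
}
```

One thing this sketch ignores is input splits: an entry that crosses a split
boundary needs extra handling, or the input format has to be made
non-splittable.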
>> Date: Fri, 24 Aug 2012 08:07:39 +0200
>> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> From: [EMAIL PROTECTED]
>> To: [EMAIL PROTECTED]
>
>>
>> It's much easier if you convert the documents to text first
>>
>> use
>> http://tika.apache.org/
>>
>> or some other doc parser
>>
>>
>> -Håvard
>>
>> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
>> <[EMAIL PROTECTED]> wrote:
>> > hi,
>> > I have doc files in MS Word doc and docx format. These have entries which
>> > are
>> > separated by an empty line. Is it possible for me to read
>> > the lines between empty lines one entry at a time? Also, which InputFormat
>> > shall I use to read doc and docx files? Please help.
>> >
>> > *------------------------*
>> > Cheers !!!
>> > Siddharth Tiwari
>> > Have a refreshing day !!!
>> > "Every duty is holy, and devotion to duty is the highest form of worship
>> > of
>> > God."
>> > "Maybe other people will try to limit me but I don't limit myself"
>>
>>
>>
>> --
>> Håvard Wahl Kongsgård
>> Faculty of Medicine &
>> Department of Mathematical Sciences
>> NTNU
>>
>> http://havard.security-review.net/

--
Håvard Wahl Kongsgård
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/