Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce, mail # user - Reading multiple lines from a microsoft doc in hadoop


Copy link to this message
-
Re: Reading multiple lines from a microsoft doc in hadoop
Håvard Wahl Kongsgård 2012-08-24, 06:07
It's much easier if you convert the documents to text first

use
http://tika.apache.org/

or some other doc parser
-Håvard

On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
<[EMAIL PROTECTED]> wrote:
> hi,
> I have doc files in msword doc and docx format. These have entries which are
> seperated by an empty line. Is it possible for me to read
> these lines separated from empty lines at a time. Also which inpurformat
> shall I use to read doc docx. Please help
>
> *------------------------*
> Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
> "Every duty is holy, and devotion to duty is the highest form of worship of
> God.”
> "Maybe other people will try to limit me but I don't limit myself"

--
Håvard Wahl Kongsgård
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/