MapReduce >> mail # user >> Reading multiple lines from a microsoft doc in hadoop


Re: Reading multiple lines from a microsoft doc in hadoop
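It's much easier if you convert the documents to plain text first.

Use Apache Tika (http://tika.apache.org/) or some other document parser.
Once the files are plain text, the entries are just blocks of lines
separated by empty lines.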
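
Once the text has been extracted, grouping the lines into entries is the easy part. Below is a minimal sketch, assuming the document has already been converted to a plain-text string (for example by Tika, which is not called here). The class name BlankLineEntries and the method splitEntries are hypothetical names made up for this illustration and are not part of Hadoop or Tika.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hadoop or Tika.
final class BlankLineEntries {

    // Splits already-extracted plain text into entries, where entries are
    // separated by one or more empty (or whitespace-only) lines.
    static List<String> splitEntries(String text) {
        List<String> entries = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // "\\R" matches any line break (\n, \r\n or \r); needs Java 8 or later.
        for (String line : text.split("\\R", -1)) {
            if (line.trim().isEmpty()) {
                // A blank line closes the entry collected so far, if any.
                if (current.length() > 0) {
                    entries.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Non-blank line: append it to the current entry.
                if (current.length() > 0) {
                    current.append('\n');
                }
                current.append(line);
            }
        }
        // Flush the last entry when the text does not end with a blank line.
        if (current.length() > 0) {
            entries.add(current.toString());
        }
        return entries;
    }
}
```

Calling BlankLineEntries.splitEntries(text) on the extracted text returns one string per entry, with the lines inside an entry joined by a newline. How this is wired into a MapReduce job (for instance, converting the files before the job or inside a mapper) is left open here and depends on the setup.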
-Håvard

On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
<[EMAIL PROTECTED]> wrote:
> hi,
> I have files in MS Word doc and docx format. These have entries which are
> separated by an empty line. Is it possible for me to read
> one entry (the lines between two empty lines) at a time? Also, which InputFormat
> should I use to read doc and docx files? Please help
>
> *------------------------*
> Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
> "Every duty is holy, and devotion to duty is the highest form of worship of
> God."
> "Maybe other people will try to limit me but I don't limit myself"

--
Håvard Wahl Kongsgård
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/