MapReduce, mail # user - Reading multiple lines from a microsoft doc in hadoop


Re: Reading multiple lines from a microsoft doc in hadoop
Bertrand Dechoux 2012-08-24, 06:10
And that would help you with performance too.
Were you originally planning to have one file per Word document?
What is the average size of your Word documents?
It is probably small, and in that case I am afraid your map startup time
won't be negligible compared with the actual processing.

Regards

Bertrand

On Fri, Aug 24, 2012 at 8:07 AM, Håvard Wahl Kongsgård <
[EMAIL PROTECTED]> wrote:

> It's much easier if you convert the documents to text first
>
> use
> http://tika.apache.org/
>
> or some other doc parser
>
>
> -Håvard
>
> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
> <[EMAIL PROTECTED]> wrote:
> > hi,
> > I have files in MS Word doc and docx format. These have entries which are
> > separated by an empty line. Is it possible for me to read
> > these entries, delimited by empty lines, one at a time? Also, which
> > InputFormat shall I use to read doc and docx files? Please help.
> >
> > *------------------------*
> > Cheers !!!
> > Siddharth Tiwari
> > Have a refreshing day !!!
> > "Every duty is holy, and devotion to duty is the highest form of worship of
> > God."
> > "Maybe other people will try to limit me but I don't limit myself"
>
>
>
> --
> Håvard Wahl Kongsgård
> Faculty of Medicine &
> Department of Mathematical Sciences
> NTNU
>
> http://havard.security-review.net/
>

--
Bertrand Dechoux
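
A minimal sketch of the approach suggested in this thread, assuming the Word
documents have already been converted to plain text (for example with Tika,
which is not shown here). It splits the text into entries on empty lines and
flattens each entry to a single line, so that the entries of many small
documents could be concatenated into one large file and read line by line with
Hadoop's default TextInputFormat. The function names split_entries and
to_records are hypothetical and chosen only for illustration.

```python
import re


def split_entries(text):
    """Split plain text into entries separated by one or more empty lines."""
    # Normalise Windows and old Mac line endings to "\n" first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # An "empty" line may still contain spaces or tabs, hence \s*.
    blocks = re.split(r"\n\s*\n", text)
    # Drop blocks that are empty after trimming (leading/trailing blank lines).
    return [block.strip() for block in blocks if block.strip()]


def to_records(text):
    """Return one single-line record per entry.

    Collapsing all whitespace inside an entry to single spaces means each
    entry occupies exactly one line in the output file.
    """
    return [" ".join(entry.split()) for entry in split_entries(text)]


if __name__ == "__main__":
    sample = "first entry\nline two\n\nsecond entry\n\n\nthird entry\n"
    for record in to_records(sample):
        print(record)
```

Writing one record per line is a design choice rather than a requirement: it
avoids a custom InputFormat and lets a few large input files replace many small
ones, which addresses the map startup cost mentioned above.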