Hadoop, mail # user - Re: Reading multiple lines from a microsoft doc in hadoop


Re: Reading multiple lines from a microsoft doc in hadoop
Mohammad Tariq 2012-08-24, 07:41
Sorry, I forgot the link:
http://hadoopchicago.com/tips-tricks/custom-xmlreader-boris-lublinsky-michael-segel/

Regards,
    Mohammad Tariq
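
For the requirement quoted below (reading lines until a blank line is found), the grouping logic that a custom record reader would need can be sketched in plain Java. This is only a rough illustration, assuming the documents have already been converted to text (for example with Tika or POI). The class and method names are hypothetical and nothing here uses the Hadoop API; in a real job a loop like this would probably sit inside a RecordReader's nextKeyValue() method and read from the input split rather than from a string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: it shows only the grouping logic
// that a custom record reader could apply to already-extracted plain text.
class BlankLineRecords {

    // Reads lines until a blank line (or end of input) and returns them joined
    // with '\n' as one record. Returns null when no record is left.
    static String nextRecord(BufferedReader in) throws IOException {
        StringBuilder record = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                if (record.length() > 0) {
                    return record.toString(); // a blank line closes the record
                }
                continue; // skip leading or repeated blank lines
            }
            if (record.length() > 0) {
                record.append('\n');
            }
            record.append(line);
        }
        // End of input: return the last record if it was not blank-terminated.
        return record.length() > 0 ? record.toString() : null;
    }

    // Convenience wrapper that collects every record found in a string.
    static List<String> allRecords(String text) {
        List<String> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String record;
            while ((record = nextRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "entry one, line 1\nentry one, line 2\n\n\nentry two\n";
        for (String record : allRecords(text)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

With the sample text in main, this yields two records: the two-line first entry and "entry two". The doubled blank line does not produce an empty record.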

On Fri, Aug 24, 2012 at 1:10 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> Hello Siddharth,
>
>        You can tweak "NLineInputFormat" to fit your requirement and
> use it. It lets us read a specified number of lines at a time,
> unlike "TextInputFormat". Here is a good post by Boris and Michael on
> writing a custom record reader. I would also suggest combining
> similar files into one bigger file if feasible, as your
> files are very small.
>
> Regards,
>     Mohammad Tariq
>
>
>
> On Fri, Aug 24, 2012 at 1:00 PM, Siddharth Tiwari <
> [EMAIL PROTECTED]> wrote:
>
>> Hi,
>> Thank you for the suggestion. Actually, I was using POI to extract the text,
>> but since I now have so many documents, I thought I would use Hadoop
>> directly to parse them as well. The average size of each document is around
>> 120 KB. I also want to read multiple lines from the text until I find a
>> blank line. I have no idea how to design a custom input format and record
>> reader. Please help with a tutorial, code, or other resource on
>> this. I am struggling with the issue and would be highly grateful. Thank you
>> so much once again.
>>
>> > Date: Fri, 24 Aug 2012 08:07:39 +0200
>> > Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> > From: [EMAIL PROTECTED]
>> > To: [EMAIL PROTECTED]
>> >
>> > It's much easier if you convert the documents to text first
>> >
>> > use
>> > http://tika.apache.org/
>> >
>> > or some other doc parser
>> >
>> >
>> > -Håvard
>> >
>> > On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
>> > <[EMAIL PROTECTED]> wrote:
>> > > Hi,
>> > > I have files in MS Word .doc and .docx format. They contain entries
>> > > that are separated by an empty line. Is it possible to read these
>> > > entries one at a time, using the empty lines as delimiters? Also, which
>> > > input format should I use to read .doc and .docx files? Please help.
>> > >
>> > > *------------------------*
>> > > Cheers !!!
>> > > Siddharth Tiwari
>> > > Have a refreshing day !!!
>> > > "Every duty is holy, and devotion to duty is the highest form of
>> > > worship of God."
>> > > "Maybe other people will try to limit me but I don't limit myself"
>> >
>> >
>> >
>> > --
>> > Håvard Wahl Kongsgård
>> > Faculty of Medicine &
>> > Department of Mathematical Sciences
>> > NTNU
>> >
>> > http://havard.security-review.net/
>>
>
>