Hadoop >> mail # user >> Re: Reading multiple lines from a microsoft doc in hadoop


Re: Reading multiple lines from a microsoft doc in hadoop
Sorry, I forgot the link:
http://hadoopchicago.com/tips-tricks/custom-xmlreader-boris-lublinsky-michael-segel/

Regards,
    Mohammad Tariq
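
The record-reader idea discussed in this thread comes down to one loop: keep reading lines until a blank line, then emit everything collected so far as a single record. Below is a minimal sketch of that loop in plain Java, assuming the documents have already been converted to text. The class and method names (BlankLineRecords, nextRecord, readAll) are hypothetical and are not part of Hadoop. In a real custom RecordReader the same logic would roughly sit inside nextKeyValue(), and it would additionally have to deal with records that straddle split boundaries, which this sketch does not attempt.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Hadoop class): groups lines of text into
// records that are separated by one or more blank lines.
final class BlankLineRecords {

    // Returns the next record with its lines joined by '\n',
    // or null when the input is exhausted.
    private static String nextRecord(BufferedReader in) throws IOException {
        StringBuilder record = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                if (record.length() > 0) {
                    return record.toString(); // a blank line closes the current record
                }
                continue; // skip leading or repeated blank lines
            }
            if (record.length() > 0) {
                record.append('\n');
            }
            record.append(line);
        }
        // End of input: emit a trailing record if one was being collected.
        return record.length() > 0 ? record.toString() : null;
    }

    // Reads every blank-line-delimited record from the given source.
    static List<String> readAll(Reader source) {
        List<String> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String record;
            while ((record = nextRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "entry one, line 1\nentry one, line 2\n\nentry two\n\n\nentry three\n";
        for (String record : readAll(new StringReader(text))) {
            System.out.println("[" + record.replace("\n", " | ") + "]");
        }
    }
}
```

Running main prints three bracketed records, the first containing both lines of entry one. Some Hadoop releases also let TextInputFormat split on a custom delimiter through the textinputformat.record.delimiter property; whether that is available depends on the version in use, so please check before relying on it.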

On Fri, Aug 24, 2012 at 1:10 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> Hello Siddharth,
>
>        You can adapt "NLineInputFormat" to your requirements and use it.
> It allows us to read a specified number of lines, unlike
> "TextInputFormat". Here is a good post by Boris and Michael on writing a
> custom record reader. I would also suggest combining similar files into
> one bigger file if feasible, as your files are very small.
>
> Regards,
>     Mohammad Tariq
>
>
>
> On Fri, Aug 24, 2012 at 1:00 PM, Siddharth Tiwari <
> [EMAIL PROTECTED]> wrote:
>
>> Hi,
>> Thank you for the suggestion. Actually, I was using POI to extract text,
>> but since I now have so many documents, I thought I would use Hadoop
>> directly to parse them as well. The average size of each document is
>> around 120 KB. I also want to read multiple lines from the text until I
>> find a blank line. I do not have any idea about how to design a custom
>> input format and record reader. Please help with a tutorial, code, or
>> other resource on it. I am struggling with the issue. I will be highly
>> grateful. Thank you so much once again.
>>
>> > Date: Fri, 24 Aug 2012 08:07:39 +0200
>> > Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> > From: [EMAIL PROTECTED]
>> > To: [EMAIL PROTECTED]
>> >
>> > It's much easier if you convert the documents to text first.
>> >
>> > Use http://tika.apache.org/ or some other doc parser.
>> >
>> >
>> > -Håvard
>> >
>> > On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
>> > <[EMAIL PROTECTED]> wrote:
>> > > Hi,
>> > > I have files in MS Word doc and docx format. They contain entries that
>> > > are separated by an empty line. Is it possible to read the lines of
>> > > one entry, up to the next empty line, at a time? Also, which
>> > > InputFormat should I use to read doc and docx files? Please help.
>> > >
>> > > *------------------------*
>> > > Cheers !!!
>> > > Siddharth Tiwari
>> > > Have a refreshing day !!!
>> > > "Every duty is holy, and devotion to duty is the highest form of
>> > > worship of God."
>> > > "Maybe other people will try to limit me but I don't limit myself"
>> >
>> >
>> >
>> > --
>> > Håvard Wahl Kongsgård
>> > Faculty of Medicine &
>> > Department of Mathematical Sciences
>> > NTNU
>> >
>> > http://havard.security-review.net/
>>
>
>
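
To make the "convert the documents to text first" advice above a bit more concrete, here is a rough sketch for the .docx case only. It relies on the fact that a .docx file is a zip archive whose main text lives in word/document.xml, and it uses nothing beyond the Java standard library (Java 9+ for readAllBytes). This is an illustration, not a replacement for Tika or POI as recommended in the thread: it does not handle the binary .doc format, and nested content such as tables or text boxes may come out duplicated or out of order. The names DocxToText, extract, and tinyDocx are hypothetical, and tinyDocx exists only to build a toy archive for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper: pulls paragraph text out of a .docx, one line per paragraph.
final class DocxToText {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Finds word/document.xml inside the archive and converts it to plain text.
    static String extract(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    return paragraphsToText(zip.readAllBytes());
                }
            }
            throw new IllegalArgumentException("no word/document.xml entry; not a .docx?");
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException("could not read .docx", e);
        }
    }

    // Emits the text of each <w:p> paragraph followed by a newline, so an
    // empty paragraph in the document becomes a blank line in the output.
    private static String paragraphsToText(byte[] xml)
            throws IOException, SAXException, ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the input may be untrusted.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            NodeList texts = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                out.append(texts.item(j).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }

    // Demo-only helper: wraps the given XML in a one-entry zip archive.
    static byte[] tinyDocx(String documentXml) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(documentXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w='" + W_NS + "'><w:body>"
            + "<w:p><w:r><w:t>first entry</w:t></w:r></w:p>"
            + "<w:p/>"
            + "<w:p><w:r><w:t>second </w:t></w:r><w:r><w:t>entry</w:t></w:r></w:p>"
            + "</w:body></w:document>";
        System.out.print(extract(new ByteArrayInputStream(tinyDocx(xml))));
    }
}
```

Because each empty paragraph turns into a blank line, the extracted text keeps the blank-line separators between entries, so it can then be grouped into multi-line records by reading until the next blank line.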