Re: Reading multiple lines from a microsoft doc in hadoop
Mohammad Tariq 2012-08-24, 07:41
Sorry, I forgot the link:
http://hadoopchicago.com/tips-tricks/custom-xmlreader-boris-lublinsky-michael-segel/

Regards,
Mohammad Tariq

On Fri, Aug 24, 2012 at 1:10 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> Hello Siddharth,
>
> You can tweak the "NLineInputFormat" as per your requirement and
> use it. It allows us to read a specified number of lines,
> unlike "TextInputFormat". Here is a good post by Boris and Michael on
> custom record readers. Also, I would suggest that you
> combine similar files together into one bigger file if feasible, as your
> files are very small.
>
> Regards,
> Mohammad Tariq
>
> On Fri, Aug 24, 2012 at 1:00 PM, Siddharth Tiwari <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>> Thank you for the suggestion. Actually I was using POI to extract text,
>> but since I now have so many documents I thought I would use Hadoop
>> directly to parse them as well. The average size of each document is around 120 KB.
>> Also, I want to read multiple lines from the text until I find a blank line.
>> I do not have any idea about how to design a custom input format and record
>> reader. Please help with some tutorial, code or resource around
>> it. I am struggling with the issue. I will be highly grateful. Thank you so
>> much once again.
>>
>> > Date: Fri, 24 Aug 2012 08:07:39 +0200
>> > Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> > From: [EMAIL PROTECTED]
>> > To: [EMAIL PROTECTED]
>> >
>> > It's much easier if you convert the documents to text first
>> >
>> > use
>> > http://tika.apache.org/
>> >
>> > or some other doc parser
>> >
>> > -Håvard
>> >
>> > On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
>> > <[EMAIL PROTECTED]> wrote:
>> > > hi,
>> > > I have doc files in msword doc and docx format. These have entries which are
>> > > separated by an empty line. Is it possible for me to read
>> > > these entries, separated by empty lines, one at a time? Also, which InputFormat
>> > > shall I use to read doc and docx?
>> > > Please help
>> > >
>> > > *------------------------*
>> > > Cheers !!!
>> > > Siddharth Tiwari
>> > > Have a refreshing day !!!
>> > > "Every duty is holy, and devotion to duty is the highest form of worship of
>> > > God."
>> > > "Maybe other people will try to limit me but I don't limit myself"
>> >
>> > --
>> > Håvard Wahl Kongsgård
>> > Faculty of Medicine &
>> > Department of Mathematical Sciences
>> > NTNU
>> >
>> > http://havard.security-review.net/
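
For readers following the thread, the "read lines until a blank line" requirement can be sketched outside Hadoop. The snippet below is only an illustration in Python, assuming the .doc/.docx files have already been converted to plain text (for example with Tika, as suggested in the thread). The name `iter_records` is hypothetical and is not part of Hadoop, Tika, or POI. In an actual MapReduce job the same grouping logic would presumably sit inside a custom record reader.

```python
from typing import Iterable, Iterator, List


def iter_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield groups of consecutive non-blank lines.

    A blank (or whitespace-only) line closes the current group.
    Runs of several blank lines do not produce empty groups.
    """
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")  # drop the line terminator only
        if line.strip():
            current.append(line)
        elif current:
            # Blank line reached: emit the entry collected so far.
            yield current
            current = []
    if current:
        # Final entry when the input does not end with a blank line.
        yield current


if __name__ == "__main__":
    # Hypothetical sample text standing in for a converted document.
    sample = "entry one, line 1\nentry one, line 2\n\n\nentry two, line 1\n"
    for number, record in enumerate(
        iter_records(sample.splitlines(keepends=True)), start=1
    ):
        print(number, " | ".join(record))
```

Each yielded list corresponds to one entry of the document, which is roughly what a record reader would hand to the mapper as a single value.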