Re: Reading multiple lines from a microsoft doc in hadoop
Håvard Wahl Kongsgård 2012-08-24, 07:54
Hi, maybe you should check out the old Nutch project,
http://nutch.apache.org/ (Hadoop was developed for Nutch). It's a web crawler and indexer, but its mailing lists hold a lot of information on doc/PDF parsing that also applies to Hadoop.

I have never parsed many docx or doc files, but it should be straightforward. Generally, though, preprocessing is the KEY for text analysis! For example, replacing double line breaks (\r\n\r\n or \n\n) with #### is a simple trick.

-Håvard

On Fri, Aug 24, 2012 at 9:30 AM, Siddharth Tiwari <[EMAIL PROTECTED]> wrote:
> Hi,
> Thank you for the suggestion. Actually I was using poi to extract text, but
> since I now have so many documents I thought I would use hadoop directly
> to parse as well. Average size of each document is around 120 kb. Also I
> want to read multiple lines from the text until I find a blank line. I do
> not have any idea ankit how to design custom input format and record reader.
> Please help with some tutorial, code or resource around it. I am
> struggling with the issue. I will be highly grateful. Thank you so much once
> again
>
>> Date: Fri, 24 Aug 2012 08:07:39 +0200
>> Subject: Re: Reading multiple lines from a microsoft doc in hadoop
>> From: [EMAIL PROTECTED]
>> To: [EMAIL PROTECTED]
>>
>> It's much easier if you convert the documents to text first
>>
>> use
>> http://tika.apache.org/
>>
>> or some other doc parser
>>
>>
>> -Håvard
>>
>> On Fri, Aug 24, 2012 at 7:52 AM, Siddharth Tiwari
>> <[EMAIL PROTECTED]> wrote:
>> > hi,
>> > I have doc files in msword doc and docx format. These have entries which
>> > are separated by an empty line. Is it possible for me to read
>> > these lines, separated by empty lines, one entry at a time? Also, which
>> > inputformat shall I use to read doc and docx? Please help
>> >
>> > *------------------------*
>> > Cheers !!!
>> > Siddharth Tiwari
>> > Have a refreshing day !!!
>> > "Every duty is holy, and devotion to duty is the highest form of worship
>> > of God."
>> > "Maybe other people will try to limit me but I don't limit myself"
>>
>>
>>
>> --
>> Håvard Wahl Kongsgård
>> Faculty of Medicine &
>> Department of Mathematical Sciences
>> NTNU
>>
>> http://havard.security-review.net/
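
The preprocessing trick from the reply (turning blank-line separators into a #### marker once the documents have been converted to plain text) could look roughly like the sketch below. It is only an illustration, not code from this thread: the function names mark_records and split_records are made up here, and it assumes the .doc/.docx files were already converted to text (for example with Tika or POI) and that #### never occurs in the text itself.

```python
import re

# Marker suggested in the reply; assumed never to occur in the document text.
RECORD_MARKER = "####"


def mark_records(text, marker=RECORD_MARKER):
    """Replace every run of blank lines with a single marker line."""
    # Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A line break followed by one or more empty or whitespace-only lines
    # is treated as a record boundary.
    return re.sub(r"\n(?:[ \t]*\n)+", lambda _m: "\n" + marker + "\n", text.strip())


def split_records(text, marker=RECORD_MARKER):
    """Return one single-line string per blank-line-separated entry."""
    marked = mark_records(text, marker)
    records = marked.split("\n" + marker + "\n")
    # Collapse the line breaks inside an entry so each entry fits on one line.
    return [" ".join(rec.split()) for rec in records if rec.strip()]


if __name__ == "__main__":
    sample = "entry one, line 1\r\nentry one, line 2\r\n\r\nentry two\r\n"
    for record in split_records(sample):
        print(record)
```

Writing each entry out on its own line this way means a plain line-oriented input format may be enough, so a custom input format and record reader might not be needed at all.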