Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce >> mail # user >> Reading multiple lines from a microsoft doc in hadoop


Copy link to this message
-
RE: Reading multiple lines from a microsoft doc in hadoop

CAn anybody enlighten me on what could be wrongg ?

*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.”

"Maybe other people will try to limit me but I don't limit myself"
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Sat, 25 Aug 2012 05:35:48 +0000

Any help on below would be really appreciated. i am stuck with it

*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.”

"Maybe other people will try to limit me but I don't limit myself"
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Fri, 24 Aug 2012 20:23:45 +0000

Hi ,

Can anyone please help ?

Thank you in advance

*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.”

"Maybe other people will try to limit me but I don't limit myself"
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: Reading multiple lines from a microsoft doc in hadoop
Date: Fri, 24 Aug 2012 16:22:57 +0000

Hi Team,

Thanks a lot for so many good suggestions. I wrote a custom input format for reading one paragraph at a time. But when I use it I get lines read. Can you please suggest what changes I must make to read one para at a time seperated by null lines ?
below is the code I wrote:-
import java.io.IOException;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.util.LineReader;
/**
 *
 */

/**
 * @author 460615
 *
 */
//FileInputFormat is the base class for all file-based InputFormats
public class ParaInputFormat extends FileInputFormat<LongWritable,Text> {
private String nullRegex = "^\\s*$" ;
public String StrLine = null;
/*public RecordReader<LongWritable, Text> getRecordReader (InputSplit genericSplit, JobConf job, Reporter reporter) throws IOException {
reporter.setStatus(genericSplit.toString());
return new ParaInputFormat(job, (FileSplit)genericSplit);
}*/
public RecordReader<LongWritable, Text> createRecordReader(InputSplit genericSplit, TaskAttemptContext context)throws IOException {
   context.setStatus(genericSplit.toString());
   return new LineRecordReader();
 }
public InputSplit[] getSplits(JobContext job, Configuration conf) throws IOException {
ArrayList<FileSplit> splits = new ArrayList<FileSplit>();
for (FileStatus status : listStatus(job)) {
Path fileName = status.getPath();
if (status.isDir()) {
throw new IOException("Not a file: " + fileName);
}
FileSystem  fs = fileName.getFileSystem(conf);
LineReader lr = null;
try {
FSDataInputStream in  = fs.open(fileName);
lr = new LineReader(in, conf);
// String regexMatch =in.readLine();
Text line = new Text();
long begin = 0;
long length = 0;
int num = -1;
String boolTest = null;
boolean match = false;
Pattern p = Pattern.compile(nullRegex);
// Matcher matcher = new p.matcher();
while ((boolTest = in.readLine()) != null && (num = lr.readLine(line)) > 0 && ! ( in.readLine().isEmpty())){
// numLines++;
length += num;
 
 
splits.add(new FileSplit(fileName, begin, length, new String[]{}));}
begin=length;
}finally {
if (lr != null) {
lr.close();
}
 
 
 
}
 
}
return splits.toArray(new FileSplit[splits.size()]);
}
 
}
*------------------------*

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of God.”

"Maybe other people will try to limit me but I don't limit myself"