HDFS >> mail # user >> Inputformat


Hi,

  I am using a library that relies on InputFormat.
Right now it reads XML files whose records span multiple lines.
The current input format looks like this:

public class XMLInputReader extends FileInputFormat<LongWritable, Text> {

  public static final String START_TAG = "<page>";
  public static final String END_TAG = "</page>";

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
      JobConf conf, Reporter reporter) throws IOException {
    conf.set(XMLInputFormat.START_TAG_KEY, START_TAG);
    conf.set(XMLInputFormat.END_TAG_KEY, END_TAG);
    return new XMLRecordReader((FileSplit) split, conf);
  }
}
So, with the above, if the data looks like:

<page>

 something \n
something \n

</page>

it processes this sort of data.
Now I want to use the same framework for JSON files, where each record
sits on a single line.

So I guess my START_TAG can be "{".

Will my END_TAG be "}\n"?

It can't be just "}", as there can be nested JSON objects in this data, right?
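
To show what I mean, here is a minimal sketch in plain Java. The class EndTagDemo and the method firstRecord are made up for illustration and are not part of the library; I am assuming the record reader simply takes everything from a START_TAG up to the first END_TAG that follows it.

```java
class EndTagDemo {
    // Hypothetical helper: returns the text from the first startTag through
    // the first endTag found after it, or null if either tag is missing.
    // This only mimics naive tag matching; it is not the library's reader.
    static String firstRecord(String data, String startTag, String endTag) {
        int start = data.indexOf(startTag);
        if (start < 0) {
            return null;
        }
        int end = data.indexOf(endTag, start + startTag.length());
        if (end < 0) {
            return null;
        }
        return data.substring(start, end + endTag.length());
    }

    public static void main(String[] args) {
        // Two single-line JSON records; the first contains a nested object.
        String data = "{\"a\": {\"b\": 1}, \"c\": 2}\n{\"d\": 3}\n";

        // END_TAG = "}" stops at the nested object's closing brace.
        System.out.println(firstRecord(data, "{", "}"));   // {"a": {"b": 1}

        // END_TAG = "}\n" stops at the end of the line instead.
        System.out.print(firstRecord(data, "{", "}\n"));   // {"a": {"b": 1}, "c": 2}
    }
}
```

If that assumption holds, "}\n" would only work as long as every record ends with "}" immediately followed by a Unix newline; a trailing space or a "\r\n" line ending would break the match.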

Any clues?
Thanks