How to use HarFileSystem?
Dear list,
I'm rather new to HDFS and I am trying to figure out how to use the
HarFileSystem class. I have created a little sample HAR archive for
testing purposes that looks like this:

=============================================================
$ bin/hadoop fs -ls har:///WPD.har/00001
Found 8 items
-rw-r--r--   1 schnober supergroup       6516 2012-08-15 17:53 /WPD.har/00001/text.xml
-rw-r--r--   1 schnober supergroup        471 2012-08-15 17:53 /WPD.har/00001/metadata.xml
drwxr-xr-x   - schnober supergroup          0 2012-08-15 17:53 /WPD.har/00001/xip
drwxr-xr-x   - schnober supergroup          0 2012-08-15 17:53 /WPD.har/00001/connexor
drwxr-xr-x   - schnober supergroup          0 2012-08-15 17:53 /WPD.har/00001/base
-rw-r--r--   1 schnober supergroup       3728 2012-08-15 17:53 /WPD.har/00001/header.xml
-rw-r--r--   1 schnober supergroup       6209 2012-08-15 17:53 /WPD.har/00001/text.txt
drwxr-xr-x   - schnober supergroup          0 2012-08-15 17:53 /WPD.har/00001/tree_tagger
=============================================================
Now, I am trying to read the files contained in that archive
programmatically with the following Java code:

=============================================================
FileSystem hdfs;
HarFileSystem harfs;
Path dir = new Path("har:///WPD.har/00001");
Configuration conf = new Configuration();
conf.addResource(new Path("/home/schnober/hadoop-1.0.3/conf/core-site.xml"));
System.out.println(conf.get("fs.default.name"));
FileStatus[] files;
FSDataInputStream in;

try {
  hdfs = FileSystem.get(conf);
  harfs = new HarFileSystem(hdfs);
  files = harfs.listStatus(dir);
  System.err.println("Reading "+files.length+" files in "+dir);

  for (FileStatus file : files) {
    if (file.isDir())
      continue;
    byte[] buffer = new byte[(int) file.getLen()];
    in = harfs.open(file.getPath());
    in.readFully(buffer);
    System.out.println(new String(buffer));
    in.close();
  }
} catch (IOException e) {
  e.printStackTrace();
}
=============================================================
However, a NullPointerException is thrown when harfs.listStatus(dir) is
executed. I suppose this means that 'dir' supposedly does not exist, as
stated in the Javadoc for HarFileSystem.listStatus(): "returns null, if
Path f does not exist in the FileSystem."

I've tried numerous variations, such as omitting the path inside the HAR
archive, but apparently the HAR archive still cannot be read. Reading the
plain HDFS filesystem with the same configuration through the FileSystem
class works fine, though.
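
For completeness, the only alternative I could think of is to let the
FileSystem factory resolve the har scheme itself, the way the command-line
client presumably does, instead of wrapping the HDFS FileSystem by hand.
Below is a minimal, self-contained sketch of that idea. The class name is
made up, and I am only assuming that Path.getFileSystem() picks up the
HarFileSystem implementation via fs.har.impl and initializes it; the null
check merely follows the Javadoc quoted above. I have not verified that
this behaves any differently:

=============================================================
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarReadTest {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.addResource(new Path("/home/schnober/hadoop-1.0.3/conf/core-site.xml"));

    // Same archive path as above; getFileSystem() should resolve the "har"
    // scheme and initialize the resulting FileSystem instance, which is
    // what I suspect my manually wrapped HarFileSystem is missing.
    Path dir = new Path("har:///WPD.har/00001");
    FileSystem harfs = dir.getFileSystem(conf);

    FileStatus[] files = harfs.listStatus(dir);
    if (files == null) {
      // Per the Javadoc: null means the path does not exist.
      System.err.println(dir + " does not exist");
      return;
    }
    System.err.println("Reading " + files.length + " files in " + dir);

    for (FileStatus file : files) {
      if (file.isDir())
        continue;
      byte[] buffer = new byte[(int) file.getLen()];
      FSDataInputStream in = harfs.open(file.getPath());
      in.readFully(buffer);  // readFully() fills the whole buffer
      System.out.println(new String(buffer));
      in.close();
    }
  }
}
=============================================================
If this factory route worked where the manual wrapper does not, that would
at least tell me the problem lies in how I construct or initialize the
HarFileSystem rather than in the archive itself.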

I assume that I'm just not aware of how to use the HarFileSystem class
correctly, but I haven't been able to find more detailed explanations or
examples; maybe a pointer to some sample code would already help me.
Thank you very much!
Carsten

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | [EMAIL PROTECTED]
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform