MapReduce >> mail # user >> How to use HarFileSystem?


How to use HarFileSystem?
Dear list,
I'm rather new to HDFS and I am trying to figure out how to use the
HarFileSystem class. I have created a small sample HAR archive for
testing purposes that looks like this:

=============================================================
$ bin/hadoop fs -ls har:///WPD.har/00001
Found 8 items
-rw-r--r--   1 schnober supergroup       6516 2012-08-15 17:53 /WPD.har/00001/text.xml
-rw-r--r--   1 schnober supergroup        471 2012-08-15 17:53 /WPD.har/00001/metadata.xml
drwxr-xr-x   - schnober supergroup          0 2012-08-15 17:53 /WPD.har/00001/xip
drwxr-xr-x   - schnober supergroup          0 2012-08-15 17:53 /WPD.har/00001/connexor
drwxr-xr-x   - schnober supergroup          0 2012-08-15 17:53 /WPD.har/00001/base
-rw-r--r--   1 schnober supergroup       3728 2012-08-15 17:53 /WPD.har/00001/header.xml
-rw-r--r--   1 schnober supergroup       6209 2012-08-15 17:53 /WPD.har/00001/text.txt
drwxr-xr-x   - schnober supergroup          0 2012-08-15 17:53 /WPD.har/00001/tree_tagger
=============================================================
Now, I am trying to read the files contained in that archive
programmatically with the following Java code:

=============================================================
FileSystem hdfs;
HarFileSystem harfs;
Path dir = new Path("har:///WPD.har/00001");
Configuration conf = new Configuration();
conf.addResource(new
Path("/home/schnober/hadoop-1.0.3/conf/core-site.xml"));
System.out.println(conf.get("fs.default.name"));
FileStatus[] files;
FSDataInputStream in;

try {
  hdfs = FileSystem.get(conf);
  harfs = new HarFileSystem(hdfs);
  files = harfs.listStatus(dir);
  System.err.println("Reading "+files.length+" files in "+dir);

  for (FileStatus file : files) {
    if (file.isDir())
      continue;
    byte[] buffer = new byte[(int) file.getLen()];
    in = harfs.open(file.getPath());
    in.read(buffer);
    System.out.println(new String(buffer));
    in.close();
  }
} catch (IOException e) {
  e.printStackTrace();
}
=============================================================
However, a NullPointerException is thrown when harfs.listStatus(dir) is
executed. I suppose this means that 'dir' allegedly does not exist, as
stated in the Javadoc for HarFileSystem.listStatus(): "returns null, if
Path f does not exist in the FileSystem."

I've tried numerous variations, such as omitting the path within the HAR
archive, but the HAR archive apparently still cannot be read. With the
same configuration, I am able to read the plain HDFS filesystem using
the FileSystem class, though.
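One thing I wonder about is whether constructing the HarFileSystem is
enough on its own, or whether initialize() has to be called with the
archive's har: URI first, as with other FileSystem implementations, so
that the archive's index files actually get loaded. Below is a sketch of
what I mean; it is untested speculation on my part, and the URI and path
are just my test values:

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.HarFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HarRead {
  public static void main(String[] args)
      throws IOException, URISyntaxException {
    Configuration conf = new Configuration();
    conf.addResource(
        new Path("/home/schnober/hadoop-1.0.3/conf/core-site.xml"));

    // Wrap the default (HDFS) filesystem, then initialize the HAR layer
    // with the archive's har: URI so its index files can be read.
    FileSystem hdfs = FileSystem.get(conf);
    HarFileSystem harfs = new HarFileSystem(hdfs);
    harfs.initialize(new URI("har:///WPD.har"), conf);

    // List the archive directory and dump each regular file to stdout.
    for (FileStatus file : harfs.listStatus(new Path("har:///WPD.har/00001"))) {
      if (file.isDir())
        continue;
      FSDataInputStream in = harfs.open(file.getPath());
      try {
        IOUtils.copyBytes(in, System.out, conf, false);
      } finally {
        in.close();
      }
    }
  }
}
```

If that is the right direction, then FileSystem.get(new
URI("har:///WPD.har"), conf) might even return a ready-initialized
HarFileSystem directly, assuming the har: scheme is mapped to
HarFileSystem in the default configuration; but again, that is a guess.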

I assume that I'm just not aware of how to use the HarFileSystem class
correctly, but I haven't been able to find more detailed explanations or
examples; maybe a pointer to some sample code would already help me.
Thank you very much!
Carsten

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | [EMAIL PROTECTED]
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform