HDFS >> mail # dev >> Partially written SequenceFile

Partially written SequenceFile


I am working with the 2.0.2-alpha version of Hadoop. I am writing key-value pairs to a SequenceFile on HDFS. I regularly flush my data using hsync() because the process writing to the file can terminate abruptly. My requirement is that once hsync() succeeds, the data written before that hsync() should still be available.

To verify this, I ran a test that killed the writing process right after it called hsync(). When I read the data back using the "hadoop fs -cat" command, I can see it, but the reported size of the file is 0, and SequenceFile.Reader.next(key, value) returns false. I have read that because the file was not closed properly, its length was never updated at the namenode, and that for the same reason next() returns false.

To fix this and make the file readable through the SequenceFile APIs, I open the file in append mode and then immediately close the stream; this corrects the file size. While doing so, I retry if I receive a RecoveryInProgressException or AlreadyBeingCreatedException. After that, I can successfully read the data using SequenceFile.Reader. The code I am using follows.

writer = SequenceFile.createWriter(fs, conf, path, Text.class, Text.class,
    CompressionType.NONE);
writer.append(new Text("India"), new Text("Delhi"));
writer.append(new Text("China"), new Text("Beijing"));
writer.hsync(); // flush so the records survive an abrupt kill of this process

*** I expect India and China to be available, but next() returns false ***

*** Code to fix the file size ***

while (true) {
    try {
        FileSystem fs = FileSystem.get(namenodeURI, conf);
        Path path = new Path(uri);
        FSDataOutputStream out = fs.append(path);
        out.close(); // closing the appended stream updates the length at the namenode
        break;
    } catch (RecoveryInProgressException e) {
        // lease recovery from the killed writer is still in progress; retry
    } catch (AlreadyBeingCreatedException e) {
        // the previous writer's lease has not yet been released; retry
    } catch (IOException e) {
        throw e; // any other failure is fatal
    }
}
Would it be possible for you to let me know if this approach has any shortcomings or if there are any other better alternatives available?

Hemant Bhanawat