fs cache giving me headaches
nothing has confused me as much in hadoop as FileSystem.close().
any decent java programmer that sees that an object implements Closeable
writes code like this:
final FileSystem fs = FileSystem.get(conf);
try {
    // do something with fs
} finally {
    fs.close();
}

so i started out using hadoop FileSystem like this, and i ran into all
sorts of weird errors where FileSystems in unrelated code (sometimes not
even my code) started misbehaving and streams were unexpectedly shut. Then
i realized that FileSystem uses a cache and close() closes it for everyone!
Not pretty in my opinion, but i can live with it. So i checked other code
and found that basically nobody closes FileSystems. Apparently the expected
way of using FileSystems is to simply never close them. So i adopted this
approach (which i think is really contrary to java conventions for a
Closeable).
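
The shared-instance behavior described above can be reproduced with a toy model. This is a minimal sketch in plain Java, not Hadoop code: ToyFileSystem and ToyFileSystemCache are hypothetical stand-ins, and the only assumption carried over from the post is that a cached lookup hands the same object to every caller.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a closeable resource; NOT the Hadoop FileSystem class.
class ToyFileSystem implements AutoCloseable {
    private boolean open = true;

    boolean isOpen() {
        return open;
    }

    @Override
    public void close() {
        open = false;
    }
}

// Hypothetical cache: the same key always yields the same shared instance.
class ToyFileSystemCache {
    private final Map<String, ToyFileSystem> cache = new HashMap<>();

    synchronized ToyFileSystem get(String key) {
        return cache.computeIfAbsent(key, k -> new ToyFileSystem());
    }
}

class SharedCloseDemo {
    // Returns true if closing one caller's reference also closed the other caller's.
    static boolean closingOneClosesTheOther() {
        ToyFileSystemCache cache = new ToyFileSystemCache();
        ToyFileSystem mine = cache.get("some-key");
        ToyFileSystem theirs = cache.get("some-key"); // same object as "mine"
        mine.close();
        return !theirs.isOpen();
    }
}
```

Both callers hold the same object, so the try/finally idiom in one place silently breaks the other.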

Lately i started working on some code for a daemon/server where many
FileSystem objects are created for different users (UGIs) that use the
service. As it turns out other projects have run into trouble with the
FileSystem cache in situations like this (for example, Scribe and Hoop). I
imagine the cache can get very large and cause problems (i have not tested
this myself).

Looking at the code for Hoop i noticed they simply turned off the
FileSystem cache and made sure to close every FileSystem. So here the
suggested approach to deal with FileSystems seems to be:
// or FileSystem.get(conf) but with caching turned off in the conf
final FileSystem fs = FileSystem.newInstance(conf);
try {
    // do something with fs
} finally {
    fs.close();
}

This code bypasses the cache if i understand it correctly, avoiding any
cache size limitations. However if i adopt this approach i basically can
not re-use any existing code or libraries that do not close FileSystems,
splitting the codebase into two which is pretty ugly. And this code is not
efficient in situations where there are very few used FileSystem objects
and a cache would improve performance, so the split works both ways.

In short, there is no single way to code with FileSystem that works in both
situations! Ideally i would have liked fs.close() to do the right thing
depending on the settings: if cache is turned off it closes the FileSystem,
and if it is turned on it's a no-op. That way i could always use
FileSystem.get(conf) and always close my filesystems, and the code would be
usable irrespective of whether the cache is turned on or off.
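
The close() semantics wished for above can be sketched as follows. This is a toy model in plain Java under stated assumptions, not a Hadoop API: ToyResource, ToyProvider and ToyHandle are hypothetical names, and the cacheEnabled flag stands in for whatever setting controls caching.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical underlying resource; NOT the Hadoop FileSystem class.
class ToyResource implements AutoCloseable {
    private boolean open = true;

    boolean isOpen() {
        return open;
    }

    @Override
    public void close() {
        open = false;
    }
}

// Hypothetical handle: close() only closes the resource if this handle owns it.
class ToyHandle implements AutoCloseable {
    private final ToyResource resource;
    private final boolean ownsResource;

    ToyHandle(ToyResource resource, boolean ownsResource) {
        this.resource = resource;
        this.ownsResource = ownsResource;
    }

    ToyResource resource() {
        return resource;
    }

    @Override
    public void close() {
        if (ownsResource) {
            resource.close(); // uncached: really close
        }
        // cached: no-op, the shared resource stays open for other users
    }
}

// Hypothetical provider: hands out shared or private resources depending on the flag.
class ToyProvider {
    private final boolean cacheEnabled;
    private final Map<String, ToyResource> cache = new HashMap<>();

    ToyProvider(boolean cacheEnabled) {
        this.cacheEnabled = cacheEnabled;
    }

    synchronized ToyHandle get(String key) {
        if (cacheEnabled) {
            ToyResource shared = cache.computeIfAbsent(key, k -> new ToyResource());
            return new ToyHandle(shared, false);
        }
        return new ToyHandle(new ToyResource(), true);
    }
}
```

With this shape, calling code can always use try/finally (or try-with-resources) and close what it got, whether or not the cache is on.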

Any insights or suggestions? Thanks!