Re: FSDataInputStream.read returns -1 with growing file and never continues reading

Thank you, Harsh. I appreciate it.

2012/12/20 Harsh J <[EMAIL PROTECTED]>

> Hi Christoph,
>
> If you use sync/hflush/hsync, the new length of the data is only seen by
> a new reader, not an existing reader. The "workaround" you've done is
> exactly how we've implemented the "fs -tail <file>" utility. See the code
> for that at
> http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/Tail.java?view=markup
> (note the looping at ~line 74).
>
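A minimal sketch of the reopen-and-seek loop described above, for
illustration only (this is not the actual Tail.java source; the buffer
size and poll interval are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TailLoop {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path(args[0]);
            long offset = 0;                   // bytes consumed so far
            byte[] buf = new byte[4096];
            while (true) {
                // Only a freshly opened stream sees the length that the
                // writer's hflush()/hsync() made visible; the old stream
                // keeps returning -1.
                FSDataInputStream in = fs.open(path);
                in.seek(offset);               // resume where the last pass stopped
                int n;
                while ((n = in.read(buf)) > 0) {
                    System.out.write(buf, 0, n);
                    offset += n;
                }
                in.close();                    // -1 here means "no more visible data"
                Thread.sleep(5000);            // poll interval (an assumption)
            }
        }
    }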
> On Thu, Dec 20, 2012 at 5:51 PM, Christoph Rupp <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I am experiencing an unexpected situation where FSDataInputStream.read()
> > returns -1 while reading data from a file that another process is still
> > appending to. According to the documentation, read() should never return
> > -1 but throw exceptions on errors. In addition, there is more data
> > available, so read() definitely should not fail.
> >
> > The problem gets worse because the FSDataInputStream is not able to
> > recover from this. Once it returns -1, it will always return -1, even
> > if the file continues growing.
> >
> > If, at the same time, other Java processes read other HDFS files, they
> > will also return -1 immediately after opening the file. It smells like
> > this error gets propagated to other client processes as well.
> >
> > I found a workaround: close the FSDataInputStream, open it again, and
> > seek to the previous position. Then reading works fine.
> >
> > Another problem that I have seen is that the FSDataInputStream returns
> > -1 when reaching EOF. It will never return 0 (which I would expect when
> > reaching EOF).
> >
> > I use CDH 4.1.2, but I also saw this with CDH 3u5. I have attached
> > samples to reproduce this.
> >
> > My cluster consists of 4 machines: 1 namenode and 3 datanodes. I run my
> > tests on the namenode machine. There are no other HDFS users, and the
> > load generated by my tests is fairly low, I would say.
> >
> > One process writes to 6 files simultaneously, but with a 5-second sleep
> > between each write. It uses an FSDataOutputStream, and after writing
> > data it calls sync(). Each write() appends 8 MB; it stops when the file
> > grows to 100 MB.
> >
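A minimal sketch of such a writer for a single file, reconstructed from
the description above (not the attached sources; the file name argument
and buffer contents are placeholders, and hflush() is the CDH4 name for
the older sync()):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriter {
        private static final int CHUNK = 8 * 1024 * 1024;     // 8 MB per write()
        private static final long LIMIT = 100L * 1024 * 1024; // stop at 100 MB

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataOutputStream out = fs.create(new Path(args[0]));
            byte[] chunk = new byte[CHUNK];
            for (long written = 0; written < LIMIT; written += CHUNK) {
                out.write(chunk);
                out.hflush();       // flush so that (new) readers can see the data
                Thread.sleep(5000); // 5-second sleep between writes
            }
            out.close();
        }
    }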
> > Six processes read the files; each process reads one file. At first,
> > each reader loops until the file exists. Once it does, it opens the
> > FSDataInputStream and starts reading. Usually the first process returns
> > the first 8 MB of the file before it starts returning -1, but the other
> > processes immediately return -1 without reading any data. I start the 6
> > reader processes before I start the writer.
> >
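The reader side might look roughly like this (a reconstruction, not the
attached HdfsReader.java; the polling interval and buffer size are
assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReader {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path(args[0]);
            while (!fs.exists(path)) {   // loop until the writer creates the file
                Thread.sleep(1000);
            }
            FSDataInputStream in = fs.open(path);
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                // consume buf[0..n); once read() returns -1 it keeps
                // returning -1 even though the file is still growing,
                // unless the WORKAROUND (close, reopen, seek) is enabled.
            }
            in.close();
        }
    }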
> > Search HdfsReader.java for "WORKAROUND" and remove the comments; this
> > will reopen the FSDataInputStream after -1 is returned, and then
> > everything works.
> >
> > Sources are attached.
> >
> > This is a very basic scenario, and I wonder if I'm doing something wrong
> > or if I found an HDFS bug.
> >
> > bye
> > Christoph
> >
>
>
>
> --
> Harsh J
>