HDFS >> mail # user >> Read hflushed data without reopen file
Hi users,

Suppose a client A opens /f and keeps appending data, hflushing after each
write. Another client B opens the file for read. I found that B can only see
a snapshot of the data as of the time it opened the file. (After B opens the
file, A may continue to write more data; B cannot see it unless it reopens.)

Looking into the code, I think this is because DFSInputStream caches the
file length and simply reports EOF when we read beyond it. The cached length
is refreshed, so the client has a chance to see a longer file, only when:
1) the file is opened, or
2) there are no live DNs to read from (correct? not very sure).
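To make the behaviour concrete, here is a toy model in plain Java (class names GrowingFile and SnapshotReader are mine, not HDFS code): the reader captures the file length at open time, so data appended afterwards is invisible until the reader is reopened, which is what I observe with DFSInputStream.

```java
import java.io.ByteArrayOutputStream;

// Toy stand-in for a growing HDFS file: the writer keeps appending
// (in real HDFS, appending + hflushing).
class GrowingFile {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();

    synchronized void append(byte[] b) { data.write(b, 0, b.length); }
    synchronized int length() { return data.size(); }
    synchronized byte at(int i) { return data.toByteArray()[i]; }
}

// Reader that caches the file length when opened, mimicking what I think
// DFSInputStream does.
class SnapshotReader {
    private final GrowingFile file;
    private final int cachedLength; // fixed at open time
    private int pos = 0;

    SnapshotReader(GrowingFile file) {
        this.file = file;
        this.cachedLength = file.length(); // length captured here, once
    }

    // Returns the next byte, or -1 (EOF) once we pass the cached length,
    // even if the writer has appended more data since we opened.
    int read() {
        if (pos >= cachedLength) return -1;
        return file.at(pos++);
    }
}

public class Main {
    public static void main(String[] args) {
        GrowingFile f = new GrowingFile();
        f.append("hello".getBytes());

        SnapshotReader b = new SnapshotReader(f); // client B opens
        f.append(" world".getBytes());            // client A keeps writing

        int n = 0;
        while (b.read() != -1) n++;
        System.out.println(n); // 5: B only sees the snapshot, not " world"

        SnapshotReader b2 = new SnapshotReader(f); // B reopens
        n = 0;
        while (b2.read() != -1) n++;
        System.out.println(n); // 11: only after reopening does B see it all
    }
}
```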

I think such behaviour is inconsistent: clients may see a sudden change of
file length. I guess a better behaviour would be to always try to read
beyond the known file length on the client side and let the DN return EOF
if there is no more data. That way, client B could keep seeing what A has
written and hflushed.
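The behaviour I am proposing, in the same toy model (again hypothetical names, not HDFS code): the reader caches no length, but consults the live length on every read, so EOF only means "no more data right now", and the same open stream eventually sees everything the writer flushed.

```java
import java.io.ByteArrayOutputStream;

// Toy growing file, as before: the writer keeps appending.
class GrowingFile {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    synchronized void append(byte[] b) { data.write(b, 0, b.length); }
    synchronized int length() { return data.size(); }
    synchronized byte at(int i) { return data.toByteArray()[i]; }
}

// Proposed behaviour: check the live length on every read, so EOF is
// reported only when there genuinely is no more data at this moment.
class LiveReader {
    private final GrowingFile file;
    private int pos = 0;

    LiveReader(GrowingFile file) { this.file = file; }

    int read() {
        if (pos >= file.length()) return -1; // "DN says: no more data (yet)"
        return file.at(pos++);
    }
}

public class Main {
    public static void main(String[] args) {
        GrowingFile f = new GrowingFile();
        f.append("hello".getBytes());

        LiveReader b = new LiveReader(f); // B opens once
        int n = 0;
        while (b.read() != -1) n++;       // drains the 5 current bytes
        f.append(" world".getBytes());    // A writes and hflushes more
        while (b.read() != -1) n++;       // same stream sees the new bytes
        System.out.println(n);            // 11, without ever reopening
    }
}
```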

A real use case for this is HBase log replication. In the region server, a
background thread keeps polling for new HLog entries, and it has to reopen
the file every second. This may put pressure on the NN as the number of
region servers grows.

Please correct me if there is anything wrong.

Thanks,
Chao