HDFS >> mail # user >> Read hflushed data without reopen file


Read hflushed data without reopen file
Hi users,

Suppose a client A opens /f and keeps appending data, hflushing after each
write. Another client B opens the same file for reading. I found that B can
only see a snapshot of the data as of the moment he opened the file. (After
B's open, A may continue to write more data; B cannot see it unless he
reopens the file.)
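To make the snapshot semantics concrete, here is a minimal local-filesystem analogy (plain Java, not the HDFS client itself; the class names are hypothetical): the reader caches the file length at open time, the way DFSInputStream does, and reports EOF past that cached length even though the writer has appended more data.

```java
import java.io.*;
import java.nio.file.*;

// Hypothetical analogy of DFSInputStream's behaviour: the file length is
// cached once, at open time, and never refreshed during reads.
class SnapshotReader {
    private final Path path;
    private final long lengthAtOpen;  // snapshot of the length, never updated
    private long pos = 0;

    SnapshotReader(Path path) throws IOException {
        this.path = path;
        this.lengthAtOpen = Files.size(path);  // cached at open
    }

    /** Returns the next byte, or -1 (EOF) once past the cached length. */
    int read() throws IOException {
        if (pos >= lengthAtOpen) return -1;    // EOF at the *cached* length
        // Re-reading the whole file per byte is just for demo simplicity.
        byte[] all = Files.readAllBytes(path);
        return all[(int) pos++] & 0xFF;
    }
}

public class SnapshotDemo {
    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("hflush-demo", ".log");
        Files.write(f, "abc".getBytes());          // client A writes and flushes

        SnapshotReader b = new SnapshotReader(f);  // client B opens the file

        Files.write(f, "def".getBytes(),           // A appends more data
                    StandardOpenOption.APPEND);

        int n = 0;
        while (b.read() != -1) n++;
        System.out.println("B saw " + n + " bytes");   // only the snapshot: 3

        SnapshotReader b2 = new SnapshotReader(f);     // B has to reopen...
        n = 0;
        while (b2.read() != -1) n++;
        System.out.println("after reopen: " + n);      // ...to see all 6
    }
}
```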

Looking into the code, I think this is because DFSInputStream maintains a
cached file length and simply reports EOF when we read beyond it. The cached
length is updated, and thus the client has a chance to see a longer file,
only when:
1) the file is opened
2) there are no live DNs to read from (correct? not very sure.)

I think such behaviour is inconsistent: clients may see a sudden change of
file length. I guess a better behaviour would be to always try to read
beyond the known file length on the client side and let the DN return EOF
if there is no more data. In this way, client B could continue to see what
A wrote and hflushed.
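The proposed behaviour could be sketched like this (again a hypothetical local-filesystem analogy, not real HDFS code): instead of trusting the length cached at open time, the reader re-checks the actual length whenever it would hit EOF, and only reports EOF when there really is no more data right now.

```java
import java.io.*;
import java.nio.file.*;

// Hypothetical sketch of the proposed behaviour: the reader never trusts a
// cached length; EOF means "no more data exists right now", so data a
// writer appends later becomes visible without reopening.
public class RefreshingReader {
    private final Path path;
    private long pos = 0;

    public RefreshingReader(Path path) {
        this.path = path;
    }

    /** Returns the next byte, or -1 only if the file has no more data now. */
    public int read() throws IOException {
        byte[] all = Files.readAllBytes(path);  // re-check the real length
        if (pos >= all.length) return -1;       // genuine EOF at this moment
        return all[(int) pos++] & 0xFF;
    }
}
```

With this reader, client B keeps seeing bytes that A appends after B's open, without any reopen.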

A real use case for this is HBase log replication. In the region server,
there is a background thread that keeps polling for new HLog entries, and it
has to reopen the file every second. This may put pressure on the NN as the
number of region servers grows.
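The reopen-per-poll pattern the region server is forced into looks roughly like this (a hypothetical plain-Java sketch, with a local file standing in for the HLog; in real HDFS each "reopen" costs an NN RPC, once per second per region server):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch of the tailing pattern described above: because an
// open stream only exposes data up to the length cached at open time, the
// poller must reopen the log on every pass to observe newly hflushed
// entries.
public class LogTailer {
    private long consumed = 0;  // bytes already shipped to the peer cluster

    /** One polling pass: reopen the log and return any new bytes. */
    public byte[] poll(Path log) throws IOException {
        byte[] all = Files.readAllBytes(log);  // "reopen" => fresh length
        if (all.length <= consumed) return new byte[0];  // nothing new
        byte[] fresh = Arrays.copyOfRange(all, (int) consumed, all.length);
        consumed = all.length;
        return fresh;
    }
}
```

Each call to poll() corresponds to one reopen; with the behaviour proposed above, a single long-lived stream could replace all of these opens.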

Please correct me if there is anything wrong.

Thanks,
Chao