HDFS user mailing list

Chao Shi 2013-12-27, 11:34
Re: Read hflushed data without reopen file
Hi Yanbo,

> Writing the edit log in HBase, additions are appended to the end of the
> WAL file rather than reopening the HDFS file every second.

I mean the HBase replication procedure. After edits are appended to the HLog,
I think there is a background thread that periodically polls for new edits
from the HLog and syncs them to slave clusters.
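
[Editor's note: a rough sketch of the tailing pattern described above, using only
the public FileSystem API with a made-up HLog path and polling interval; this is
not HBase's actual ReplicationSource code. Because an open DFSInputStream only
sees data up to the length it learned at open time, the tailer has to reopen the
file on every poll to pick up newly hflushed edits, and each reopen is another
round trip to the NameNode.]

import java.io.EOFException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HLogTailer {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path log = new Path("/hbase/.logs/rs1/hlog.1388100000");  // hypothetical HLog path
    long offset = 0;                                          // bytes already shipped
    byte[] buf = new byte[8192];
    while (true) {
      // Reopen on every poll: the stream's notion of the file length is fixed
      // at open time, so reopening is the only way to see edits hflushed since
      // the last poll. Each reopen also contacts the NameNode, which is the
      // load concern as the number of region servers grows.
      try (FSDataInputStream in = fs.open(log)) {
        in.seek(offset);
        int n;
        while ((n = in.read(buf)) > 0) {
          offset += n;
          // ship buf[0..n) to the slave cluster here
        }
      } catch (EOFException ignored) {
        // nothing new visible yet
      }
      Thread.sleep(1000);                                     // poll roughly every second
    }
  }
}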
2013/12/27 Yanbo Liang <[EMAIL PROTECTED]>

> Hi Chao,
> As far as I know, if client B opens a file which is under construction,
> the DFSInputStream will get the LocatedBlocks object, which contains a
> member variable called "underConstruction" that marks the file as under
> construction.
> If the file is reopened, the client will get a different length. I think
> this makes sense because the file is no longer the old one but one with
> newly appended data.
>
> Writing the edit log in HBase, additions are appended to the end of the
> WAL file rather than reopening the HDFS file every second.
>
>
> 2013/12/27 Chao Shi <[EMAIL PROTECTED]>
>
>> Hi users,
>>
>> Suppose a client A opens /f and keeps appending data and then hflushing.
>> Another client B opens this file for reading. I found that B can only see
>> the snapshot of the data as of the time he opened the file. (After B's
>> open, A may continue to write more data; B cannot see it unless he reopens
>> the file.)
>>
>> Looking into the code, I think this is because DFSInputStream maintains a
>> file length and simply reports EOF when we read beyond that length. The
>> file length is updated, and thus the client has a chance to see a longer
>> file, when:
>> 1) the file is opened
>> 2) there are no live DNs to read from (correct? not very sure.)
>>
>> I think such behaviour is inconsistent: clients may see a sudden change
>> of file length. I guess a better behaviour would be to always try to read
>> beyond the known file length on the client side and let the DN return EOF
>> if there is no more data. In this way, client B could continue to see what
>> A wrote and hflushed.
>>
>> A real use case for this is HBase log replication. In the region server,
>> there is a background thread that keeps polling for new HLog entries. It
>> has to reopen the file every second, which may put pressure on the NN as
>> the number of region servers grows.
>>
>> Please correct me if there is anything wrong.
>>
>> Thanks,
>> Chao
>>
>
>
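
[Editor's note: the behaviour Chao reports in his original message can be
reproduced with a small test along these lines. This is a single-process sketch
with an assumed test path; hflush() is the Syncable call on FSDataOutputStream
available in Hadoop 2.x. A reader opened before the second hflush stops at the
length it knew at open time, and only a reopen exposes the later bytes.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushVisibility {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path f = new Path("/tmp/hflush-test");            // hypothetical test file

    FSDataOutputStream out = fs.create(f);            // "client A"
    out.write("first".getBytes("UTF-8"));
    out.hflush();                                     // visible to readers that open now

    FSDataInputStream in = fs.open(f);                // "client B" opens here

    out.write("second".getBytes("UTF-8"));
    out.hflush();                                     // hflushed after B's open

    byte[] buf = new byte[64];
    int n = in.read(buf);                             // only "first" (5 bytes): the stream
                                                      // hits EOF at the length known at open
    System.out.println("first open saw " + n + " bytes");
    in.close();

    FSDataInputStream in2 = fs.open(f);               // reopen: length is refreshed
    int total = 0, r;
    while ((r = in2.read(buf)) > 0) total += r;
    System.out.println("after reopen saw " + total + " bytes");
    in2.close();
    out.close();
  }
}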