

Chao Shi 2013-12-27, 11:34
Yanbo Liang 2013-12-27, 15:54
Re: Read hflushed data without reopen file
Hi Yanbo,

> When HBase writes edits to its log, additions are appended to the end of the
> WAL file rather than reopening the HDFS file every second.

I mean the HBase replication procedure. After edits are appended to the HLog, I
think a background thread periodically polls the HLog for new edits and syncs
them to the slave clusters.

2013/12/27 Yanbo Liang <[EMAIL PROTECTED]>

> Hi Chao,
> As far as I know, if client B opens a file that is under construction,
> the DFSInputStream gets a LocatedBlocks object, which contains a member
> variable called "underConstruction" that marks the file as under
> construction.
> If the file is reopened, the client will get a different length. I think
> this makes sense, because the file is no longer the old one but one with
> newly appended data.
>
> When HBase writes edits to its log, additions are appended to the end of the
> WAL file rather than reopening the HDFS file every second.
>
>
> 2013/12/27 Chao Shi <[EMAIL PROTECTED]>
>
>> Hi users,
>>
>> Suppose client A opens /f and keeps appending data and then hflushing it.
>> Another client B opens this file for reading. I found that B can only see the
>> snapshot of the data as of the time it opened the file. (After B opens the
>> file, A may continue to write more data, but B cannot see it unless it reopens
>> the file.)
>>
>> Looking into the code, I think this is because DFSInputStream maintains a
>> file length and simply reports EOF when we read beyond that length. The
>> file length is updated, and thus the client has a chance to see a longer file,
>> when:
>> 1) the file is opened
>> 2) there are no live DNs to read from (is this correct? I am not sure.)
>>
>> I think this behaviour is inconsistent: clients may see a sudden change
>> in file length. I guess a better behaviour would be to always try to read
>> beyond the file length known at the client side and let the DN return EOF if
>> there is no more data. That way, client B can continue to see what A wrote
>> and hflushed.
>>
>> A real use case for this is HBase log replication. In the region server,
>> a background thread keeps polling for new HLog entries. It has to reopen
>> the file every second. This may put pressure on the NN as the number of
>> region servers grows.
>>
>> Please correct me if there is anything wrong.
>>
>> Thanks,
>> Chao
>>
>
>
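
The behaviour discussed in this thread can be illustrated with a small in-memory model. This is only a sketch of the idea, not HDFS client code: GrowingFile, ToyReader and TailDemo are hypothetical names made up for the example, and the "open" counter merely stands in for the NameNode round trip that a real open would cost. A reader in snapshot mode fixes the file length when it opens, as Chao describes for DFSInputStream, while a reader in live mode re-checks the length on every read, roughly in the spirit of the proposed "read beyond the known length" behaviour.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy stand-in for a file that writer A keeps appending to and flushing. */
final class GrowingFile {
    private byte[] data = new byte[0];
    private int opens = 0; // each open stands in for one NameNode round trip

    /** Appends text and makes it visible at once (plays the role of write + hflush). */
    synchronized void appendAndFlush(String s) {
        byte[] more = s.getBytes(StandardCharsets.UTF_8);
        byte[] next = Arrays.copyOf(data, data.length + more.length);
        System.arraycopy(more, 0, next, data.length, more.length);
        data = next;
    }

    /** Records an open and returns the length visible at that moment. */
    synchronized int open() { opens++; return data.length; }

    synchronized int currentLength() { return data.length; }
    synchronized int byteAt(int pos) { return data[pos] & 0xff; }
    synchronized int openCount() { return opens; }
}

/** Reader B: snapshot mode fixes the length at open time, live mode re-checks it on every read. */
final class ToyReader {
    private final GrowingFile file;
    private final int lengthAtOpen;
    private final boolean live;
    private int pos;

    ToyReader(GrowingFile file, int startPos, boolean live) {
        this.file = file;
        this.lengthAtOpen = file.open();
        this.live = live;
        this.pos = startPos;
    }

    /** Returns the next byte, or -1 when the reader believes it has reached end of file. */
    int read() {
        int limit = live ? file.currentLength() : lengthAtOpen;
        return pos < limit ? file.byteAt(pos++) : -1;
    }

    /** Drains everything currently readable and returns it as text. */
    String readAvailable() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int b = read(); b != -1; b = read()) out.write(b);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    int position() { return pos; }
}

final class TailDemo {
    public static void main(String[] args) {
        GrowingFile f = new GrowingFile();
        f.appendAndFlush("abc");
        ToyReader snapshot = new ToyReader(f, 0, false); // B opens and sees length 3
        ToyReader live = new ToyReader(f, 0, true);
        f.appendAndFlush("def");                          // A keeps writing

        System.out.println(snapshot.readAvailable());     // prints "abc"
        ToyReader reopened = new ToyReader(f, snapshot.position(), false);
        System.out.println(reopened.readAvailable());     // prints "def"
        System.out.println(live.readAvailable());         // prints "abcdef"
        System.out.println(f.openCount());                // prints 3
    }
}
```

In this model the snapshot reader needs one extra open for every poll that should pick up new data, which is the cost the HBase replication use case above would pay once per second per region server, whereas the live reader keeps reading through a single open.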