Re: Writing click stream data to Hadoop
SequenceFile.Writer#syncFs is in Hadoop 1.0.0 (actually since 0.20.205); it
calls the underlying FSDataOutputStream#sync, which is semantically hflush
(the data is not durable in the event of a datacenter-wide power outage). An
hsync implementation is not yet in 2.0; HDFS-744 only just brought hsync
into trunk.
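
For what it's worth, a minimal sketch of that 1.0-era API; the path,
key/value types, and payload below are illustrative, not from this thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SyncFsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/clicks/current.seq"),
            LongWritable.class, Text.class);
        writer.append(new LongWritable(System.currentTimeMillis()),
                      new Text("{\"url\":\"/home\"}"));
        // syncFs() (0.20.205+/1.0) delegates to FSDataOutputStream#sync,
        // i.e. hflush semantics: new readers can see the data, but it is
        // not guaranteed durable across a datacenter-wide power failure.
        writer.syncFs();
        writer.close();
      }
    }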

--Luke

On Fri, May 25, 2012 at 9:30 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> Mohit,
>
> Not if you call sync (or hflush/hsync in 2.0) periodically to persist
> your changes to the file. SequenceFile doesn't currently have a sync API
> built into it (in 1.0 at least), but you can call sync on the underlying
> output stream instead for the moment. This is possible in 1.0 (just own
> the output stream).
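
A sketch of what "own the output stream" looks like in 1.0; again, the
path, types, and payload are illustrative, and compression is left off
for brevity:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;

    public class OwnStreamExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Open the stream ourselves so we keep a handle to call sync on.
        FSDataOutputStream out = fs.create(new Path("/clicks/current.seq"));
        SequenceFile.Writer writer = SequenceFile.createWriter(
            conf, out, LongWritable.class, Text.class,
            CompressionType.NONE, null);
        writer.append(new LongWritable(System.currentTimeMillis()),
                      new Text("{\"url\":\"/home\"}"));
        out.sync();      // hflush semantics in 1.0, per Luke's note above
        writer.close();  // the writer does not own a stream passed in,
        out.close();     // so close the stream explicitly as well
      }
    }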
>
> Your use case also sounds like you may simply want to use Apache Flume
> (Incubating) [http://incubator.apache.org/flume/], which already provides
> these features and the WAL-style reliability you seek.
>
> On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>> We get click data through API calls. I now need to send this data to our
>> Hadoop environment. I am wondering if I could open one sequence file and
>> write to it until it reaches a certain size, then close that file and
>> open a new one. Is this a good approach?
>>
>> The only thing I worry about is what happens if the server crashes before
>> I am able to cleanly close the file. Would I lose all previous data?
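
Putting the replies above together, one minimal sketch of that roll-by-size
approach, assuming Hadoop 1.0 APIs; the class name, path scheme, and 64 MB
threshold are illustrative, and syncing on every append is shown for
brevity where a periodic sync would usually do:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class RollingClickWriter {
      private static final long ROLL_BYTES = 64L * 1024 * 1024;

      private final Configuration conf;
      private final FileSystem fs;
      private SequenceFile.Writer writer;

      public RollingClickWriter(Configuration conf) throws IOException {
        this.conf = conf;
        this.fs = FileSystem.get(conf);
        roll();
      }

      private void roll() throws IOException {
        if (writer != null) {
          writer.close();  // a cleanly closed file is safe on HDFS
        }
        Path path = new Path("/clicks/clicks-"
            + System.currentTimeMillis() + ".seq");
        writer = SequenceFile.createWriter(fs, conf, path,
            LongWritable.class, Text.class);
      }

      public synchronized void write(long ts, String clickJson)
          throws IOException {
        writer.append(new LongWritable(ts), new Text(clickJson));
        writer.syncFs();  // bounds loss on a crash to un-synced appends
        if (writer.getLength() >= ROLL_BYTES) {
          roll();  // close the full file and start a new one
        }
      }
    }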
>
>
>
> --
> Harsh J