-
Writing click stream data to hadoop (Mohit Anchlia, 2012-05-25, 14:54)
We get click data through API calls. I now need to send this data to our Hadoop environment. I am wondering whether I could open one sequence file and write to it until it reaches a certain size. Once it is over the specified size, I can close that file and open a new one. Is this a good approach?
The only thing I worry about is what happens if the server crashes before I am able to cleanly close the file. Would I lose all previous data?
-
Re: Writing click stream data to hadoop (Harsh J, 2012-05-25, 16:30)
Mohit,
Not if you call sync (or hflush/hsync in 2.0) periodically to persist your changes to the file. SequenceFile doesn't currently have a built-in sync API (in 1.0 at least), but for now you can call sync on the underlying output stream instead. This is possible in 1.0 as long as you own the output stream.
Your use case also sounds like you may simply want to use Apache Flume (Incubating) [http://incubator.apache.org/flume/], which already provides these features and the WAL-like reliability you seek.
-- Harsh J
-
Re: Writing click stream data to hadoop (Mohit Anchlia, 2012-05-30, 14:56)
On Fri, May 25, 2012 at 9:30 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> Your use case also sounds like you may want to simply use Apache Flume
> (Incubating) [http://incubator.apache.org/flume/] that already does
> provide these features and the WAL-kinda reliability you seek.
Thanks Harsh. Does Flume also provide an API on top? I am getting this data as HTTP calls; how would I go about using Flume with HTTP calls?
-
Re: Writing click stream data to hadoop (alo alt, 2012-05-30, 15:09)
I cc'd [EMAIL PROTECTED] because I don't know if Mohit is subscribed there.
Mohit, you could use Avro to serialize the data and send it to a Flume Avro source. Or you could use syslog; both are supported in Flume 1.x. http://archive.cloudera.com/cdh/3/flume-ng-1.1.0-cdh3u4/FlumeUserGuide.html
An exec source is also possible. Please note that Flume will only start and use the command you configured; it does not take control of the whole process.
- Alex
-- Alexander Alten-Lorenz http://mapredit.blogspot.com German Hadoop LinkedIn Group: http://goo.gl/N8pCF
-
Re: Writing click stream data to hadoop (Luke Lu, 2012-05-31, 01:24)
SequenceFile.Writer#syncFs is in Hadoop 1.0.0 (actually since 0.20.205). It calls the underlying FSDataOutputStream#sync, which semantically is hflush (data is not durable in the case of a data-center-wide power outage). An hsync implementation is not yet in 2.0; HDFS-744 just brought hsync into trunk.
__Luke
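For readers following along, the pattern discussed in this thread (append records, persist periodically, roll to a new file past a size threshold) might look roughly like the sketch below. It is only an illustration on the local filesystem using plain java.io, not Hadoop code: the class name RollingClickWriter, the parameters maxBytes and flushEvery, and the clicks-N.log file naming are all hypothetical. In a Hadoop 1.0 setting, the write call would presumably be replaced by SequenceFile.Writer#append and the persist step by the SequenceFile.Writer#syncFs call named above. Note that the local FileDescriptor#sync used here is a stronger guarantee than hflush, so the analogy is loose.

```java
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

/** Hypothetical helper: local-file stand-in for a size-rolled, periodically synced writer. */
final class RollingClickWriter implements Closeable {
    private final Path dir;        // directory that receives clicks-N.log files
    private final long maxBytes;   // roll to a new file once this many bytes are written
    private final int flushEvery;  // persist after this many appended records
    private FileOutputStream file;
    private BufferedOutputStream out;
    private long bytesInFile;
    private int sinceFlush;
    private int fileIndex;

    RollingClickWriter(Path dir, long maxBytes, int flushEvery) throws IOException {
        this.dir = dir;
        this.maxBytes = maxBytes;
        this.flushEvery = flushEvery;
        openNext();
    }

    /** Opens the next file in the sequence and resets the per-file counters. */
    private void openNext() throws IOException {
        file = new FileOutputStream(dir.resolve("clicks-" + fileIndex++ + ".log").toFile());
        out = new BufferedOutputStream(file);
        bytesInFile = 0;
        sinceFlush = 0;
    }

    /** Appends one record as a UTF-8 line, persisting and rolling as configured. */
    void append(String record) throws IOException {
        byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
        out.write(bytes);
        bytesInFile += bytes.length;
        if (++sinceFlush >= flushEvery) {
            persist();
        }
        if (bytesInFile >= maxBytes) {
            close();     // cleanly finish the full file
            openNext();  // and continue in a fresh one
        }
    }

    /** Pushes buffered bytes out of the process, then asks the OS to write them to storage. */
    private void persist() throws IOException {
        out.flush();
        file.getFD().sync();
        sinceFlush = 0;
    }

    /** Flushes any buffered bytes and closes the current file. */
    @Override
    public void close() throws IOException {
        out.close();
    }

    /** Number of files opened so far, including the current one. */
    int filesOpened() {
        return fileIndex;
    }
}
```

With this shape, a crash loses at most the records appended since the last persist step in the file that was open at the time; files that were already rolled are closed cleanly.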
-
Re: Writing click stream data to hadoop (Harsh J, 2012-05-31, 02:37)
Thanks for correcting me there on the syncFs call, Luke. I seem to have missed that method when searching the branch-1 code.
-- Harsh J