Re: Writing click stream data to hadoop
I cc'd [EMAIL PROTECTED] because I don't know if Mohit is subscribed there.

Mohit,

you could use Avro to serialize the data and send it to a Flume Avro source. Or you could use syslog - both are supported in Flume 1.x.
http://archive.cloudera.com/cdh/3/flume-ng-1.1.0-cdh3u4/FlumeUserGuide.html
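
Just to illustrate the Avro route, here is a minimal sketch using the Flume 1.x client SDK's RpcClient. The host, port and class name are only placeholders, adapt them to your agent:

import java.nio.charset.Charset;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

// Hypothetical helper that forwards one click record to a Flume Avro source.
public class ClickStreamForwarder {

  private final RpcClient client;

  public ClickStreamForwarder(String host, int port) {
    // Connects to the Avro source configured on the Flume agent.
    this.client = RpcClientFactory.getDefaultInstance(host, port);
  }

  public void send(String clickRecord) throws EventDeliveryException {
    Event event = EventBuilder.withBody(clickRecord, Charset.forName("UTF-8"));
    client.append(event); // returns once the agent has accepted the event
  }

  public void close() {
    client.close();
  }
}

You would call send() from wherever you receive the click data (e.g. your HTTP handler) and let the agent take care of batching and writing to HDFS.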

An exec source is also possible; please note that Flume will only start / use the command you configure and does not take control over the whole process.
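
For the exec source, an agent definition could look roughly like this (agent, source, channel and sink names as well as the paths are made up; see the user guide above for the full set of properties):

# Hypothetical agent "a1": tail a click log with an exec source,
# buffer in a memory channel, write to HDFS.
a1.sources = clicks
a1.channels = mem
a1.sinks = hdfs-out

a1.sources.clicks.type = exec
a1.sources.clicks.command = tail -F /var/log/app/clicks.log
a1.sources.clicks.channels = mem

a1.channels.mem.type = memory
a1.channels.mem.capacity = 10000

a1.sinks.hdfs-out.type = hdfs
a1.sinks.hdfs-out.hdfs.path = /flume/clickstream
a1.sinks.hdfs-out.hdfs.fileType = DataStream
a1.sinks.hdfs-out.channel = mem

Note that a memory channel will not survive an agent crash; pick a durable channel type if losing buffered events is a concern.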

- Alex

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF

On May 30, 2012, at 4:56 PM, Mohit Anchlia wrote:

> On Fri, May 25, 2012 at 9:30 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>
>> Mohit,
>>
>> Not if you call sync (or hflush/hsync in 2.0) periodically to persist
>> your changes to the file. SequenceFile doesn't currently have a
>> sync-API inbuilt in it (in 1.0 at least), but you can call sync on the
>> underlying output stream instead at the moment. This is possible to do
>> in 1.0 (just own the output stream).
>>
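
For what it's worth, here is a rough sketch of the 1.0 approach Harsh describes: you create the output stream yourself so you can sync it periodically. The path, key/value types and sync interval are just placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SyncedSequenceFileSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Own the stream so it can be synced independently of the writer.
    FSDataOutputStream out = fs.create(new Path("/data/clicks/clicks-000001.seq"));
    SequenceFile.Writer writer = SequenceFile.createWriter(
        conf, out, LongWritable.class, Text.class,
        SequenceFile.CompressionType.NONE, null);

    for (long i = 0; i < 10000; i++) {   // stand-in for the incoming click stream
      writer.append(new LongWritable(i), new Text("click record " + i));
      if (i % 1000 == 0) {
        out.sync();  // persist what has been written so far (hflush/hsync in 2.0)
      }
    }
    writer.close();
  }
}
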
>> Your use case also sounds like you may want to simply use Apache Flume
>> (Incubating) [http://incubator.apache.org/flume/] that already does
>> provide these features and the WAL-kinda reliability you seek.
>>
>
> Thanks Harsh. Does Flume also provide an API on top? I am getting this data
> as HTTP calls; how would I go about using Flume with HTTP calls?
>
>>
>> On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia <[EMAIL PROTECTED]>
>> wrote:
>>> We get click data through API calls. I now need to send this data to our
>>> hadoop environment. I am wondering if I could open one sequence file and
>>> write to it until it's of a certain size. Once it's over the specified
>>> size I can close that file and open a new one. Is this a good approach?
>>>
>>> The only thing I worry about is what happens if the server crashes before
>>> I am able to cleanly close the file. Would I lose all previous data?
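
If you do roll the files yourself, the size check can be done against the writer, roughly like this (a fragment meant for the write loop of the sketch further up; the 128 MB threshold and file naming are arbitrary):

// Inside the write loop: roll to a new file once the current one
// exceeds a chosen size, e.g. 128 MB.
if (writer.getLength() > 128L * 1024 * 1024) {
  writer.close();
  out = fs.create(new Path("/data/clicks/clicks-" + System.currentTimeMillis() + ".seq"));
  writer = SequenceFile.createWriter(
      conf, out, LongWritable.class, Text.class,
      SequenceFile.CompressionType.NONE, null);
}
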
>>
>>
>>
>> --
>> Harsh J
>>