Flume, mail # user - Need for UDP / Multicast Source


Re: Need for UDP / Multicast Source
Brock Noland 2013-01-16, 21:36
Hi,

I would use the memory channel for now as opposed to the file channel. For
the file channel to keep up with that volume you'd need multiple disks. Also,
your checkpoint period is super low, which will cause lots of checkpoints
and slow things down.

However, I think the biggest issue is probably batch size. With that
much data you are likely going to want a large batch size for all
components involved, something like a low multiple of 1000. There is a good
article on this:
https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1

To recap, I would:

Use the memory channel for now. Once you prove things work, you can work on
tuning the file channel (you're going to want larger batch sizes and
multiple disks).

Increase the batch size for your source and sink.
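
As a rough illustration, that advice might look something like the following
in flume.conf. The agent and component names here (agent1, mem-channel,
hdfs-sink, file-channel) are placeholders rather than the names in Andrew's
gist, and the numbers are only starting points, not tested values:

    # Memory channel sized for a high event rate (illustrative values).
    agent1.channels.mem-channel.type = memory
    agent1.channels.mem-channel.capacity = 1000000
    agent1.channels.mem-channel.transactionCapacity = 5000

    # Batch size on the HDFS sink: a low multiple of 1000, per the advice above.
    agent1.sinks.hdfs-sink.hdfs.batchSize = 5000

    # For later, when tuning the file channel: spread data dirs across disks,
    # and leave checkpointInterval at its default of 30 seconds (30000 ms)
    # rather than something very low.
    # agent1.channels.file-channel.type = file
    # agent1.channels.file-channel.dataDirs = /disk1/flume/data,/disk2/flume/data
    # agent1.channels.file-channel.checkpointInterval = 30000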

On Wed, Jan 16, 2013 at 1:22 PM, Andrew Otto <[EMAIL PROTECTED]> wrote:
> Ok, I'm trying my new UDPSource with Wikimedia's webrequest log stream.  This is available to me via UDP Multicast.  Everything seems to be working great, except that I seem to be missing a lot of data.
>
> Our webrequest log stream consists of about 100000 events per second, which amounts to around 50 MB per second.
>
> I understand that this is probably too much for a single node to handle, but I should be able to either see most of the data written to HDFS, or at least see errors about channels being filled to capacity.  HDFS files are set to roll every 60 seconds.  Each of my files is only about 4.2 MB, which is only 72 KB per second.  That's only 0.14% of the data I'm expecting to consume.
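
(A quick check of that arithmetic, for reference: 4.2 MB / 60 s ≈ 72 KB/s, and 72 KB/s out of the expected 50 MB/s = 51,200 KB/s is 72 / 51,200 ≈ 0.0014, i.e. about 0.14%.)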
>
> Where did the rest of it go?  If Flume is dropping it, why doesn't it tell me!?
>
> Here's my flume.conf:
>
> https://gist.github.com/4551001
>
>
> Thanks!
>
>
>
>
> On Jan 15, 2013, at 2:31 PM, Andrew Otto <[EMAIL PROTECTED]> wrote:
>
>> I just submitted the patch on https://issues.apache.org/jira/browse/FLUME-1838.
>>
>> Would love some reviews, thanks!
>> -Andrew
>>
>>
>> On Jan 14, 2013, at 1:01 PM, Andrew Otto <[EMAIL PROTECTED]> wrote:
>>
>>> Thanks guys!  I've opened up a JIRA here:
>>>
>>> https://issues.apache.org/jira/browse/FLUME-1838
>>>
>>>
>>> On Jan 14, 2013, at 12:43 PM, Alexander Alten-Lorenz <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hey Andrew,
>>>>
>>>> For your reference, we have a lot of developer information in our wiki:
>>>>
>>>> https://cwiki.apache.org/confluence/display/FLUME/Developer+Section
>>>> https://cwiki.apache.org/confluence/display/FLUME/Developers+Quick+Hack+Sheet
>>>>
>>>> cheers,
>>>> Alex
>>>>
>>>> On Jan 14, 2013, at 6:37 PM, Hari Shreedharan <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> Really happy to hear the Wikimedia Foundation is considering Flume. I am fairly sure that if you find such a source useful, there would definitely be others who find it useful too. I'd recommend filing a JIRA and starting a discussion, and then submitting the patch. We would be happy to review and commit it.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Hari
>>>>>
>>>>> --
>>>>> Hari Shreedharan
>>>>>
>>>>>
>>>>> On Monday, January 14, 2013 at 9:29 AM, Andrew Otto wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm a Systems Engineer at the Wikimedia Foundation, and we're investigating using Flume for our web request log HDFS imports. We've previously been using Kafka, but we've had to change our short-term architecture plans in order to get data into HDFS reliably and regularly in the near term.
>>>>>>
>>>>>> Our current web request logs are available for consumption over a multicast UDP stream. I could hack something together to try and pipe this into Flume using the existing sources (SyslogUDPSource, or maybe some combination of socat + NetcatSource), but I'd rather reduce the number of moving parts. I'd like to consume directly from the multicast UDP stream as a Flume source.
>>>>>>
>>>>>> I coded up a proof of concept based on the SyslogUDPSource, mainly just stripping out the syslog event header extraction and adding in multicast datagram connection code. I plan on cleaning this up and making it a generic raw UDP source, with multicast as a configuration option.

Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
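
For reference, a minimal sketch of the multicast receive loop Andrew describes above (join a group, read raw datagrams). The group address, port, and buffer size are made up for illustration; the actual FLUME-1838 patch wraps this kind of loop in Flume's source lifecycle and turns each datagram into a Flume event:

    import java.io.IOException;
    import java.net.DatagramPacket;
    import java.net.InetAddress;
    import java.net.MulticastSocket;

    public class MulticastUdpReceiver {
        public static void main(String[] args) throws IOException {
            // Hypothetical group address and port, not Wikimedia's actual stream.
            InetAddress group = InetAddress.getByName("239.128.0.1");
            MulticastSocket socket = new MulticastSocket(8420);
            socket.joinGroup(group); // subscribe to the multicast group

            byte[] buf = new byte[4096]; // big enough for one log line
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet); // blocks until a datagram arrives
                String line = new String(packet.getData(), 0, packet.getLength());
                System.out.println(line); // a real source would hand this to the channel
            }
            // A real source would leaveGroup() and close() in its stop() method.
        }
    }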