-
Need for UDP / Multicast Source
Andrew Otto 2013-01-14, 17:29

Hi all,

I'm a Systems Engineer at the Wikimedia Foundation, and we're investigating using Flume for our web request log HDFS imports. We've previously been using Kafka, but have had to change short-term architecture plans in order to get data into HDFS reliably and regularly soon.

Our current web request logs are available for consumption over a multicast UDP stream. I could hack something together to try and pipe this into Flume using the existing sources (SyslogUDPSource, or maybe some combination of socat + NetcatSource), but I'd rather reduce the number of moving parts. I'd like to consume directly from the multicast UDP stream as a Flume source.

I coded up a proof of concept based on the SyslogUDPSource, mainly just stripping out the syslog event header extraction and adding in multicast datagram connection code. I plan on cleaning this up and making it a generic raw UDP source, with multicast being a configuration option.

My question to you guys is: is this something the Flume community would find useful? If so, should I open up a JIRA to track this? I've got a fork of the Flume git repo over on GitHub and will be doing my work there. I'd love to share it upstream if it would be useful.

Thanks!
-Andrew Otto
Systems Engineer
Wikimedia Foundation
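
For readers unfamiliar with what "multicast datagram connection code" involves, here is a minimal sketch of the socket-level steps: bind a UDP socket, join the multicast group, then read datagrams. It is written in Python purely for illustration and is not the proposed Flume source (which would be Java). The group address, port, and the idea that each datagram carries newline-delimited log lines are all assumptions; the thread does not state them.

```python
import socket
import struct

# Hypothetical values; the thread does not give the real group or port.
MCAST_GROUP = "239.1.2.3"
MCAST_PORT = 8420


def build_membership_request(group: str) -> bytes:
    """Pack an ip_mreq struct: multicast group address + any local interface."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))


def open_multicast_socket(group: str, port: int) -> socket.socket:
    """Open a UDP socket bound to `port` and join `group`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_membership_request(group))
    return sock


def datagram_to_events(payload: bytes) -> list:
    """Assumption: each non-empty line in a datagram is one raw event body."""
    return [line for line in payload.split(b"\n") if line]


def run_forever(group: str = MCAST_GROUP, port: int = MCAST_PORT) -> None:
    """Read datagrams and print each event; a real source would hand them to a channel."""
    sock = open_multicast_socket(group, port)
    while True:
        payload, _sender = sock.recvfrom(65535)
        for event in datagram_to_events(payload):
            print(event.decode("utf-8", errors="replace"))
```

Calling `run_forever()` on a host that can reach the group would print events as they arrive. Plain unicast UDP is the same code without the `IP_ADD_MEMBERSHIP` step, which is why multicast fits naturally as a configuration option.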
-
Re: Need for UDP / Multicast Source
Hari Shreedharan 2013-01-14, 17:37

Hi Andrew,

Really happy to hear Wikimedia Foundation is considering Flume. I am fairly sure that if you find such a source useful, there would definitely be others who find it useful too. I'd recommend filing a jira and starting a discussion, and then submitting the patch. We would be happy to review and commit it.

Thanks,
Hari
-
Re: Need for UDP / Multicast Source
Alexander Alten-Lorenz 2013-01-14, 17:43

Hey Andrew,

For your reference, we have a lot of developer information in our wiki:

https://cwiki.apache.org/confluence/display/FLUME/Developer+Section
https://cwiki.apache.org/confluence/display/FLUME/Developers+Quick+Hack+Sheet

cheers,
Alex
-
Re: Need for UDP / Multicast Source
Andrew Otto 2013-01-14, 18:01

Thanks guys! I've opened up a JIRA here:

https://issues.apache.org/jira/browse/FLUME-1838
-
Re: Need for UDP / Multicast Source
Andrew Otto 2013-01-15, 19:31

I just submitted the patch on https://issues.apache.org/jira/browse/FLUME-1838.

Would love some reviews, thanks!
-Andrew
-
Re: Need for UDP / Multicast Source
Andrew Otto 2013-01-16, 21:22

Ok, I'm trying my new UDPSource with Wikimedia's webrequest log stream. This is available to me via UDP multicast. Everything seems to be working great, except that I seem to be missing a lot of data.

Our webrequest log stream consists of about 100000 events per second, which amounts to around 50 MB per second.

I understand that this is probably too much for a single node to handle, but I should be able to either see most of the data written to HDFS, or at least see errors about channels being filled to capacity. HDFS files are set to roll every 60 seconds. Each of my files is only about 4.2 MB, which is only 72 KB per second. That's only 0.14% of the data I'm expecting to consume.

Where did the rest of it go? If Flume is dropping it, why doesn't it tell me!?

Here's my flume.conf:

https://gist.github.com/4551001

Thanks!
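
The figures in this message can be checked with a few lines of arithmetic. The sketch below assumes the rates are in bytes (megabytes and kilobytes, using 1024 as the multiplier), since that reading is the one that reproduces the stated 0.14%.

```python
# Figures quoted in the message; reading them as bytes is an assumption.
EXPECTED_BYTES_PER_SEC = 50 * 1024 * 1024   # ~50 MB/s arriving over multicast
FILE_BYTES = 4.2 * 1024 * 1024              # ~4.2 MB per rolled HDFS file
ROLL_INTERVAL_SEC = 60                      # files roll every 60 seconds


def observed_rate(file_bytes: float, roll_interval_sec: float) -> float:
    """Average bytes per second that actually reached HDFS."""
    return file_bytes / roll_interval_sec


def captured_fraction(observed: float, expected: float) -> float:
    """Share of the incoming stream that was written out."""
    return observed / expected


rate = observed_rate(FILE_BYTES, ROLL_INTERVAL_SEC)
fraction = captured_fraction(rate, EXPECTED_BYTES_PER_SEC)
expected_file_mb = EXPECTED_BYTES_PER_SEC * ROLL_INTERVAL_SEC / (1024 * 1024)

print(f"observed rate: {rate / 1024:.1f} KB/s")
print(f"captured:      {fraction:.2%}")
print(f"a full 60-second file would be about {expected_file_mb:.0f} MB")
```

This gives roughly 72 KB per second and 0.14%, matching the message, and shows that a file holding the complete stream for one roll interval would be about 3000 MB.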
-
Re: Need for UDP / Multicast Source
Brock Noland 2013-01-16, 21:36

Hi,

I would use memory channel for now as opposed to file channel. For file channel to keep up with that you'd need multiple disks. Also, your checkpoint period is super-low, which will cause lots of checkpoints and slow things down.

However, I think the biggest issue is probably batch size. With that much data you are likely going to want a large batch size for all components involved, something like a low multiple of 1000. There is a good article on this:
https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1

To recap, I would:

Use memory channel for now. Once you prove things work you can work on tuning file channel (which is going to need larger batch sizes and multiple disks).

Increase the batch size for your source/sink.
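
To see why batch size matters at this event rate, it helps to count channel transactions. The sketch below assumes each full batch is moved in one channel transaction and that every transaction carries some fixed overhead; the candidate batch sizes are illustrative, not recommendations from the thread.

```python
EVENTS_PER_SEC = 100_000  # rate quoted earlier in the thread


def transactions_per_sec(events_per_sec: int, batch_size: int) -> int:
    """Transactions needed each second if every batch is full (ceiling division)."""
    return -(-events_per_sec // batch_size)


# Hypothetical batch sizes, from one event per transaction up to a few thousand.
for batch_size in (1, 100, 1000, 3000):
    count = transactions_per_sec(EVENTS_PER_SEC, batch_size)
    print(f"batch size {batch_size:>5}: {count:>7} transactions per second")
```

Going from a batch size of 100 to 1000 cuts the number of transactions per second by a factor of ten, which is the kind of saving a "low multiple of 1000" is aiming for.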
-
Re: Need for UDP / Multicast Source
Andrew Otto 2013-01-16, 22:30

Cool, thanks for the advice! That's a great blog post.

I've changed my ways (for now at least). I've got lots of disks to use once memory starts working, and this node has tooooons of memory (192G).

Here's my new flume.conf:
https://gist.github.com/4551513

This is doing better, for sure. Note that I took out the timestamp regex_extractor just in case that was impacting performance. I'm using the regular ol' timestamp interceptor now.

I'm still not doing so great though. I'm getting about 300 MB per minute in my HDFS files. I should be getting about 3 GB. That's better than before though. I've got 10% of the data this time, rather than 0.14% :)
-
Re: Need for UDP / Multicast Source
Brock Noland 2013-01-16, 22:34

Good to hear! Take five or six thread dumps of it and then send them our way.
-
Re: Need for UDP / Multicast Source
Hari Shreedharan 2013-01-16, 22:47

Also, can you try adding more HDFS sinks reading from the same channel? I'd recommend using different file prefixes or paths for each sink to avoid collisions. Since each sink really has just one thread driving it, adding multiple sinks might help. Also, keep an eye on the memory channel's size and see if it is filling up (there will be ChannelExceptions in the logs if it is).

Hari
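
The suggestion above (several HDFS sinks on one channel, each with its own file prefix) can be sketched as a small generator for the relevant flume.conf lines. The agent name, channel name, HDFS path, and prefix pattern are hypothetical; the property keys (`type`, `channel`, `hdfs.path`, `hdfs.filePrefix`, `hdfs.batchSize`, `hdfs.rollInterval`) are standard HDFS sink settings, but the values are only examples and the actual gist configuration is not reproduced here.

```python
def hdfs_sink_lines(agent: str, channel: str, base_path: str,
                    count: int, batch_size: int) -> list:
    """Return flume.conf lines for `count` HDFS sinks draining one channel."""
    names = [f"hdfs{i}" for i in range(1, count + 1)]
    lines = [f"{agent}.sinks = {' '.join(names)}"]
    for name in names:
        prefix = f"{agent}.sinks.{name}"
        lines += [
            f"{prefix}.type = hdfs",
            f"{prefix}.channel = {channel}",
            f"{prefix}.hdfs.path = {base_path}",
            # A distinct prefix per sink keeps their output files from colliding.
            f"{prefix}.hdfs.filePrefix = webrequest-{name}",
            f"{prefix}.hdfs.batchSize = {batch_size}",
            f"{prefix}.hdfs.rollInterval = 60",
        ]
    return lines


# Hypothetical agent "agent1" with memory channel "mem1" and three sinks.
config_lines = hdfs_sink_lines("agent1", "mem1", "/flume/webrequest", 3, 1000)
print("\n".join(config_lines))
```

Generating the block this way makes it easy to try two, four, or eight sinks while watching the logs for ChannelExceptions.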
-
Re: Need for UDP / Multicast SourceAndrew Otto 2013-01-16, 23:03
Ok, thanks. Quick Q: Won't each sink consume the same data? Do I need to set up the load balancing sink processor to keep that from happening?
On Jan 16, 2013, at 5:47 PM, Hari Shreedharan <[EMAIL PROTECTED]> wrote:

> Also can you try adding more HDFS sinks reading from the same channel. I'd recommend using different file prefixes, or paths for each sink, to avoid collision. Since each sink really has just one thread driving them, adding multiple sinks might help. Also, keep an eye on the memory channel's sizes and see if it is filling up (there will be ChannelExceptions in the logs if it is).

Re: Need for UDP / Multicast Source
Hari Shreedharan 2013-01-16, 23:09
No, each sink will not consume the same data. If data is taken and committed from a channel, only the sink which took it will see it. When a sink calls take, no other sink will be able to access the data (though it is still in the channel) unless the transaction is rolled back (or in case of the FileChannel, the channel gets restarted due to agent restart or reconfig). If you have a sink processor, only one of the n sinks in the group is active at one time (basically there is one thread running the n sinks, polling them based on the sink processor's decision on which sink to poll). Without a sink processor, each sink gets its own sink runner thread.
Hari

--
Hari Shreedharan
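The take semantics Hari describes can be pictured with a toy model. This is a minimal sketch in plain Python, not Flume code: a `queue.Queue` stands in for the channel and each thread stands in for one sink runner (the names `drain` and `sink` are made up for illustration). It only shows the competing-consumer behaviour: an event taken by one consumer is never seen by another.

```python
import queue
import threading


def drain(channel, n_sinks):
    """Start n_sinks consumer threads and return the events each one took."""
    taken = [[] for _ in range(n_sinks)]

    def sink(idx):
        while True:
            try:
                # "take": once removed, no other consumer can get this event
                event = channel.get_nowait()
            except queue.Empty:
                return
            taken[idx].append(event)

    threads = [threading.Thread(target=sink, args=(i,)) for i in range(n_sinks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return taken


# Toy "channel" pre-filled with 1000 events, drained by 4 "sinks".
channel = queue.Queue()
for i in range(1000):
    channel.put(i)

taken = drain(channel, n_sinks=4)
all_taken = [event for per_sink in taken for event in per_sink]
print(len(all_taken), len(set(all_taken)))  # total taken vs. distinct events
```

How the events split across the four consumers varies from run to run; what stays fixed is that the total equals the number of distinct events, i.e. nothing is delivered twice.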

Re: Need for UDP / Multicast Source
Bhaskar V. Karambelkar 2013-01-17, 01:21
Maybe a stupid question, but since you're working with UDP, are you sure all your data is making it through to Flume? With UDP there's no guaranty that the data will reach destination. Can you see if something like 'netstat -su' on sources and destination flume nodes shows any problems.

Bhaskar

Re: Need for UDP / Multicast Source
Andrew Otto 2013-01-17, 15:34
> with UDP there's no guaranty that the data will reach destination.
True, but I'm experimenting with using Flume as a replacement for a system that is already in place. I actually got the numbers I listed below by grabbing data directly off of the UDP stream and saving them to a file on local disk. It's possible that UDP data is getting lost in the network somewhere, but if that were the case I wouldn't know about it. I am comparing Flume's performance to a single process writing to a local disk.

Re: Need for UDP / Multicast Source
Andrew Otto 2013-01-17, 16:26
Ok, I'm still struggling with this a bit. Here's what I've currently got going.
In order to make it easier to check what I am and am not receiving, I've narrowed the logs that I store in HDFS down to those originating from a single host (cp1044.wikimedia.org). Each host generates contiguous sequence numbers for each log line. I can use the sequence number to make sure I'm not missing lines from a host.

On another nearby node, I started a process to store all of the log lines originating from this cp1044. I then started the Flume agent and waited 3 minutes for it to roll files 3 times. I currently have 4 HDFS sinks going, so this created a total of 12 files. I got the files out of HDFS, and then sorted on their sequence numbers to get the first and last sequence number in this set of files.

I took those two border sequence numbers and extracted all of the log lines generated by cp1044 on the nearby host (not using Flume). I should be able to compare the number of lines here with the number of lines in the 12 files I extracted from HDFS and Flume. If they are the same, then Flume and UDPSource is working!

Flume saved 19451 events to HDFS, and the number of raw events recorded outside of Flume and HDFS was 78176. I'm up to about 25% of data! Better but still not good enough. :(

This was for about 3 minutes of data, so for a single host, this shouldn't be more than 500 events per second. I must be doing something really wrong on the Flume tweaky side of things, eh?

Any more ideas? Thanks!

P.S. YOU GUYS ARE SO HELPFUL. Thanks so much for everything thus far.
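The sequence-number comparison described in this message can be sketched roughly as follows. This is an illustration only, not the commands from the gist: the function name `compare_capture` and the toy sequence numbers are made up, and it assumes the sequence numbers have already been extracted from the HDFS files and from the reference capture.

```python
def compare_capture(flume_seqs, reference_seqs):
    """Compare sequence numbers stored via Flume with an independent capture."""
    flume = set(flume_seqs)
    first, last = min(flume), max(flume)
    # Only reference lines inside the window covered by the Flume files count.
    expected = {s for s in reference_seqs if first <= s <= last}
    missing = expected - flume
    fraction = len(flume & expected) / len(expected) if expected else 0.0
    return {
        "first": first,
        "last": last,
        "flume_count": len(flume),
        "expected_count": len(expected),
        "missing_count": len(missing),
        "fraction_received": fraction,
    }


# Toy data (hypothetical): the reference capture saw 100..120,
# the Flume files contain only five of those lines.
reference = range(100, 121)
flume = [103, 104, 107, 110, 115]
report = compare_capture(flume, reference)
print(report)
```

With the counts reported above, the same ratio is 19451 / 78176, which is roughly 0.249 and matches the "about 25%" estimate.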

Re: Need for UDP / Multicast Source
Andrew Otto 2013-01-17, 17:36
> I took those two border sequence numbers and extracted all of the log lines generated by cp1044 on the nearby host (not using Flume). I should be able to compare the number of lines here with the number of lines in the 12 files I extracted from HDFS and Flume. If they are the same, then Flume and UDPSource is working!
Oh, I meant to link to a Gist with my current flume.conf and the commands I executed to investigate this. Here it is:

https://gist.github.com/4557178

Re: Need for UDP / Multicast Source
Jeff Lord 2013-01-17, 17:59
Hi Andrew,
You may try lowering transactionCapacity here. The transactionCapacity should be set to the value of the largest batch size that will be used to store or remove events from that channel. You currently have it equal to the capacity of the channel. So essentially the channel *could be* filled with one transaction depending on how you are batching with your client.

Also it may be useful to turn up JMX monitoring and watch the channel counters using jconsole. This way you can see exactly how many events are placed in the channel.

To do this you will need to set the following Java system properties in /etc/flume-ng/conf/flume-env.sh:

    com.sun.management.jmxremote
    com.sun.management.jmxremote.port=8081
    com.sun.management.jmxremote.authenticate=false
    com.sun.management.jmxremote.ssl=false

You should then be able to connect with:

    jconsole hostname:8081

-Jeff

Re: Need for UDP / Multicast Source
Brock Noland 2013-01-17, 18:04
Yeah, what Jeff said. It would be interesting to know which component cannot keep up, the source or the sink. If the sink cannot keep up you'll see a growing channel size.

I have written something similar to read events via UDP before. I found that because UDP can so easily drop data, I needed a thread dedicated to reading the events and then immediately hand them off to another thread to do anything interesting. It's possible you are in this scenario.

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
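The "dedicated reader thread" pattern Brock mentions could look something like the sketch below. It is written in Python for brevity and is not the UDPSource or Netty code from the patch: the names `reader_loop`, `worker_loop` and `run` are hypothetical, and the queue size is an arbitrary placeholder. The idea is that the reader does nothing except `recv()` and enqueue, so the kernel receive buffer is emptied as fast as possible while slower work happens on another thread.

```python
import queue
import socket
import threading


def reader_loop(sock, handoff, stop, bufsize=65535):
    """Only receive and enqueue; never do slow work on this thread."""
    while not stop.is_set():
        try:
            data = sock.recv(bufsize)
        except socket.timeout:
            continue  # wake up periodically so `stop` is noticed
        except OSError:
            break  # socket was closed
        try:
            handoff.put_nowait(data)
        except queue.Full:
            pass  # a real source would count this as a dropped event


def worker_loop(handoff, handle, stop):
    """Do the 'interesting' (slow) work away from the receive path."""
    while not (stop.is_set() and handoff.empty()):
        try:
            data = handoff.get(timeout=0.1)
        except queue.Empty:
            continue
        handle(data)


def run(port=8420, handle=lambda data: None):
    """Wire both loops together on a plain (non-multicast) UDP socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    sock.settimeout(0.5)
    handoff = queue.Queue(maxsize=100_000)  # placeholder size
    stop = threading.Event()
    reader = threading.Thread(
        target=reader_loop, args=(sock, handoff, stop), daemon=True
    )
    reader.start()
    worker_loop(handoff, handle, stop)
```

The bounded queue makes the trade-off explicit: if the worker falls behind, events are dropped at the handoff (where they can be counted) rather than silently in the kernel buffer.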

Re: Need for UDP / Multicast Source
Andrew Otto 2013-01-17, 18:56
> Also it may be useful to turn up jmx monitoring and watch the channel counters using console.
Yeah, I've done this too. Everything source->sink related seems to keep up.

> Can you see if something like 'netstat -su' on sources and destination flume nodes shows any problems.

Sorry, I should have done this earlier. 'packet receive errors' steadily increases on my Flume node while flume is running. Also:

    netstat -anu | grep 8420
    udp6  228096      0 233.58.59.1:8420        :::*

RecvQ is pretty full there! It looks to me like Flume isn't reading off of the UDP queue and doing the channel put fast enough.

> I found that because UDP can so easily drop data, I needed a thread
> dedicated to reading the events and then immediately hand them off to
> another thread to do anything interesting. It's possible you are in
> this scenario.

Yeah, sigh. Maybe so. One of the reasons we are trying Flume is to get ourselves out of this situation though!

Re: Need for UDP / Multicast Source
Andrew Otto 2013-01-17, 17:33
> Since each sink really has just one thread driving them, adding multiple sinks might help.
Oh hey, how does hdfs.threadsPoolSize relate to adding multiple sinks? The docs say this is the "Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)".

I've got 24 cores (12 + hyperthreading) on the machine I'm using to test this stuff. I only see one under heavy use. There are currently 98 flume threads running, and they are (relatively) spread out across all of the CPUs.

I'm starting to suspect that the source thread just can't keep up with all of the incoming UDP data, so it is dropping packets somewhere. When this happens with another C program that we use to consume this stream internally, I see the 'drops' counter increase for the port in /proc/<pid>/net/udp, but I am not seeing this happen now.

Is there a way to know if the JVM (or in this case Netty?) is dropping UDP packets? As far as I can tell, Java's UDP interface is just a wrapper around the native UDP socket implementation, so there shouldn't be anything hidden here. Or maybe there is some sneaky JVM/Netty buffering going on that I don't know about?

-Andrew
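For watching the 'drops' counter mentioned above without eyeballing the file, something like this sketch might help. It assumes the usual Linux column layout of /proc/net/udp (receive queue as the hex value after the colon in the fifth column, drops as the last, decimal column); that layout is my understanding rather than something stated in this thread, and it can differ between kernel versions. The function name `udp_socket_stats` and the sample row are made up for illustration; only the port (8420) and the queue size (228096 bytes) echo the netstat output quoted earlier.

```python
def udp_socket_stats(proc_net_udp_text, port):
    """Return (rx_queue_bytes, drops) for each socket bound to `port`."""
    stats = []
    for line in proc_net_udp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 13:
            continue  # assumed layout has 13 columns; skip anything else
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port != port:
            continue
        rx_queue = int(fields[4].split(":")[1], 16)  # "tx_queue:rx_queue" in hex
        drops = int(fields[12])
        stats.append((rx_queue, drops))
    return stats


# Fabricated sample row for illustration; the drops value is invented.
sample = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
  10: 00000000:20E4 00000000:0000 07 00000000:00037B00 00:00000000 00000000     0        0 12345 2 0000000000000000 42
"""
print(udp_socket_stats(sample, 8420))
```

On a live system the input would be the contents of /proc/net/udp (or /proc/net/udp6 for the udp6 socket shown by netstat), read twice a few seconds apart to see whether the drops value is growing.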