Flume >> mail # user >> Need for UDP / Multicast Source


Thread index (collapsed replies):
- Andrew Otto (2013-01-14, 17:29)
- Hari Shreedharan (2013-01-14, 17:37)
- Alexander Alten-Lorenz (2013-01-14, 17:43)
- Andrew Otto (2013-01-14, 18:01)
- Andrew Otto (2013-01-15, 19:31)
- Andrew Otto (2013-01-16, 21:22)
- Brock Noland (2013-01-16, 21:36)
- Andrew Otto (2013-01-16, 22:30)
- Brock Noland (2013-01-16, 22:34)
- Hari Shreedharan (2013-01-16, 22:47)
- Andrew Otto (2013-01-16, 23:03)
- Hari Shreedharan (2013-01-16, 23:09)
Re: Need for UDP / Multicast Source
Maybe a stupid question, but since you're working with UDP, are you sure
all your data is making it through to Flume?
With UDP there's no guarantee that the data will reach its destination.

Can you see if something like 'netstat -su' on the source and destination
Flume nodes shows any problems?
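[Editor's note] Bhaskar's check can be scripted. On Linux the counters behind `netstat -su` come from /proc/net/snmp; a minimal sketch (the sample text and its numbers below are made up for illustration) that pulls out the UDP error counters:

```python
def udp_stats(snmp_text):
    """Parse the two 'Udp:' lines of /proc/net/snmp (the data behind
    `netstat -su`) into a {counter_name: value} dict."""
    udp_lines = [l for l in snmp_text.splitlines() if l.startswith("Udp:")]
    names = udp_lines[0].split()[1:]              # header row
    values = [int(v) for v in udp_lines[1].split()[1:]]
    return dict(zip(names, values))

# Made-up sample in /proc/net/snmp format; on a real host you would feed in
# open("/proc/net/snmp").read() on each Flume node and watch whether the
# error counters grow over time.
sample = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 8237648 12 5503 991234 5401 0\n"
)
stats = udp_stats(sample)
# A growing RcvbufErrors means the kernel dropped datagrams because the
# receiving process (here, the Flume source) couldn't drain its socket
# buffer fast enough -- exactly the silent loss Bhaskar is asking about.
```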

Bhaskar

On Wed, Jan 16, 2013 at 6:09 PM, Hari Shreedharan <[EMAIL PROTECTED]
> wrote:

>  No, each sink will not consume the same data. If data is taken and
> committed from a channel, only the sink which took it will see it. When a
> sink calls take, no other sink will be able to access the data (though it
> is still in the channel) unless the transaction is rolled back (or in case
> of the FileChannel, the channel gets restarted due to agent restart or
> reconfig). If you have a sink processor, only one of the n sinks in the
> group is active at one time (basically there is one thread running the n
> sinks, polling them based on the sink processor's decision on which sink to
> poll). Without a sink processor, each sink gets its own sink runner
> thread.
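[Editor's note] To make Hari's point concrete, here is a hedged flume.conf sketch (the agent and component names are hypothetical) with two sinks draining one channel and no sink processor: each sink gets its own runner thread, and each event is taken and committed by exactly one of them, so the load is split, not duplicated.

```properties
# Hypothetical agent "a1": one source, one channel, two sinks, no sinkgroup.
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

a1.sources.r1.channels = c1

# Both sinks read from the same channel. Without a sink processor, each
# sink runs in its own runner thread; an event taken (and committed) by k1
# is never seen by k2, so the sinks share the load instead of copying it.
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
```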
>
>
> Hari
>
> --
> Hari Shreedharan
>
> On Wednesday, January 16, 2013 at 3:03 PM, Andrew Otto wrote:
>
> Ok, thanks.  Quick Q:  Won't each sink consume the same data?  Do I need
> to set up the load balancing sink processor to keep that from happening?
>
>
> On Jan 16, 2013, at 5:47 PM, Hari Shreedharan <[EMAIL PROTECTED]>
> wrote:
>
>  Also, can you try adding more HDFS sinks reading from the same channel?
> I'd recommend using different file prefixes, or paths for each sink, to
> avoid collision. Since each sink really has just one thread driving them,
> adding multiple sinks might help. Also, keep an eye on the memory channel's
> sizes and see if it is filling up (there will be ChannelExceptions in the
> logs if it is).
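[Editor's note] A hedged sketch of what Hari suggests (agent, channel, and sink names are hypothetical): two HDFS sinks on the same channel, each with its own file prefix so their output files never collide.

```properties
# Two HDFS sinks draining one channel; distinct filePrefix values keep
# them from writing to the same file in the same HDFS directory.
a1.sinks = hdfs1 hdfs2

a1.sinks.hdfs1.type = hdfs
a1.sinks.hdfs1.channel = c1
a1.sinks.hdfs1.hdfs.path = /flume/webrequest/%Y-%m-%d
a1.sinks.hdfs1.hdfs.filePrefix = part-1

a1.sinks.hdfs2.type = hdfs
a1.sinks.hdfs2.channel = c1
a1.sinks.hdfs2.hdfs.path = /flume/webrequest/%Y-%m-%d
a1.sinks.hdfs2.hdfs.filePrefix = part-2
```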
>
>
> Hari
>
> --
> Hari Shreedharan
>
> On Wednesday, January 16, 2013 at 2:34 PM, Brock Noland wrote:
>
> Good to hear! Take five or six thread dumps of it and send them our way.
>
> On Wed, Jan 16, 2013 at 2:30 PM, Andrew Otto <[EMAIL PROTECTED]> wrote:
>
> Cool, thanks for the advice! That's a great blog post.
>
> I've changed my ways (for now at least). I've got lots of disks to use
> once things are working with the memory channel, and this node has
> tooooons of memory (192G).
>
> Here's my new flume.conf:
> https://gist.github.com/4551513
>
> This is doing better, for sure. Note that I took out the timestamp
> regex_extractor just in case that was impacting performance. I'm using the
> regular ol' timestamp interceptor now.
>
> I'm still not doing so great though. I'm getting about 300 Mb per minute
> in my HDFS files. I should be getting about 3G. That's better than before
> though. I've got 10% of the data this time, rather than 0.14% :)
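[Editor's note] A back-of-envelope check of Andrew's figures, taking the stream's quoted ~50 Mb per second at face value:

```python
# Sanity check of the "10% of the data" figure from the thread.
stream_mb_per_sec = 50                         # quoted webrequest stream rate
expected_mb_per_min = stream_mb_per_sec * 60   # ~3000 Mb, i.e. ~3G per minute
observed_mb_per_min = 300                      # what HDFS is actually getting

fraction_captured = observed_mb_per_min / expected_mb_per_min
# fraction_captured works out to 0.1 -- the "10%" Andrew mentions.
```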
>
>
>
>
> On Jan 16, 2013, at 4:36 PM, Brock Noland <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I would use memory channel for now as opposed to file channel. For
> file channel to keep up with that you'd need multiple disks. Also your
> checkpoint period is super-low which will cause lots of checkpoints
> and slow things down.
>
> However, I think the biggest issue is probably batch size. With that
> much data you are likely going to want a large batch size for all
> components involved. Something a low multiple of 1000. There is a good
> article on this:
> https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1
>
> To re-cap, I would:
>
> Use memory channel for now. Once you prove things work you can work on
> tuning file channel (you're going to want larger batch sizes and multiple
> disks).
>
> Increase the batch size for your source/sink.
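[Editor's note] Brock's recap could look something like the fragment below (a sketch only; the capacities and batch size are illustrative, not tuned values): a memory channel sized to absorb bursts, with batch/transaction sizes in the low thousands, per the linked performance-tuning article.

```properties
# Memory channel with headroom; transactionCapacity bounds the batch a
# source puts or a sink takes in one transaction.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 5000

# Match the sink's batch size to the channel's transaction capacity.
a1.sinks.k1.hdfs.batchSize = 5000
```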
>
> On Wed, Jan 16, 2013 at 1:22 PM, Andrew Otto <[EMAIL PROTECTED]> wrote:
>
> Ok, I'm trying my new UDPSource with Wikimedia's webrequest log stream.
> This is available to me via UDP Multicast. Everything seems to be working
> great, except that I seem to be missing a lot of data.
>
> Our webrequest log stream consists of about 100000 events per second,
> which amounts to around 50 Mb per second.
Later replies (collapsed):
- Andrew Otto (2013-01-17, 15:34)
- Andrew Otto (2013-01-17, 16:26)
- Andrew Otto (2013-01-17, 17:36)
- Jeff Lord (2013-01-17, 17:59)
- Brock Noland (2013-01-17, 18:04)
- Andrew Otto (2013-01-17, 18:56)
- Andrew Otto (2013-01-17, 17:33)