-
Question about gzip compression when using Flume Ng
Sagar Mehta 2013-01-14, 19:18
Hi Guys,
I'm using Flume Ng and it works great for me. In essence, I'm using an exec source to run tail -F on a logfile, and two HDFS sinks with a File channel. So far so good. Now I'm trying to use gzip compression with the following config, as per the Flume-Ng User Guide at http://flume.apache.org/FlumeUserGuide.html.

#gzip compression related settings
collector102.sinks.sink1.hdfs.codeC = gzip
collector102.sinks.sink1.hdfs.fileType = CompressedStream
collector102.sinks.sink1.hdfs.fileSuffix = .gz

However, this is what appears to be happening:

Flume seems to write gzip-compressed output [I see the .gz files in the output buckets]. However, when I try to decompress a file, I get a warning about 'trailing garbage ignored', and the decompressed output is in fact smaller than the compressed file (58898 bytes versus 5381235 bytes).

hadoop@jobtracker301:/home/hadoop/sagar/temp$ ls -ltr collector102.ngpipes.sac.ngmoco.com.1357936638713.gz
-rw-r--r-- 1 hadoop hadoop 5381235 2013-01-11 20:44 collector102.ngpipes.sac.ngmoco.com.1357936638713.gz

hadoop@jobtracker301:/home/hadoop/sagar/temp$ gunzip collector102.ngpipes.sac.ngmoco.com.1357936638713.gz

gzip: collector102.ngpipes.sac.ngmoco.com.1357936638713.gz: decompression OK, trailing garbage ignored

hadoop@jobtracker301:/home/hadoop/sagar/temp$ ls -l
-rw-r--r-- 1 hadoop hadoop 58898 2013-01-11 20:44 collector102.ngpipes.sac.ngmoco.com.1357936638713

Below are some helpful details.

I'm using apache-flume-1.4.0-SNAPSHOT-bin

smehta@collector102:/opt$ ls -l flume
lrwxrwxrwx 1 root root 31 2012-12-14 00:44 flume -> apache-flume-1.4.0-SNAPSHOT-bin

I also have the hadoop-core jar in my path

smehta@collector102:/opt/flume/lib$ ls -l hadoop-core-0.20.2-cdh3u2.jar
-rw-r--r-- 1 hadoop hadoop 3534499 2012-12-01 01:53 hadoop-core-0.20.2-cdh3u2.jar

Everything is working well for me except the compression part. I'm not quite sure what I'm missing here. So while I debug this, any ideas/help is much appreciated.

Thanks in advance,
Sagar
-
Re: Question about gzip compression when using Flume Ng
Connor Woodson 2013-01-14, 22:25
Can you post your full config?
- Connor
-
Re: Question about gzip compression when using Flume Ng
Sagar Mehta 2013-01-14, 22:34
Yeah sure!!
smehta@collector102:/opt/flume/conf$ cat hdfs.conf
# hdfs.conf: This is a configuration file that configures Flume NG to use
# An exec source to get a live tail of the jetty logFile
# An hdfs sink to write events to the hdfs on the test cluster
# A file based channel to connect the above source and sink

# Name the components on this agent
collector102.sources = source1
collector102.sinks = sink1 sink2
collector102.channels = channel1 channel2

# Configure the source
collector102.sources.source1.type = exec
collector102.sources.source1.command = tail -F /opt/jetty/logFile.log

# Configure the interceptors
collector102.sources.source1.interceptors = TimestampInterceptor HostInterceptor

# We use the Timestamp interceptor to get timestamps of when flume receives events
# This is used for figuring out the bucket to which an event goes
collector102.sources.source1.interceptors.TimestampInterceptor.type = timestamp

# We use the Host interceptor to populate the host header with the fully qualified domain name of the collector.
# That way we know which file in the sink represents which collector.
collector102.sources.source1.interceptors.HostInterceptor.type = org.apache.flume.interceptor.HostInterceptor$Builder
collector102.sources.source1.interceptors.HostInterceptor.preserveExisting = false
collector102.sources.source1.interceptors.HostInterceptor.useIP = false
collector102.sources.source1.interceptors.HostInterceptor.hostHeader = host

# Configure the sink
collector102.sinks.sink1.type = hdfs

# Configure the bucketing
collector102.sinks.sink1.hdfs.path=hdfs://namenode301.ngpipes.milp.ngmoco.com:9000/ngpipes-raw-logs/%Y-%m-%d/%H00

# Prefix the file with the source so that we know where the events in the file came from
collector102.sinks.sink1.hdfs.filePrefix = %{host}

# We roll the flume output file based on time interval - currently every 5 minutes
collector102.sinks.sink1.hdfs.rollSize = 0
collector102.sinks.sink1.hdfs.rollCount = 0
collector102.sinks.sink1.hdfs.rollInterval = 300

#gzip compression related settings
collector102.sinks.sink1.hdfs.codeC = gzip
collector102.sinks.sink1.hdfs.fileType = CompressedStream
collector102.sinks.sink1.hdfs.fileSuffix = .gz

# Configure the sink
collector102.sinks.sink2.type = hdfs

# Configure the bucketing
collector102.sinks.sink2.hdfs.path=hdfs://namenode5001.ngpipes.sac.ngmoco.com:9000/ngpipes-raw-logs/%Y-%m-%d/%H00

# Prefix the file with the source so that we know where the events in the file came from
collector102.sinks.sink2.hdfs.filePrefix = %{host}

# We roll the flume output file based on time interval - currently every 5 minutes
collector102.sinks.sink2.hdfs.rollSize = 0
collector102.sinks.sink2.hdfs.rollCount = 0
collector102.sinks.sink2.hdfs.rollInterval = 300
collector102.sinks.sink2.hdfs.fileType = DataStream

# Configure the channel that connects the source to the sink

# Use a channel which buffers events in filesystem
collector102.channels.channel1.type = file
collector102.channels.channel1.checkpointDir = /data/flume_data/channel1/checkpoint
collector102.channels.channel1.dataDirs = /data/flume_data/channel1/data

# Use a channel which buffers events in filesystem
collector102.channels.channel2.type = file
collector102.channels.channel2.checkpointDir = /data/flume_data/channel2/checkpoint
collector102.channels.channel2.dataDirs = /data/flume_data/channel2/data

# Bind the source and sink to the channel configured above
collector102.sources.source1.channels = channel1 channel2
collector102.sinks.sink1.channel = channel1
collector102.sinks.sink2.channel = channel2
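A quick sanity check of the compression settings in a config like the one above can be sketched in Python. This is not part of Flume; parse_properties and check_hdfs_sinks are hypothetical helper names, and the parser only handles simple "key = value" or "key value" lines. The two rules it checks come from the HDFS sink section of the Flume User Guide: CompressedStream needs hdfs.codeC, and DataStream should not set it. SequenceFile is assumed as the default fileType.

```python
def parse_properties(text):
    """Parse simple 'key = value' or 'key value' lines, skipping comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "=" in line:
            key, value = line.split("=", 1)
        else:
            parts = line.split(None, 1)
            key, value = parts[0], (parts[1] if len(parts) > 1 else "")
        props[key.strip()] = value.strip()
    return props


def check_hdfs_sinks(props, agent):
    """Return a list of human-readable problems for the agent's HDFS sinks."""
    problems = []
    for sink in props.get(agent + ".sinks", "").split():
        prefix = agent + ".sinks." + sink + "."
        if props.get(prefix + "type") != "hdfs":
            continue
        file_type = props.get(prefix + "hdfs.fileType", "SequenceFile")
        codec = props.get(prefix + "hdfs.codeC")
        if file_type == "CompressedStream" and not codec:
            problems.append(sink + ": fileType CompressedStream requires hdfs.codeC")
        if file_type == "DataStream" and codec:
            problems.append(sink + ": fileType DataStream should not set hdfs.codeC")
    return problems


# A trimmed-down copy of the sink settings from the config above.
SAMPLE = """\
collector102.sinks = sink1 sink2
collector102.sinks.sink1.type = hdfs
collector102.sinks.sink1.hdfs.codeC = gzip
collector102.sinks.sink1.hdfs.fileType = CompressedStream
collector102.sinks.sink2.type = hdfs
collector102.sinks.sink2.hdfs.fileType = DataStream
"""

problems = check_hdfs_sinks(parse_properties(SAMPLE), "collector102")
print(problems)
```

By these two rules the posted settings look consistent, which suggests the trailing-garbage symptom lies somewhere other than the fileType/codeC combination.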
-
Re: Question about gzip compression when using Flume Ng
Connor Woodson 2013-01-14, 22:52
Try adding:
collector102.sinks.sink1.hdfs.writeFormat = TEXT
collector102.sinks.sink2.hdfs.writeFormat = TEXT

- Connor
-
Re: Question about gzip compression when using Flume Ng
Sagar Mehta 2013-01-14, 23:12
Yeah I have tried the text write format in vain before, but nevertheless gave it a try again!! Below is the latest file - still the same thing.

hadoop@jobtracker301:/home/hadoop/sagar/debug$ date
Mon Jan 14 23:02:07 UTC 2013

hadoop@jobtracker301:/home/hadoop/sagar/debug$ hls /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz
Found 1 items
-rw-r--r-- 3 hadoop supergroup 4798117 2013-01-14 22:55 /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz

hadoop@jobtracker301:/home/hadoop/sagar/debug$ hget /ngpipes-raw-logs/2013-01-14/2200/collector102.ngpipes.sac.ngmoco.com.1358204141600.gz .
hadoop@jobtracker301:/home/hadoop/sagar/debug$ gunzip collector102.ngpipes.sac.ngmoco.com.1358204141600.gz

gzip: collector102.ngpipes.sac.ngmoco.com.1358204141600.gz: decompression OK, trailing garbage ignored

Interestingly enough, the gzip page says it is a harmless warning - http://www.gzip.org/#faq8

However, I'm losing events on decompression so I cannot afford to ignore this warning. The gzip page gives an example about magnetic tape - there is an analogy of hdfs block here since the file is initially stored in hdfs before I pull it out on the local filesystem.

Sagar
-
Re: Question about gzip compression when using Flume Ng
Brock Noland 2013-01-14, 23:16
Hi,
Can you try: zcat file > output

I think what is occurring is that, because of the flush, the output file is actually several concatenated gz files.

Brock

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
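The "several concatenated gz files" idea can be illustrated with a small Python sketch. It only shows what a multi-member gzip stream looks like to two kinds of readers; it is not Flume or Hadoop code, the sample events are made up, and it does not establish that this is what the sink actually wrote. The helper name read_first_member is hypothetical.

```python
import gzip
import zlib

# Two batches compressed separately and written back to back,
# which is the situation the hypothesis describes.
batch1 = gzip.compress(b"event 1\nevent 2\n")
batch2 = gzip.compress(b"event 3\n")
stream = batch1 + batch2


def read_first_member(data):
    """Decode only the first gzip member; return (payload, leftover bytes)."""
    decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)  # 16 selects gzip framing
    payload = decoder.decompress(data) + decoder.flush()
    return payload, decoder.unused_data


# A reader that stops after one member sees the rest as trailing bytes.
first, leftover = read_first_member(stream)

# A reader that understands multi-member streams recovers every batch.
everything = gzip.decompress(stream)

print(first, len(leftover), everything)
```

A reader that understands multi-member streams returns all batches, so well-formed concatenated members alone would not lose data. If such a reader still returns too little, the bytes after the first member are probably not a valid gzip member.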
-
Re: Question about gzip compression when using Flume Ng
Connor Woodson 2013-01-14, 23:17
What if you switch to bz2 compression?
-
Re: Question about gzip compression when using Flume Ng
Sagar Mehta 2013-01-14, 23:24
hadoop@jobtracker301:/home/hadoop/sagar/debug$ zcat collector102.ngpipes.sac.ngmoco.com.1358204406896.gz | wc -l

gzip: collector102.ngpipes.sac.ngmoco.com.1358204406896.gz: decompression OK, trailing garbage ignored
100

This should be about 50,000 events for the 5 min window!!

Sagar
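To see where decoding stops in a file like this, one could walk it member by member and note the offset of the first bytes that do not decode. The following is a rough diagnostic sketch in Python, assuming the file fits in memory; scan_gzip_members is a hypothetical helper, and the sample data is synthetic rather than taken from the cluster.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def scan_gzip_members(data):
    """Return (newline count per decodable member, offset where decoding stopped)."""
    counts = []
    offset = 0
    while data[offset:offset + 2] == GZIP_MAGIC:
        decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)  # gzip framing
        try:
            payload = decoder.decompress(data[offset:]) + decoder.flush()
        except zlib.error:
            break  # corrupt member: stop here
        if not decoder.eof:
            break  # truncated member: stop here
        counts.append(payload.count(b"\n"))
        # unused_data is whatever followed this member in the slice we passed in
        offset = len(data) - len(decoder.unused_data)
    return counts, offset


# Synthetic example: two valid members followed by bytes that are not gzip.
good = gzip.compress(b"a\nb\n") + gzip.compress(b"c\n")
damaged = good + b"\x00not gzip data"

print(scan_gzip_members(good))
print(scan_gzip_members(damaged))
```

If the returned offset is far short of the file size, as the 100 lines out of an expected 50,000 would suggest, most of the file is in a form that a standard gzip reader cannot interpret. Re-slicing the buffer for every member is quadratic in the worst case, which is acceptable for a one-off check.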
-
Re: Question about gzip compression when using Flume Ng
Sagar Mehta 2013-01-14, 23:27
About the bz2 suggestion, we have a ton of downstream jobs that assume gzip compressed files - so it is better to stick to gzip.

The plan B for us is to have an Oozie step to gzip compress the logs before proceeding with downstream Hadoop jobs - but that looks like a hack to me!!

Sagar
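For reference, the compression part of such a plan B step is small. The sketch below is a local-filesystem illustration in Python only: gzip_file is a hypothetical helper, the temporary paths and sample events are made up, and a real Oozie action would have to read from and write to HDFS instead.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(src_path, dst_path):
    """Compress src_path into a single-member gzip file at dst_path."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Round trip on a throwaway file to confirm that nothing is lost.
workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "events.log")
packed_path = plain_path + ".gz"

with open(plain_path, "wb") as f:
    f.write(b"event 1\nevent 2\n")

gzip_file(plain_path, packed_path)

with gzip.open(packed_path, "rb") as f:
    restored = f.read()

print(restored == b"event 1\nevent 2\n")
```

Writing the whole file through one gzip stream yields a single member, which downstream jobs that assume ordinary gzip files can read without special handling.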
-
Re: Question about gzip compression when using Flume Ng
Brock Noland 2013-01-14, 23:38
Hmm, could you try an updated version of Hadoop? CDH3u2 is quite old; I would upgrade to CDH3u5 or CDH 4.1.2.
-
Re: Question about gzip compression when using Flume Ng
Sagar Mehta 2013-01-15, 00:43
ok so I dropped in the new hadoop-core jar in /opt/flume/lib [I got some errors about the guava dependencies so put in that jar too]

smehta@collector102:/opt/flume/lib$ ls -ltrh | grep -e "hadoop-core" -e "guava"
-rw-r--r-- 1 hadoop hadoop 1.5M 2012-11-14 21:49 guava-10.0.1.jar
-rw-r--r-- 1 hadoop hadoop 3.7M 2013-01-14 23:50 hadoop-core-0.20.2-cdh3u5.jar

Now I don't even see the file being created in hdfs, and the flume log is happily talking about housekeeping for some file channel checkpoints, updating pointers et al

Below is the tail of the flume log

hadoop@collector102:/data/flume_log$ tail -10 flume.log
2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.Log - Updated checkpoint for file: /data/flume_data/channel2/data/log-36 position: 129415524 logWriteOrderID: 1358209947324
2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel2/data/log-34
2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.Log - Updated checkpoint for file: /data/flume_data/channel1/data/log-36 position: 129415524 logWriteOrderID: 1358209947323
2013-01-15 00:42:10,814 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel1/data/log-34
2013-01-15 00:42:10,819 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-34.meta currentPosition = 18577138, logWriteOrderID = 1358209947324
2013-01-15 00:42:10,819 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-34.meta currentPosition = 18577138, logWriteOrderID = 1358209947323
2013-01-15 00:42:10,820 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel1/data/log-35
2013-01-15 00:42:10,821 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFile - Closing RandomReader /data/flume_data/channel2/data/log-35
2013-01-15 00:42:10,826 [Log-BackgroundWorker-channel1] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-35.meta currentPosition = 217919486, logWriteOrderID = 1358209947323
2013-01-15 00:42:10,826 [Log-BackgroundWorker-channel2] INFO org.apache.flume.channel.file.LogFileV3 - Updating log-35.meta currentPosition = 217919486, logWriteOrderID = 1358209947324

Sagar
-
Re: Question about gzip compression when using Flume Ng
Brock Noland 2013-01-15, 00:54
Hi,
That's just the file channel. The HDFSEventSink will need a heck of a lot more than just those two jars. To override the version of hadoop it will find from the hadoop command, you probably want to set HADOOP_HOME in flume-env.sh to your custom install. Also, the client and server should be the same version.

Brock
-
Re: Question about gzip compression when using Flume Ng
Sagar Mehta 2013-01-15, 01:03
Hmm - good point!! Even in the best case, say this works, moving to a newer Hadoop version for the 2 production clusters that depend on it [400+ nodes] will need some thorough testing and won't be immediate.

I would have loved for the gzip compression part to have worked more or less out of the box, but for now the most likely option seems to be an Oozie step to pre-compress before downstream takes over.

I'm still open to suggestions/insights from this group, which has been super-prompt so far :)

Sagar
-
Re: Question about gzip compression when using Flume Ng
Connor Woodson 2013-01-15, 01:17
The issue appears to be in Hadoop's gzip compression, as Flume uses Hadoop's libraries to do it; and as you're using older libraries, the gzip support isn't as good. A possible problem is that the version of gzip implemented by Hadoop doesn't support concatenated files (I know there's an issue with concatenated bzip2 files and various versions of Hadoop). As such, bz2 probably also won't work.

It could be possible to do a workaround of creating a gzip serializer, and then writing to HDFS in binary form. I think you will also need to create a new writeFormat, as I'm not quite sure how the SequenceFile one works; but if the gzip bits are computed on the client side then you won't have to deal with whatever implementation of gzip your Hadoop uses.

Or you upgrade Hadoop. I have no idea which is easier. (Or you move your data to S3 and your map-reduce to EMR ;)

- Connor
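To make the "concatenated files" hypothesis above concrete, here is a minimal Python sketch. It is an illustration only, not a diagnosis of what Flume or Hadoop actually wrote in this thread; the function names and the sample events are made up for the example. It shows that a gzip file may hold several back-to-back members, and that a decoder which stops after the first member is left with unread bytes.

```python
import gzip
import zlib


def first_member_only(blob):
    """Decode only the first gzip member; return (data, leftover bytes)."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    data = decoder.decompress(blob)
    # Anything after the end of the first member lands in unused_data.
    return data, decoder.unused_data


def all_members(blob):
    """Decode every concatenated member, as a multi-member aware reader does."""
    return gzip.decompress(blob)


# Two independently compressed chunks written back to back (hypothetical events).
part_a = gzip.compress(b"event 1\n")
part_b = gzip.compress(b"event 2\n")
blob = part_a + part_b

if __name__ == "__main__":
    data, leftover = first_member_only(blob)
    print(len(data), "bytes decoded,", len(leftover), "bytes left unread")
    print(all_members(blob))
```

A reader of the first kind would report far fewer events than were written, which is the general shape of the symptom reported earlier in the thread; whether that is the actual cause here is not established.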
-
Re: Question about gzip compression when using Flume Ng
Bhaskar V. Karambelkar 2013-01-15, 01:25
Sagar,

You're better off downloading and unzipping CDH3u5 or CDH4 somewhere, and pointing the HADOOP_HOME env. variable to the base directory. That way you won't have to worry about which jar files are needed and which are not. Flume will auto-add all JARs from the Hadoop installation that it needs.

regards
Bhaskar
-
Re: Question about gzip compression when using Flume Ng
Connor Woodson 2013-01-15, 01:26
Is there any incompatibility in trying to write to a different version of Hadoop then?

- Connor
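Earlier in the thread Brock advised that the client and server should be the same version. A rough way to spot a mismatch is to compare the hadoop-core jar names on each side. The Python sketch below assumes jars follow the `hadoop-core-<version>.jar` naming seen in this thread (other Hadoop packagings may name their jars differently), and the directory arguments are placeholders. It only compares file names; it does not prove wire compatibility.

```python
import re
from pathlib import Path

# Assumption: jar names look like hadoop-core-0.20.2-cdh3u5.jar, as in this thread.
JAR_PATTERN = re.compile(r"^hadoop-core-(.+)\.jar$")


def hadoop_core_versions(directory):
    """Return the sorted hadoop-core version strings found in a directory."""
    versions = set()
    for entry in Path(directory).iterdir():
        match = JAR_PATTERN.match(entry.name)
        if match:
            versions.add(match.group(1))
    return sorted(versions)


def versions_agree(client_dir, server_dir):
    """True only if both sides carry exactly one, identical hadoop-core version."""
    client = hadoop_core_versions(client_dir)
    server = hadoop_core_versions(server_dir)
    return len(client) == 1 and client == server
```

For example, `hadoop_core_versions("/opt/flume/lib")` on the collector could be compared with the same call run against a copy of the cluster's Hadoop lib directory.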
-
Re: Question about gzip compression when using Flume Ng
Sagar Mehta 2013-01-15, 01:52
As for S3, in fact that is our current architecture :) [EMR computations were 2 years back - now we do them locally in our clusters] and we want to move away from it, since our Hadoop computations happen in our own clusters and we end up pulling data from S3 every hour, which we would prefer to be local, besides us having connectivity issues with S3.

As for upgrading Hadoop, yes, that is on the near-term roadmap, but as I said, since this is a reasonably sized production cluster [400+ nodes in all], the change won't happen overnight.

I also tried changing hdfs.fileType to SequenceFile, but then it complained about the need for some native Hadoop code.

Sagar
-
Re: Question about gzip compression when using Flume Ng
Sagar Mehta 2013-01-15, 02:36
Bhaskar,

Your suggestion worked like magic!! I don't believe my eyes!!

hadoop@jobtracker301:/home/hadoop/sagar/debug$ hget /ngpipes-raw-logs/2013-01-15/0200/collector102.ngpipes.sac.ngmoco.com.1358216630511.gz .
hadoop@jobtracker301:/home/hadoop/sagar/debug$ gunzip collector102.ngpipes.sac.ngmoco.com.1358216630511.gz
hadoop@jobtracker301:/home/hadoop/sagar/debug$ ls -ltrh
total 34M
-rw-r--r-- 1 hadoop hadoop 34M 2013-01-15 02:29 collector102.ngpipes.sac.ngmoco.com.1358216630511

The file decompresses fine!!

This is what I did:

- Downloaded the latest Cloudera stuff here - https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation
- It installed hadoop to /usr/lib and I pointed the HADOOP_HOME to /usr/lib/hadoop and restarted Flume!!
- That's it!! - time to party :)

Thank you so much guys for your prompt replies!!

Sagar
|
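The manual gunzip check above can also be scripted. This is a minimal sketch, assuming Python 3 is available on the machine doing the check; `count_gzip_lines` is a hypothetical helper name. It reads a .gz file to the end and returns the line count, or None if the file does not decode cleanly. As far as I know, Python's gzip module raises an error on non-gzip trailing bytes where gunzip only prints a warning, so a file with "trailing garbage" would come back as None here.

```python
import gzip
import zlib


def count_gzip_lines(path):
    """Return the number of lines if the whole file decodes as gzip, else None."""
    try:
        with gzip.open(path, "rb") as handle:
            # Iterating to the end forces every member to be read and checked.
            return sum(1 for _ in handle)
    except (OSError, EOFError, zlib.error):
        # OSError covers bad or trailing non-gzip data; EOFError covers truncation.
        return None
```

Comparing the returned count with the expected number of events for the window (about 50,000 per 5 minutes in this thread) gives a quick sanity check on each rolled file.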