-
Flume/HDFS Encoding Cormier, Christopher 2012-12-14, 20:48
Hello All,
I'm also a new user to Flume and was hoping someone could point me in the right direction or tell me what silly little piece I'm missing from the puzzle. I apologize if this has been covered but after searching for a few days I couldn't find anything that helped. Also if there's a better suited group for this to be posted to just let me know.

I have flume configured to read from a log4j log file using a tail source and send data into an HDFS sink. All of the plumbing seems to work fine - I'm able to query the data using a quick map reduce job and verify that the entries are in fact getting into Hadoop. What's interesting (annoying) is some additional characters that are being added to each request.

Running hadoop dfs -cat somefile I get something like this (where [Data_From_The_Log_Here] is properly formatted and looks valid from what I can tell):

    SEQ!org.apache.hadoop.io.LongWritableorg.apache.hadoop.io.TextY] õpµ^R÷ï³¬Õ ;*j 7[Data_From_The_Log_Here]ÿÿÿÿY] õpµ^R÷ï³¬Õ ;*j [Data_From_The_Log_Here]ÿÿÿÿY] õpµ^R÷ï³¬Õ ;*j% Î[Data_From_The_Log_Here]ÿÿÿÿY] õpµ^R÷ï³¬Õ ;*jF ½[Data_From_The_Log_Here]

Here's the flume config:

    requestToHDFS.channels = MemoryChannel
    requestToHDFS.sinks = HDFS
    requestToHDFS.sources = Tail
    requestToHDFS.sources.Tail.channels = MemoryChannel
    requestToHDFS.sources.Tail.interceptors = ts
    requestToHDFS.sources.Tail.interceptors.ts.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
    requestToHDFS.sources.Tail.type = exec
    requestToHDFS.sources.Tail.command = tail -F /path/to/someLogFile.log
    requestToHDFS.sinks.HDFS.channel = MemoryChannel
    requestToHDFS.sinks.HDFS.type = hdfs
    requestToHDFS.sinks.HDFS.hdfs.path = hdfs://somehadoopserver:9000/logs/%Y/%m/%d/%H
    requestToHDFS.sinks.HDFS.hdfs.file.Type = DataStream
    # also tried...
    #requestToHDFS.sinks.HDFS.hdfs.file.Type = SequenceFile
    requestToHDFS.sinks.HDFS.hdfs.writeFormat=Text
    requestToHDFS.sinks.HDFS.hdfs.batchSize = 10
    requestToHDFS.sinks.HDFS.hdfs.rollSize = 0
    requestToHDFS.sinks.HDFS.hdfs.rollCount = 10000
    requestToHDFS.sinks.HDFS.hdfs.rollInterval = 600
    requestToHDFS.channels.MemoryChannel.type = memory
    requestToHDFS.channels.MemoryChannel.capacity = 10000
    requestToHDFS.channels.transactionCapacity = 100

I'm able to get around the issue by doing some parsing in a map reduce job to isolate the log entries I want, but it seems like I'm missing something. The additional characters/encoding/whatever on each line seems to have some data that Flume uses for sending events across the wire. Is there a way to eliminate this before a record is sent into HDFS? Is this just the way records are stored in HDFS and I need to account for the additional characters when querying the data?

Ideally the entries in Hadoop would look something like this:

    [Data_From_The_Log_Here]
    [Data_From_The_Log_Here]
    [Data_From_The_Log_Here]

Versions are as follows:

    Flume 1.2.0
    Subversion https://svn.apache.org/repos/asf/flume/tags/flume-1.2.0-rc1 -r 1360090

    Hadoop 1.1.1
    Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1411108

Thanks in advance!

Chris
-
Re: Flume/HDFS Encoding Brock Noland 2012-12-14, 20:52
Hi,
On Fri, Dec 14, 2012 at 2:48 PM, Cormier, Christopher <[EMAIL PROTECTED]> wrote:
> SEQ!org.apache.hadoop.io.LongWritableorg.apache.hadoop.io.TextY] õpµ^R÷ï³¬Õ

This is a SequenceFile.

> requestToHDFS.sinks.HDFS.hdfs.file.Type = DataStream
> # also tried...
> #requestToHDFS.sinks.HDFS.hdfs.file.Type = SequenceFile

The parameter is hdfs.fileType. See here:

http://flume.apache.org/FlumeUserGuide.html#hdfs-sink

It sounds like you want a text file so you should use DataStream.

Brock
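For anyone who wants to confirm this diagnosis without relying on how -cat renders binary bytes, here is a minimal Python sketch. It assumes the file has first been copied to local disk (for example with hadoop fs -get), and the helper names and sample bytes are hypothetical. It relies only on the fact that a SequenceFile starts with the three bytes SEQ followed by a one-byte version; the version value 6 in the sample is illustrative.

```python
def looks_like_sequence_file(header: bytes) -> bool:
    """Return True if the bytes start with the SequenceFile magic 'SEQ' plus a version byte."""
    return len(header) >= 4 and header[:3] == b"SEQ"


def check_path(path: str) -> bool:
    """Read the first four bytes of a local file and test them for the magic."""
    with open(path, "rb") as handle:
        return looks_like_sequence_file(handle.read(4))


# Hypothetical sample header: magic, version byte, then a length-prefixed key
# class name. The length of the class name is 33, which prints as "!", hence
# the "SEQ!org.apache.hadoop.io.LongWritable" seen in the quoted output.
key_class = b"org.apache.hadoop.io.LongWritable"
sample = b"SEQ\x06" + bytes([len(key_class)]) + key_class

print(looks_like_sequence_file(sample))
print(looks_like_sequence_file(b"[Data_From_The_Log_Here]\n"))
```

The first call reports a SequenceFile and the second does not, so a plain-text DataStream file should fail the check.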
-
RE: Flume/HDFS Encoding Cormier, Christopher 2012-12-14, 20:59
Thanks Brock,
When I run it as a DataStream I still get some weird characters between records.

    [DATA_HERE]ÿÿÿÿ×ùÎ0ÆÜ9Ig::¬ ;) [DATA_HERE]ÿÿÿÿ×ùÎ0ÆÜ9Ig::¬ ;) Î[DATA_HERE]ÿÿÿÿ×ùÎ0ÆÜ9Ig::¬ ;0 ½[DATA_HERE]ÿÿÿÿ×ùÎ0ÆÜ9Ig::¬ ;0 :[DATA_HERE]

I was hoping to avoid the ÿÿÿÿ and spaces (I'm assuming they're characters that are encoded such that -cat won't show them).

Any thoughts?

Thanks again,

Chris
-
Re: Flume/HDFS Encoding Brock Noland 2012-12-14, 21:02
Did you fix the parameter naming problem I described in the earlier message?
-
RE: Flume/HDFS Encoding Cormier, Christopher 2012-12-14, 21:08
Of course not. That would have been the smart thing to do.
Both suggestions fixed the problem and it all works as expected. Thank you, Brock, for the quick responses.

-Chris
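The root cause in this thread was a misspelled key (hdfs.file.Type instead of hdfs.fileType). The sink appears to have ignored it without complaint and kept its default file type, which the Flume User Guide lists as SequenceFile. A rough Python sketch of a pre-flight check that would have caught it is below. The function name and the KNOWN_HDFS_KEYS set are hypothetical, and the set is deliberately partial: it holds only the sink keys used in this thread, so extend it from the user guide before trusting it.

```python
# Sink keys used in this thread, spelled as in the Flume User Guide.
# Assumption: this set is partial and only meant for illustration.
KNOWN_HDFS_KEYS = {
    "hdfs.path", "hdfs.fileType", "hdfs.writeFormat",
    "hdfs.batchSize", "hdfs.rollSize", "hdfs.rollCount", "hdfs.rollInterval",
}


def unknown_hdfs_keys(config_text: str) -> list:
    """Return full property names whose 'hdfs.*' part is not in KNOWN_HDFS_KEYS."""
    suspects = []
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not a key=value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name = line.split("=", 1)[0].strip()
        # Expected shape: <agent>.sinks.<sink>.<key>
        parts = name.split(".", 3)
        if len(parts) == 4 and parts[1] == "sinks" and parts[3].startswith("hdfs."):
            if parts[3] not in KNOWN_HDFS_KEYS:
                suspects.append(name)
    return suspects


broken = """requestToHDFS.sinks.HDFS.type = hdfs
requestToHDFS.sinks.HDFS.hdfs.file.Type = DataStream
requestToHDFS.sinks.HDFS.hdfs.writeFormat=Text
"""
fixed = broken.replace("hdfs.file.Type", "hdfs.fileType")

print(unknown_hdfs_keys(broken))
print(unknown_hdfs_keys(fixed))
```

Run against the original config, the check flags only the hdfs.file.Type line; after renaming the key to hdfs.fileType it reports nothing.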