Re: Flume logs http request info
Hari Shreedharan 2013-02-27, 19:08
Thomas,

Looks like your data is written out as text. It is possible that Flume wrote out the entire event, but your HDFS cluster failed to allocate a fresh block after persisting half of the row. In that case a dangling partial event can be left in the file, and Flume will retry the whole event because HDFS throws an exception, so the same data may show up once truncated and once complete.

Either use a binary format in which malformed data can be identified and discarded easily, or make sure the job reading the data can ignore malformed rows. I am not a Hive expert, but you can select only the rows from a table that match a certain criterion, and keeping a non-nullable last column is a good check: if the last column is NULL, the row was probably not written out completely and can be ignored.
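
For example, assuming the Hive table is user_events and its last column is called last_column (the column name is made up here), something along these lines would filter out the truncated rows:

  -- keep only rows whose last column was written out completely
  SELECT * FROM user_events WHERE last_column IS NOT NULL;
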
Hope this helps.
Hari

--
Hari Shreedharan
On Wednesday, February 27, 2013 at 3:25 AM, Thomas Adam wrote:

> Hi,
>
> I have an issue with my Flume agents, which collect JSON data and save
> it to an HDFS store for Hive. Today my daily job broke because of
> malformed rows. I looked into these files to see what happened, and I
> found something like this in one of them:
>
> ...
> POST / HTTP/1.0
> Host: localhost:50000
> Content-Length: 185
> Content-Type: application/x-www-form-urlencoded
> ...
>
> And this breaks my JSON SerDe in Hive. IMHO the Flume agents log this
> data themselves, and I'm sure that I don't send anything like this.
>
> I have two Flume agents.
> The first one collects data from my application with the HTTPSource:
>
> http.sources = user_events
> http.channels = user_events
> http.sinks = user_events
>
> http.sources.user_events.type = org.apache.flume.source.http.HTTPSource
> http.sources.user_events.port = 50000
> http.sources.user_events.interceptors = timestamp
> http.sources.user_events.interceptors.timestamp.type = timestamp
> http.sources.user_events.channels = user_events
>
> http.channels.user_events.type = memory
> http.channels.user_events.capacity = 100000
> http.channels.user_events.transactionCapacity = 1000
>
> http.sinks.user_events.type = avro
> http.sinks.user_events.channel = user_events
> http.sinks.user_events.hostname = 10.2.0.190
> http.sinks.user_events.port = 20000
> http.sinks.user_events.batch-size = 100
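>
> Since no handler is configured on the source, it uses the default
> JSONHandler, so events arrive as a JSON array of header/body objects,
> for example (simplified, not my real payload):
>
> [{"headers": {}, "body": "{\"user\": \"123\", \"action\": \"click\"}"}]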
>
> And the second agent puts the data into HDFS:
>
> hdfs.sources = user_events
> hdfs.channels = user_events
> hdfs.sinks = user_events
>
> hdfs.sources.user_events.type = avro
> hdfs.sources.user_events.channels = user_events
> hdfs.sources.user_events.bind = 10.2.0.190
> hdfs.sources.user_events.port = 20000
>
> hdfs.channels.user_events.type = memory
> hdfs.channels.user_events.capacity = 100000
> hdfs.channels.user_events.transactionCapacity = 1000
>
> hdfs.sinks.user_events.type = hdfs
> hdfs.sinks.user_events.channel = user_events
> hdfs.sinks.user_events.hdfs.path = hdfs://10.2.0.190:8020/user/beeswax/warehouse/user_events/dt=%Y-%m-%d/hour=%H
> hdfs.sinks.user_events.hdfs.filePrefix = flume
> hdfs.sinks.user_events.hdfs.rollInterval = 600
> hdfs.sinks.user_events.hdfs.rollSize = 134217728
> hdfs.sinks.user_events.hdfs.rollCount = 0
> hdfs.sinks.user_events.hdfs.batchSize = 1000
> hdfs.sinks.user_events.hdfs.fileType = DataStream
>
> It has worked for 3 months without any problems, and I haven't changed
> anything in that time.
> I use Flume 1.3.0 and CDH 4.1.2.
>
> I hope someone can help me resolve this issue.
>
> Thanks & Regards
> Thomas
>
>