Re: Flume logs http request info
Thomas,

Looks like your data is written out as text. It is possible that while Flume was writing out an event, your HDFS cluster failed to allocate a fresh block after persisting half the row. In that case a dangling partial event can be left behind, and Flume will retry the whole event because HDFS throws an exception. Either use a binary format in which malformed data can be easily identified and discarded, or make sure the job reading the data can skip malformed rows. I am not a Hive expert, but you can select only the rows from a table that match a certain criterion, and making sure the last column should never legitimately be null is a good check: if the last column is null (select * from table where last_column is not null), the row may not have been written out completely and can be ignored.
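
For example, a filter along these lines might work; this is only a sketch, and the table name (user_events) and last column name (event_time) are placeholders for whatever your Hive schema actually uses:

  -- Skip rows whose last declared column is NULL; such rows were
  -- probably truncated mid-write and cannot be parsed reliably.
  SELECT *
  FROM user_events
  WHERE event_time IS NOT NULL;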
Hope this helps.
Hari

--
Hari Shreedharan
On Wednesday, February 27, 2013 at 3:25 AM, Thomas Adam wrote:

> Hi,
>
> I have an issue with my Flume agents, which collect JSON data and save
> it to an HDFS store for Hive. Today my daily job broke because of
> malformed rows. I looked in the files to see what happened, and I
> found something like this in one of them:
>
> ...
> POST / HTTP/1.0
> Host: localhost:50000
> Content-Length: 185
> Content-Type: application/x-www-form-urlencoded
> ...
>
> And this breaks my JSON SerDe in Hive. It looks like the Flume agents are
> logging HTTP request data themselves; I'm sure I don't send anything like this.
>
> I have two Flume agents.
> The first one collects data from my application with the HTTPSource:
>
> http.sources = user_events
> http.channels = user_events
> http.sinks = user_events
>
> http.sources.user_events.type = org.apache.flume.source.http.HTTPSource
> http.sources.user_events.port = 50000
> http.sources.user_events.interceptors = timestamp
> http.sources.user_events.interceptors.timestamp.type = timestamp
> http.sources.user_events.channels = user_events
>
> http.channels.user_events.type = memory
> http.channels.user_events.capacity = 100000
> http.channels.user_events.transactionCapacity = 1000
>
> http.sinks.user_events.type = avro
> http.sinks.user_events.channel = user_events
> http.sinks.user_events.hostname = 10.2.0.190
> http.sinks.user_events.port = 20000
> http.sinks.user_events.batch-size = 100
>
> And the second agent puts the data into HDFS:
>
> hdfs.sources = user_events
> hdfs.channels = user_events
> hdfs.sinks = user_events
>
> hdfs.sources.user_events.type = avro
> hdfs.sources.user_events.channels = user_events
> hdfs.sources.user_events.bind = 10.2.0.190
> hdfs.sources.user_events.port = 20000
>
> hdfs.channels.user_events.type = memory
> hdfs.channels.user_events.capacity = 100000
> hdfs.channels.user_events.transactionCapacity = 1000
>
> hdfs.sinks.user_events.type = hdfs
> hdfs.sinks.user_events.channel = user_events
> hdfs.sinks.user_events.hdfs.path = hdfs://10.2.0.190:8020/user/beeswax/warehouse/user_events/dt=%Y-%m-%d/hour=%H
> hdfs.sinks.user_events.hdfs.filePrefix = flume
> hdfs.sinks.user_events.hdfs.rollInterval = 600
> hdfs.sinks.user_events.hdfs.rollSize = 134217728
> hdfs.sinks.user_events.hdfs.rollCount = 0
> hdfs.sinks.user_events.hdfs.batchSize = 1000
> hdfs.sinks.user_events.hdfs.fileType = DataStream
>
> It has worked for 3 months without any problems, and I haven't changed
> anything in that time.
> I use Flume 1.3.0 and CDH 4.1.2.
>
> I hope someone can help me resolve this issue.
>
> Thanks & Regards
> Thomas
>
>