Re: Converting text to avro in Flume
The Avro Sink is used for communication between Flume agents. To insert directly into HDFS you simply use an Avro serializer with the HDFS sink.
Thanks,
Hari
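
A minimal sketch of the sink setup described above, reusing the agent and component names from the configuration further down this thread (the path and file suffix are just illustrative):

tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.channel = c1
tier1.sinks.sink1.hdfs.path = /tmp/flumecollector
tier1.sinks.sink1.hdfs.fileSuffix = .avro
# DataStream avoids the default SequenceFile wrapping
tier1.sinks.sink1.hdfs.fileType = DataStream
# note: serializer is a sink-level property, not an hdfs.* property
tier1.sinks.sink1.serializer = avro_event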
On Sunday, October 6, 2013 at 3:38 PM, Deepak Subhramanian wrote:

> Hi Hari,
> I tried using an avro sink after the HTTPSource and then an avro source and HDFS sink, and it seems to be working. Do we have to use an avro sink first, or can we convert directly to avro using the HDFS sink?
>
> Thanks, Deepak
>
>
> On Sun, Oct 6, 2013 at 11:27 PM, Deepak Subhramanian <[EMAIL PROTECTED]> wrote:
> > There was a mistake in my configuration. I had hdfs in front of serializer.
> > Changed
> > tier1.sinks.sink1.hdfs.serializer = avro_event
> >
> > to tier1.sinks.sink1.serializer = avro_event
> >
> > But it is still generating a sequence file. This is what I get.
> >
> > SEQ!org.apache.hadoop.io.LongWritableorg.apache.hadoop.io.TextK???2-%??-/?? A??,? ?<message>xmldata</message>
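
A quick way to confirm what the sink actually wrote is to look at the file's leading magic bytes: SequenceFiles start with "SEQ", Avro container files start with "Obj". A rough check, assuming the hdfs.path, filePrefix and fileSuffix from the configuration further down this thread:

# prints "SEQ..." for a SequenceFile, "Obj" plus a version byte for an Avro container file
hdfs dfs -cat /tmp/flumecollector/access_log*.avro | head -c 4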
> >
> >
> > On Fri, Oct 4, 2013 at 10:43 PM, Deepak Subhramanian <[EMAIL PROTECTED]> wrote:
> > > Thanks Hari.
> > >
> > > I specified the fileType. This is what I have. I will try again and let you know.
> > >
> > > tier1.sources  = httpsrc1
> > > tier1.channels = c1  
> > > tier1.sinks    = sink1  
> > >  
> > > tier1.sources.httpsrc1.bind     = 127.0.0.1
> > > tier1.sources.httpsrc1.type = http
> > > tier1.sources.httpsrc1.port = 9999
> > > tier1.sources.httpsrc1.channels = c1
> > > tier1.sources.httpsrc1.handler = spikes.flume.XMLHandler
> > > tier1.sources.httpsrc1.handler.nickname = HTTPTesting
> > >
> > > tier1.channels.c1.type   = memory
> > > tier1.channels.c1.capacity = 100
> > > #tier1.sinks.sink1.type         = logger
> > > tier1.sinks.sink1.channel      = c1
> > >
> > >  
> > >  tier1.sinks.sink1.type = hdfs
> > >  
> > > tier1.sinks.sink1.hdfs.path = /tmp/flumecollector
> > > tier1.sinks.sink1.hdfs.filePrefix = access_log
> > > tier1.sinks.sink1.hdfs.fileSuffix = .avro
> > > tier1.sinks.sink1.hdfs.fileType = DataStream
> > > tier1.sinks.sink1.hdfs.serializer =  avro_event
> > >
> > > I also added this later.
> > > tier1.sinks.sink1.hdfs.serializer.appendNewline = true
> > > tier1.sinks.sink1.hdfs.serializer.compressionCodec = snappy
> > >  
> > >
> > >
> > >
> > >
> > > On Fri, Oct 4, 2013 at 4:56 PM, Hari Shreedharan <[EMAIL PROTECTED]> wrote:
> > > > The default file type for the HDFS Sink is SequenceFile. Set hdfs.fileType to DataStream. See details here: http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
> > > >
> > > >
> > > > Thanks,
> > > > Hari
> > > >
> > > >
> > > > On Friday, October 4, 2013 at 6:52 AM, Deepak Subhramanian wrote:
> > > >
> > > > > I tried using the HDFS Sink to generate the avro file by using the serializer as avro_event. But it is not generating an avro file, but a sequence file. Is it not supposed to generate an avro file with the default schema? Or do I have to generate the avro data from text in my HTTPHandler source?
> > > > >
> > > > >  "{ \"type\":\"record\", \"name\": \"Event\", \"fields\": [" +
> > > > >       " {\"name\": \"headers\", \"type\": { \"type\": \"map\", \"values\": \"string\" } }, " +
> > > > >       " {\"name\": \"body\", \"type\": \"bytes\" } ] }");  
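
With that default avro_event schema, each Flume event becomes one Avro record: a map of string headers plus the raw body bytes. Dumped as JSON, a record for the XML above would look roughly like the following (header contents are illustrative and depend on the source and handler):

{"headers": {}, "body": "<message>xmldata</message>"}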
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Oct 3, 2013 at 3:36 PM, Deepak Subhramanian <[EMAIL PROTECTED]> wrote:
> > > > > > Hi ,
> > > > > >
> > > > > > I want to convert xml files in text to an avro file and store it in hdfs. I get the xml files as a post request. I extended the HTTPHandler to process the XML post request. Do I have to convert the data in text to avro in the HTTPHandler, or does the Avro Sink or HDFS Sink convert it directly to avro with some configuration details? I want to store the entire xml string in an avro variable.
> > > > > >
> > > > > > Thanks in advance for any inputs.
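
With the avro_event serializer on the HDFS sink doing the Avro wrapping, the HTTP handler itself only has to turn each POST body into a Flume event. A rough sketch of what a handler like the spikes.flume.XMLHandler referenced in the configuration above might look like; everything beyond the HTTPSourceHandler interface, Context and EventBuilder calls is an assumption, not code from this thread:

package spikes.flume; // package guessed from the handler class name in the config

import java.io.BufferedReader;
import java.nio.charset.Charset;
import java.util.Collections;
import java.util.List;

import javax.servlet.http.HttpServletRequest;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.http.HTTPSourceHandler;

public class XMLHandler implements HTTPSourceHandler {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    private String nickname;

    @Override
    public void configure(Context context) {
        // picks up tier1.sources.httpsrc1.handler.nickname from the agent config
        nickname = context.getString("nickname", "unknown");
    }

    @Override
    public List<Event> getEvents(HttpServletRequest request) throws Exception {
        // Read the raw XML POST body; no Avro handling is needed here, since the
        // avro_event serializer on the HDFS sink wraps the event body later.
        StringBuilder xml = new StringBuilder();
        BufferedReader reader = request.getReader();
        String line;
        while ((line = reader.readLine()) != null) {
            xml.append(line).append('\n');
        }
        Event event = EventBuilder.withBody(xml.toString(), UTF8);
        event.getHeaders().put("nickname", nickname);
        return Collections.singletonList(event);
    }
}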