

Re: Syslog Infrastructure with Flume
Thanks, Roshan. I understand that it makes it easier for us to use HCatalog, but I am not sure what percentage of Hive users actually use HCat. If we simply use Hive directly, we would be able to address a larger community, which I would definitely like (though I don't know how feasible it is). I think it might be better to use Hive directly, at least to make it more useful to a larger community.
Hari

--
Hari Shreedharan
On Wednesday, October 31, 2012 at 1:31 PM, Roshan Naik wrote:

> Hari,
>   Indeed, from the end user's point of view, both would accomplish roughly the same thing. From the implementation standpoint, however, the Hive sink would have to deal with HDFS (for data transfer) and the Hive metastore separately to get the job done. The HCat sink implementation, on the other hand, would accomplish the same using just the HCat APIs (for both aspects). The HCat sink implementation would be much simpler and cleaner, as it won't have to reinvent things (even if we reuse code from the HDFS sink). The grunt work of moving data and "transactionally committing" it into partitions is handled by the HCat APIs.
> -roshan
>
>
>
> On Wed, Oct 31, 2012 at 12:22 PM, Hari Shreedharan <[EMAIL PROTECTED]> wrote:
> > Roshan,
> >
> > I am not a Hive/HCatalog pro, but I am just wondering: why an HCatalog sink rather than a Hive sink? Hive is definitely very popular, and a Hive sink would be well appreciated if we could get one written. Since HCatalog is supposed to be compatible with the Hive metastore (right?), why not just implement a Hive sink and make it available to a larger community? I'd definitely like to see a Hive sink, and would prioritize that and then, if explicitly required, add HCatalog support to it. This way it is useful to people who use Hive as well. In fact, there already is a Hive sink JIRA here: https://issues.apache.org/jira/browse/FLUME-1008.
> >
> > I am a +1 for a Hive Sink, so please take a look.
> >
> >
> > Thanks,
> > Hari
> >
> >
> > --
> > Hari Shreedharan
> >
> >
> > On Monday, October 29, 2012 at 4:37 PM, Roshan Naik wrote:
> >
> > > I am in the process of investigating the possibility of creating an HCatalog sink for Flume, which should be able to handle such use cases. For your use case, it could be thought of as a Hive sink. The goal is basically as follows: it would allow multiple Flume agents to pump logs into Hive tables. That would make the data queryable without additional manual steps. Data will get added periodically to Hive in the form of new partitions. You would not have to deal with temporary files or manual addition of data into Hive.
> > >
> > > -roshan
> > >
> > >
> > >
> > > > On Sun, Oct 28, 2012 at 5:45 PM, Ralph Goers <[EMAIL PROTECTED]> wrote:
> > > > Since you ask...
> > > >
> > > > > In our environment, our primary concern is audit logs: we have to audit banking transactions as well as changes administrators make. We have a legacy system that needed to be integrated, and its records were in a form different from what we want stored. We also need to allow administrators to view events as close to real time as possible. Plus, we have to aggregate data across 2 data centers. Although we are currently not including web server access logs, we plan to integrate them over time. We also have requirements from our security team to pass events to ArcSight for their use.
> > > >
> > > > 1. We have a "log extractor" that receives legacy events as they occur and converts them into our new format and passes them to Flume. All new applications use the Log4j 2 Flume Appender to get data to Flume.
> > > > 2. Flume passes the data to ArcSight for our security team's use.
> > > > 3. We wrote a Flume to Cassandra Sink.
> > > > 4. We wrote our own REST query services to retrieve the data from Cassandra.
> > > > > 5. Since we are using the DataStax Enterprise version of Cassandra, we have also set up "Analytic" nodes that run Hadoop on top of Cassandra. This allows the data to be accessed via normal Hadoop tools for data analytics.
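
For readers trying to picture steps 1 to 3 of Ralph's list, a single Flume agent feeding both ArcSight and Cassandra might be wired roughly as below. This is only a sketch under assumptions, not Ralph's actual setup: the thread does not say how the fan-out is done, so the replicating channel selector, the agent and component names, the port, the directories, and both `com.example...` sink classes are hypothetical placeholders. The custom sinks he mentions are not part of Flume itself.

```properties
# Hypothetical agent named "audit": one Avro source fanned out to two sinks.
audit.sources = avroIn
audit.channels = toArcSight toCassandra
audit.sinks = arcSight cassandra

# The Log4j 2 Flume Appender and the "log extractor" are assumed to send Avro events here.
audit.sources.avroIn.type = avro
audit.sources.avroIn.bind = 0.0.0.0
audit.sources.avroIn.port = 4141
audit.sources.avroIn.channels = toArcSight toCassandra
# Copy every event to both channels (assumption: both consumers need all events).
audit.sources.avroIn.selector.type = replicating

# Durable file channels; each one needs its own checkpoint and data directories.
audit.channels.toArcSight.type = file
audit.channels.toArcSight.checkpointDir = /var/flume/arcsight/checkpoint
audit.channels.toArcSight.dataDirs = /var/flume/arcsight/data
audit.channels.toCassandra.type = file
audit.channels.toCassandra.checkpointDir = /var/flume/cassandra/checkpoint
audit.channels.toCassandra.dataDirs = /var/flume/cassandra/data

# Placeholder class names: custom sinks are referenced by fully qualified class name.
audit.sinks.arcSight.type = com.example.flume.ArcSightSink
audit.sinks.arcSight.channel = toArcSight
audit.sinks.cassandra.type = com.example.flume.CassandraSink
audit.sinks.cassandra.channel = toCassandra
```

Giving each sink its own channel means a slow or unavailable destination backs up only its own queue, which seems to fit the near-real-time viewing requirement described above.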