Flume, mail # dev - Questions about Morphline Solr Sink structure

Re: Questions about Morphline Solr Sink structure
Wolfgang Hoschek 2013-11-11, 19:54
Hi Otis,

You bring up a lot of very good points here. I'll try to answer as best I can...

In the early days this Flume Sink started out as being very Solr specific. Over time I have made it more generic and steadily reduced its dependency on Solr, and at this point there is in fact no dependency on Solr left in the code (except in some tests that straddle the boundary between unit tests and integration tests). So in effect it wouldn't be technically wrong to refer to this as a Morphline Sink. The name simply reflects how the sink evolved, and it has been kept for backwards compatibility.

You could easily use this sink to extract, transform and load data into Elasticsearch (ES), or any other app, database or storage system, without pulling in any Solr related jar. To do so you'd write a loadElasticSearch morphline command in a separate morphline Maven module, and use that command instead of the loadSolr command in your morphline config files. The new loadElasticSearch command would convert a morphline record to a data structure appropriate for ES, e.g. ES JSON/Smile, and send that to ES. That's all there is to it, really.
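
To make the conversion step concrete, here is a minimal sketch in plain Java. It is not Morphline or ES API code: the class name EsDocumentSketch and the method toDocument are hypothetical, the record is modeled as a plain map from field name to list of values, and the rule of unwrapping single-valued fields into scalars is just one assumption about what an ES-friendly document could look like. Serializing the result to JSON/Smile and sending it to ES is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Morphlines, Flume, or Elasticsearch.
final class EsDocumentSketch {

    // Turns a multi-valued record into a document-style map:
    // fields with no values are skipped, single-valued fields become
    // scalars, and multi-valued fields stay as lists (copied defensively).
    static Map<String, Object> toDocument(Map<String, List<Object>> record) {
        Map<String, Object> doc = new LinkedHashMap<>();
        for (Map.Entry<String, List<Object>> entry : record.entrySet()) {
            List<Object> values = entry.getValue();
            if (values == null || values.isEmpty()) {
                continue; // nothing to index for this field
            }
            Object value = (values.size() == 1)
                    ? values.get(0)
                    : new ArrayList<>(values);
            doc.put(entry.getKey(), value);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, List<Object>> record = new LinkedHashMap<>();
        record.put("id", Arrays.<Object>asList("42"));
        record.put("tags", Arrays.<Object>asList("flume", "morphline"));
        System.out.println(toDocument(record));
    }
}
```

A real loadElasticSearch command would do this kind of mapping inside its record-processing method and then hand the document to an ES client.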

A morphline record is essentially a hash table where each key is a string and each value is a list of arbitrary Java objects. Those Java objects are typically Strings and Integers, but they can also be InputStreams or byte[] BLOBs, Avro objects, etc. This data model corresponds exactly to the features of the Lucene data model. It can also be seen as a superset of the Flume event data model - the Flume body is a byte[] value in the morphline _attachment_body field. The data model also maps well to the relational model. It can also be used for hierarchical data, considering that the values in a morphline record field can be Avro, JSON, XML, protobufs, or any other custom complex data structure.
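
As a rough illustration of that data model, the sketch below models a record as a map from String keys to lists of arbitrary objects and shows how a Flume event could be folded into it. The class RecordSketch and its methods are hypothetical stand-ins, not the actual morphline Record class; only the _attachment_body field name comes from the description above, and copying Flume headers into same-named fields is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a morphline record: field name -> list of values.
final class RecordSketch {
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    // Appends a value to a field; a field may hold many values of any type.
    void put(String key, Object value) {
        fields.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Returns all values of a field, or an empty list if the field is absent.
    List<Object> get(String key) {
        List<Object> values = fields.get(key);
        return values == null ? Collections.emptyList() : values;
    }

    // Assumed mapping: each Flume header becomes a field, and the event
    // body is stored as a byte[] under the _attachment_body field.
    static RecordSketch fromFlumeEvent(Map<String, String> headers, byte[] body) {
        RecordSketch record = new RecordSketch();
        for (Map.Entry<String, String> header : headers.entrySet()) {
            record.put(header.getKey(), header.getValue());
        }
        record.put("_attachment_body", body);
        return record;
    }

    public static void main(String[] args) {
        byte[] body = "hello".getBytes(StandardCharsets.UTF_8);
        RecordSketch record =
                fromFlumeEvent(Collections.singletonMap("host", "web01"), body);
        record.put("tag", "a");
        record.put("tag", 7); // values in one field need not share a type
        System.out.println(record.get("host") + " " + record.get("tag"));
    }
}
```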


On Nov 10, 2013, at 4:42 PM, Otis Gospodnetic wrote:

> Hello,
> One more "proactive" question.
> Isn't all code under the .... solr/morphline package not really about
> Morphline *Solr* Sink, but really more about *Morphline* Sink?
> In other words, if where Morphline actually outputs is dictated by the
> Morphline command in Morphline config (e.g. loadSolr()), then as far
> as Flume is concerned, isn't that really just *Morphline* Sink?
> For example, if I wanted to get Flume to pass events through Morphline
> and have Morphline output to Elasticsearch, I wouldn't really want to
> add a whole new Elasticsearch Morphline Sink.  I should really just be
> able to use the existing (misnamed?) Morphline Solr Sink and just
> point it to a Morphline config that has loadElasticsearch() instead of
> loadSolr().
> (please ignore the fact Morphline doesn't actually have
> loadElasticsearch() yet - I think this is a Morphline issue, not a
> Flume issue)
> Is the above correct?
> Thanks,
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
> On Sun, Nov 10, 2013 at 7:29 PM, Otis Gospodnetic
> <[EMAIL PROTECTED]> wrote:
>> Hello,
>> Warning: I'm a Flume NG and Morphlines newbie.
>> I was looking at Morphline Solr Sink to see how one could write an
>> equivalent Morphline Elasticsearch Sink, but after looking at the
>> code, I'm a bit confused.  Here are my Qs:
>> 1)  interface MorphlineHandler mentions Solr in N places, but it
>> doesn't seem to be Solr-specific.  Couldn't one reuse this interface
>> for a Morphline ES Sink?
>> 2) In general, couldn't/shouldn't a few classes from the
>> org.apache.flume.sink.solr.morphline package really live outside
>> anything Solr-specific? e.g.  org.apache.flume.sink.morphline for
>> those that are Morphline-specific?
>> 3) Similarly, BlobDeserializer and BlobHandler don't seem to be even
>> Morphline-specific.  Shouldn't they be elsewhere?
>> 4) I was expecting to see SolrJ (Solr Java client library) being used
>> in MorphlineHandlerImpl or MorphlineSolrSink to send events to Solr,