Json to solr example
Hi to all,
I finally managed to build a flow from JSON files to Solr (without SolrCell),
and I thought it might help someone else.
Obviously this is just my solution. Any comment is appreciated!
Note: one thing I still have to fix is using the url field as the key for the
put, so I need a way to properly modify/replace the generateSolrSequenceKey
command (see the sketch after the morphline config below).

This was my modification to the morphlines Solr tests (I had to add the
dependency on the JSON morphline module, of course; see the pom snippet after
the test):

  @Test
  public void testSolrCellXML() throws Exception {
    morphline = createMorphline("test-morphlines/solrCellXML2");
    String path = RESOURCES_DIR + "/test-documents";
    String[] files = new String[] {
      path + "/somejson.json",
    };
    testDocumentTypesInternal(files, expectedRecords);
  }
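
The dependency I added was roughly this (a sketch only: the version number is
just an example, and the exact coordinates may differ depending on whether you
build against the Cloudera CDK morphlines or a repackaged distribution):

    <!-- example coordinates only; check the morphlines version you actually use -->
    <dependency>
      <groupId>com.cloudera.cdk</groupId>
      <artifactId>cdk-morphlines-json</artifactId>
      <version>0.5.0</version>
      <scope>test</scope>
    </dependency>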

this is my somejson.json:

{
  "id": "fa10b55e-feac-4e3d-8275-33117ac6da1a",
  "url": "someurl",
  "meta": {
    "timestamp": 1372413068,
    "language": "en",
    "categories": [
      "politics",
      "computer",
      "economy"
    ]
  },
  "entity": {
    "name": "sometext",
    "qualifier": "content"
  }
}
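
Just to make the mapping concrete: with the extractJsonPaths section of the
morphline below, and assuming flatten behaves the way I expect, this JSON
should come out as a record roughly like:

    url          : someurl
    last_updated : 1372413068
    category     : [politics, computer, economy]   (three separate values, thanks to "/meta/categories/[]" and flatten)
    language     : en
    content      : sometext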

and this is solrCellXML2:

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**"]

    commands : [
      { readJson {} }

      {
        extractJsonPaths {
          # flatten arrays into real multi-valued fields (not a String representation)
          flatten : true
          paths : {
            url : /url
            last_updated : /meta/timestamp
            category : "/meta/categories/[]"
            language : /meta/language
            content : /entity/name
          }
        }
      }

      {
        generateSolrSequenceKey {
          baseIdField : base_id
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      {
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      { logDebug { format : "solrcell output: {}", args : ["@{}"] } }

      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
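
About the note above on using the url field as the key: since the schema's
uniqueKey is url and extractJsonPaths already fills the url field, I think
generateSolrSequenceKey could simply be dropped, or replaced with something
like the following (an untested sketch; setValues and the "@{field}" syntax
come from the morphlines stdlib, and the id field here is only needed if your
uniqueKey is an id field rather than url):

      {
        setValues {
          # copy the document url into the key field; assumes every record has a url
          id : "@{url}"
        }
      }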
And this is my schema.xml:

<?xml version="1.0" encoding="UTF-8" ?>

<schema name="example-schema" version="1.5">
 <fields>
   <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false" />
   <field name="last_updated" type="long" indexed="true" stored="true" multiValued="false"/>
   <field name="category" type="string" indexed="true" stored="true" multiValued="true" omitTermFreqAndPositions="false" omitNorms="false"/>
   <field name="tokenized-url" type="text_general" indexed="true" stored="true" multiValued="false"/>
   <field name="language" type="string" indexed="true" stored="true" multiValued="false" />
   <!-- A wildcard dynamic field which collects all the possible fields of an entity. -->
   <dynamicField name="*" type="text_ws" indexed="true" stored="true" multiValued="true" omitTermFreqAndPositions="false" omitNorms="false" />
   <field name="_version_" type="long" indexed="true" stored="true"/>
 </fields>

 <uniqueKey>url</uniqueKey>
 <copyField source="url" dest="tokenized-url"/>

 <types>
   <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
   <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
   <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
   <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>

   <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.StandardTokenizerFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.StandardTokenizerFactory"/>
     </analyzer>
   </fieldType>

   <!-- A text field that only splits on whitespace for exact matching of words -->
   <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     </analyzer>
   </fieldType>
 </types>
</schema>
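
In case it helps, this is roughly how the morphline plugs into Flume on my
side (a simplified sketch from memory: paths, agent/source names and the
spooldir source are placeholders, and you should double-check the deserializer
class and the sink defaults in the Flume docs for your version):

    # flume.conf sketch: spooldir source -> memory channel -> MorphlineSolrSink
    agent.sources  = jsonSrc
    agent.channels = memCh
    agent.sinks    = solrSink

    agent.sources.jsonSrc.type     = spooldir
    agent.sources.jsonSrc.spoolDir = /var/spool/json     # placeholder directory with the JSON files
    agent.sources.jsonSrc.channels = memCh
    # deliver each file as one event instead of one event per line,
    # otherwise readJson only sees fragments of the document
    agent.sources.jsonSrc.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder

    agent.channels.memCh.type     = memory
    agent.channels.memCh.capacity = 10000

    agent.sinks.solrSink.type          = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
    agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf   # the morphline above
    agent.sinks.solrSink.morphlineId   = morphline1
    agent.sinks.solrSink.channel       = memCh

Outside the tests the morphline file also needs a SOLR_LOCATOR block at the
top (collection name and zkHost) so that ${SOLR_LOCATOR} resolves.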

Best,
Flavio