Json to solr example
Hi all,
I finally managed to build a flow from JSON files to Solr (without SolrCell),
and I thought this could help someone else.
Obviously this is just my solution. Any comment is appreciated!
Note: in my example I still have to fix one thing. I should use the url field
as the key for the put, so I need a way to properly modify or replace
the generateSolrSequenceKey command.

This was my modification to the morphline Solr tests (I had to add the
dependency on the JSON morphline, of course):

  public void testSolrCellXML() throws Exception {
    morphline = createMorphline("test-morphlines/solrCellXML2");
    String path = RESOURCES_DIR + "/test-documents";
    String[] files = new String[] {
      path + "/somejson.json",
    };
    // expectedRecords is not defined in this excerpt
    testDocumentTypesInternal(files, expectedRecords);
  }

this is my somejson.json:


and this is solrCellXML2:

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**"]

    commands : [
      { readJson {} }

      {
        extractJsonPaths {
          flatten : true # to transform arrays into real arrays (not a String)
          paths : {
            url : /url
            last_updated : /meta/timestamp
            category : "/meta/categories/[]"
            language : /meta/language
            content : /entity/name/
          }
        }
      }

      {
        generateSolrSequenceKey {
          baseIdField: base_id
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      {
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      { logDebug { format : "solrcell output: {}", args : ["@{}"] } }

      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
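
As a rough illustration of what the paths in extractJsonPaths select, here is a minimal Java sketch. It is only an assumption-laden stand-in, not the Morphlines implementation: the class JsonPathSketch, its extract method, and the sample document are all hypothetical, and the document values are invented because the posted somejson.json is not shown. It walks nested Maps and Lists and treats a "[]" step as "expand this array into separate values", which is roughly the effect described for the category field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for path extraction; NOT the Morphlines implementation.
class JsonPathSketch {

    // Walks a path such as "/meta/categories/[]" through nested Maps and Lists.
    // A "[]" step expands a List into its elements, so each element becomes
    // its own value in the result.
    public static List<Object> extract(Object root, String path) {
        List<Object> current = new ArrayList<>();
        current.add(root);
        for (String step : path.split("/")) {
            if (step.isEmpty()) {
                continue; // ignore the empty step produced by the leading slash
            }
            List<Object> next = new ArrayList<>();
            for (Object node : current) {
                if (step.equals("[]")) {
                    if (node instanceof List) {
                        next.addAll((List<?>) node);
                    }
                } else if (node instanceof Map) {
                    Object child = ((Map<?, ?>) node).get(step);
                    if (child != null) {
                        next.add(child);
                    }
                }
            }
            current = next;
        }
        return current;
    }

    // Hypothetical document; all field values are invented for illustration.
    public static Map<String, Object> sampleDocument() {
        Map<String, Object> meta = new LinkedHashMap<>();
        meta.put("timestamp", 1380000000000L);
        meta.put("categories", Arrays.asList("news", "tech"));
        meta.put("language", "en");

        Map<String, Object> entity = new LinkedHashMap<>();
        entity.put("name", "Example");

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("url", "http://example.com/page");
        doc.put("meta", meta);
        doc.put("entity", entity);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = sampleDocument();

        // Same field-to-path mapping as in the morphline config.
        Map<String, String> paths = new LinkedHashMap<>();
        paths.put("url", "/url");
        paths.put("last_updated", "/meta/timestamp");
        paths.put("category", "/meta/categories/[]");
        paths.put("language", "/meta/language");
        paths.put("content", "/entity/name/");

        for (Map.Entry<String, String> e : paths.entrySet()) {
            System.out.println(e.getKey() + " -> " + extract(doc, e.getValue()));
        }
    }
}
```

In this sketch the category path yields one value per array element, which is what a multiValued Solr field expects. A path that does not match anything simply yields an empty list.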

This is my schema.xml:

<?xml version="1.0" encoding="UTF-8" ?>

<schema name="example-schema" version="1.5">
   <field name="url" type="string" indexed="true" stored="true"
          required="true" multiValued="false" />
   <field name="last_updated" type="long" indexed="true" stored="true" />
   <field name="category" type="string" indexed="true" stored="true"
          multiValued="true" omitTermFreqAndPositions="false" omitNorms="false" />
   <field name="tokenized-url" type="text_general" indexed="true"
          stored="true" multiValued="false" />
   <field name="language" type="string" indexed="true" stored="true"
          multiValued="false" />
   <!-- A wildcard dynamic field which collects all the possible fields of
        an entity. -->
   <dynamicField name="*" type="text_ws" indexed="true" stored="true"
          multiValued="true" omitTermFreqAndPositions="false" omitNorms="false" />
   <field name="_version_" type="long" indexed="true" stored="true" />
   <copyField source="url" dest="tokenized-url" />

   <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
   <fieldType name="date" class="solr.TrieDateField" precisionStep="0" />
   <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" />

   <fieldType name="long" class="solr.TrieLongField" precisionStep="0" />
   <fieldType name="text_general" class="solr.TextField">
      <analyzer type="index">
         <tokenizer class="solr.StandardTokenizerFactory" />
      </analyzer>
      <analyzer type="query">
         <tokenizer class="solr.StandardTokenizerFactory" />
      </analyzer>
   </fieldType>

   <!-- A text field that only splits on whitespace for exact matching of
        words -->
   <fieldType name="text_ws" class="solr.TextField">
      <analyzer>
         <tokenizer class="solr.WhitespaceTokenizerFactory" />
      </analyzer>
   </fieldType>
</schema>