Flume >> mail # user >> SolrCell help!

Re: SolrCell help!
Docs for the xquery and xslt morphline commands are here (look for "xquery"): https://github.com/cloudera/cdk/blob/master/cdk-morphlines/src/site/confluence/morphlinesReferenceGuide.confluence

Example morphlines for the new xquery and xslt commands are here: https://github.com/cloudera/cdk/tree/master/cdk-morphlines/cdk-morphlines-saxon/src/test/resources/test-morphlines

Sample input data is here: https://github.com/cloudera/cdk/tree/master/cdk-morphlines/cdk-morphlines-saxon/src/test/resources/test-documents

Unit tests are here: https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-saxon/src/test/java/com/cloudera/cdk/morphline/saxon/SaxonMorphlineTest.java


On Jul 22, 2013, at 1:41 PM, Flavio Pompermaier wrote:

> Ok, I'll try to follow the code! Just one last thing: for morphline JSON I managed to find the tests (in the cdk repository), but I can't find the test code for the new xslt and xquery commands. Could you give me a pointer?
> On Mon, Jul 22, 2013 at 9:21 PM, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
> There are many tests for this in the morphlines repo.
> Wolfgang.
> On Jul 22, 2013, at 11:43 AM, Flavio Pompermaier wrote:
> >
> > Thank you for the great support Wolfgang!
> > Flume + Morphlines is undoubtedly an exciting road, but it's taking me too much time :(
> > Do you think you could add some more tests, including readJson and the new xquery and xslt commands, in trunk?
> >
> > Best,
> > Flavio
> > On Mon, Jul 22, 2013 at 8:12 PM, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
> > It looks like the DcXMLParser emits a metadata field called "title" and also another title as part of the Tika XML stream. SolrCell then adds that metadata field to the Solr document. If you add "title" to the captures, SolrCell adds the title from the XML stream as well.
> >
> > JSON support has been released in morphlines-0.4.1 (which flume trunk is now depending on): http://cloudera.github.io/cdk/docs/0.4.1/cdk-morphlines/morphlinesReferenceGuide.html#readJson
> >
> > Note that Tika XML doesn't really support/capture XPath extraction with SolrCell. We have added proper support for reading, extracting and transforming XML and HTML with XPath, XQuery and XSLT on the current morphlines trunk (not yet released), similar to the way we already support JSON and Avro. This should make XML handling a lot more straightforward, and make the very limited XML SolrCell approach obsolete. Look for the new "xquery" and "xslt" commands in https://github.com/cloudera/cdk/blob/master/cdk-morphlines/src/site/confluence/morphlinesReferenceGuide.confluence
> >
> > Meanwhile, consider using these new commands, using JSON or Avro, or writing your own custom morphline commands that extract whatever you want from your XML data.
> >
> > Wolfgang.
> >
> > On Jul 22, 2013, at 9:18 AM, Flavio Pompermaier wrote:
> >
> > > Hi to all,
> > > I'm trying to understand how to "master" Morphline configuration files in order to put some data into Solr, but I'm facing some problems with TestMorphlineSolrSink. This is what I did:
> > >
> > > 1) Since I want to index the title of testXML.xml (i.e. "Tika test document"), I commented out all the parsers except org.apache.tika.parser.xml.DcXMLParser (which parses Dublin Core metadata)
> > > 2) In schema.xml I added the following field:
> > >     <field name="title" type="text_en" indexed="true" stored="true" multiValued="false" />
> > >
> > > But:
> > >  - If I don't add anything to fmap or capture, everything works fine, but I don't understand why (who fills that field?). If instead I add title to capture and/or title: title (or dc_title:title) to fmap, Solr complains that 2 values are retrieved for 'title' (debugging the values, I see the title and one empty value in the 'title' metadata array...).
> > > Thus, the problem is that everything works magically if the field is named title, but if I change its name to something like doc_title there's no way to make it non-multivalued. Am I right? How can I fix this problem?
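
Editorial note: the behaviour described in this thread (one "title" from the parser metadata, plus a second, empty "title" once the field is captured from the XML stream) can be modelled with a small Python sketch. This is only an illustration of the explanation given above, not SolrCell's or Tika's actual code. The function names (build_solr_doc, violates_single_valued, drop_blank_values) are hypothetical, and dropping blank values is just one possible workaround that a custom morphline command might apply.

```python
from collections import defaultdict


def build_solr_doc(metadata, captured, capture_fields):
    """Toy model: merge parser metadata with values captured from the XML stream."""
    doc = defaultdict(list)
    # Metadata fields emitted by the parser are always added.
    for name, value in metadata.items():
        doc[name].append(value)
    # Values from the XML stream are added only for fields listed in "capture".
    for name, value in captured:
        if name in capture_fields:
            doc[name].append(value)
    return dict(doc)


def violates_single_valued(doc, single_valued):
    """Return the single-valued fields that ended up with more than one value."""
    return sorted(f for f in single_valued if len(doc.get(f, [])) > 1)


def drop_blank_values(doc):
    """Hypothetical workaround: discard empty or whitespace-only values."""
    return {name: [v for v in values if v.strip()] for name, values in doc.items()}


# Values as reported in the thread: a real title plus one empty value.
metadata = {"title": "Tika test document"}
captured = [("title", "")]

doc_without_capture = build_solr_doc(metadata, captured, capture_fields=set())
doc_with_capture = build_solr_doc(metadata, captured, capture_fields={"title"})

print(violates_single_valued(doc_without_capture, {"title"}))
print(violates_single_valued(doc_with_capture, {"title"}))
print(drop_blank_values(doc_with_capture))
```

In this model the field stays single-valued as long as "title" is not captured, which matches the observation that everything works when nothing is added to fmap or capture.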