Flume, mail # user - SolrCell help!


Re: SolrCell help!
Flavio Pompermaier 2013-07-22, 18:43
Thank you for the great support Wolfgang!
Flume + Morphlines is undoubtedly an exciting road, but it's taking me too
much time :(
Do you think you could add some more tests including readJson and the new
xquery and xslt in trunk?

Best,
Flavio
On Mon, Jul 22, 2013 at 8:12 PM, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:

> Looks like the DcXMLParser spits out a metadata field called "title" and
> another title as part of the Tika XML stream. That metadata field is then
> added to the Solr document by SolrCell. If you add "title" to the captures,
> the title from the XML stream gets added as well by SolrCell, so the field
> ends up with two values.
>
> JSON support has been released in morphlines-0.4.1 (which flume trunk is
> now depending on):
> http://cloudera.github.io/cdk/docs/0.4.1/cdk-morphlines/morphlinesReferenceGuide.html#readJson
>
> Note that Tika XML doesn't really support/capture XPath extraction with
> SolrCell. We have added proper support for reading, extracting and
> transforming XML and HTML with XPath, XQuery and XSLT on the current
> morphlines trunk (not yet released), similar to the way we already support
> JSON and Avro. This should make XML handling a lot more straightforward,
> and make the very limited XML SolrCell approach obsolete. Look for the new
> "xquery" and "xslt" commands in
> https://github.com/cloudera/cdk/blob/master/cdk-morphlines/src/site/confluence/morphlinesReferenceGuide.confluence
>
> Meanwhile, consider using these new commands, or use JSON or Avro, or
> write your own custom morphline commands that extract whatever you want
> from your XML data.
>
> Wolfgang.
>
> On Jul 22, 2013, at 9:18 AM, Flavio Pompermaier wrote:
>
> > Hi to all,
> > I'm trying to understand how to "master" Morphline configuration files
> > in order to put some data into Solr, but I'm running into some problems with
> > TestMorphlineSolrSink. This is what I did:
> >
> > 1) Since I want to index the title of testXML.xml (i.e. "Tika test
> > document"), I commented out all the parsers except
> > org.apache.tika.parser.xml.DcXMLParser (which parses Dublin Core metadata).
> > 2) In schema.xml I added the following field:
> >     <field name="title" type="text_en" indexed="true" stored="true" multiValued="false" />
> >
> > But:
> >  - If I don't add anything to fmap or capture, everything works fine, but
> > I don't understand why (who fills that field?). If instead I add title to
> > capture and/or add title: title (or dc_title:title) to fmap, Solr complains
> > that 2 values are retrieved for 'title' (debugging the values, I see the
> > title and one empty value in the 'title' metadata array...).
> > Thus, the problem is that everything works magically if the field is
> > named title, but if I change its name to something like doc_title there's
> > no way to make it non-multivalued. Am I right? How can I fix this problem?
> > - I'd like to manage JSON files. How can I map JSON fields to Solr
> > fields? Could someone give a simple example?
> >
> > Best,
> > Flavio
>
>
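
For the JSON question at the end of the thread, the general idea of mapping
JSON fields to Solr fields can be sketched outside of morphlines in plain
Python. This is only an illustration under assumptions, not the readJson
command or any other morphline API: the Solr field names (id, doc_title,
tags), the slash-separated paths, and the helpers lookup and to_solr_doc are
all hypothetical.

```python
import json

# Hypothetical mapping: Solr field name -> slash-separated path in the JSON record.
FIELD_PATHS = {
    "id": "/id",
    "doc_title": "/metadata/title",
    "tags": "/tags",
}


def lookup(record, path):
    """Follow a slash-separated path through nested dicts; return None if absent."""
    node = record
    for key in path.strip("/").split("/"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def to_solr_doc(raw_json, field_paths=FIELD_PATHS):
    """Parse one JSON record and build a flat dict keyed by Solr field name."""
    record = json.loads(raw_json)
    doc = {}
    for field, path in field_paths.items():
        value = lookup(record, path)
        if value is not None:  # skip fields that are missing from the record
            doc[field] = value
    return doc


if __name__ == "__main__":
    sample = '{"id": "42", "metadata": {"title": "Tika test document"}, "tags": ["a", "b"]}'
    print(to_solr_doc(sample))
```

Each target field receives exactly one value per record here, so a field
declared with multiValued="false" would only get a list if the JSON value
itself is an array (as with the hypothetical tags field).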