MapReduce >> mail # user >> Hadoop/Lucene + Solr architecture suggestions?


Re: Hadoop/Lucene + Solr architecture suggestions?
Your mileage will vary.  I find that this exact motion is what makes
reduce-side indexing very attractive, since it allows precise sharding control.
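
A minimal sketch of what "precise sharding control" could look like on the reduce side, assuming one reducer per shard and a stable hash of the document ID as the distribution function. The names `shard_for` and `group_by_shard` are hypothetical, and the hash is an illustrative choice rather than whatever SolrCloud or any particular job actually uses:

```python
import hashlib
from collections import defaultdict


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard number using a stable hash.

    Python's built-in hash() is randomized per process, so a digest is
    used instead to get the same answer on every node and every run.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def group_by_shard(doc_ids, num_shards):
    """Mimic the shuffle: collect the IDs that each reducer (shard) would see."""
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(shards)


if __name__ == "__main__":
    ids = ["doc-%d" % i for i in range(10)]
    for shard, members in sorted(group_by_shard(ids, 4).items()):
        print(shard, members)
```

Because the job owns this function, it can be swapped for any other routing rule (by date, by customer, by size) without touching the search cluster, provided the searchers are told the same rule.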

On Thu, Oct 11, 2012 at 10:20 PM, Lance Norskog <[EMAIL PROTECTED]> wrote:

> The problem is finding the documents in the search cluster. Making the
> indexes in the reducer means the final m/r pass has to segregate the
> documents according to the distribution function. This uses network
> bandwidth for Hadoop coordination, while using SolrCloud only pushes
> each document once. It's a disk bandwidth vs. network bandwidth
> problem. Although a combiner in the final m/r pass would really speed
> up the Hadoop shuffle.
>
> Lance
>
>     From: "Ted Dunning" <[EMAIL PROTECTED]>
>     To: [EMAIL PROTECTED]
>     Sent: Wednesday, October 10, 2012 11:13:36 PM
>     Subject: Re: Hadoop/Lucene + Solr architecture suggestions?
>
>     Having SolrCloud do the indexing instead of having map-reduce do
> the indexing causes substantial write amplification.
>
>     The issue is that cores are replicated in SolrCloud.  To keep the
> cores in sync, all of the replicas index the same documents, leading to
> amplification.  Further amplification occurs when all of the logs get
> a copy of the document as well.
>
>     Indexing in map-reduce provides considerably higher latency, but
> it drops the write amplification dramatically because only one copy of
> the index needs to be created.  For MapR, this index is created
> directly in the dfs; in vanilla Hadoop you would typically create the
> index on local disk and copy it to HDFS.  The major expense, however, is
> the indexing, which is nice to do only once.  Reduce-side indexing has
> the advantage that your indexing bandwidth naturally increases with
> your increasing cluster size.
>
>     Deployment of indexes can be done by copying from HDFS, or
> directly deploying using NFS from MapR.  Either way, all of the shard
> replicas appear under the live Solr.  Obviously, if you copy from
> HDFS, you have some issues with making sure that things appear
> correctly.  One way to deal with this is to double the copy steps by
> copying from HDFS and then using a differential copy to leave files as
> unchanged as possible. You want to do that to allow as much of the
> memory image to stay live as possible.  With NFS and transactional
> updates, that isn't an issue, of course.
>
>     On the extreme side, you can host all of the searchers on NFS
> hosted index shards.  You can have one Solr instance devoted to
> indexing each shard.  This will cause the shards to update and each
> Solr index will detect these changes after a few seconds.  This gives
> near real-time search with high isolation between indexing and search
> loads.
>
>
>     On Wed, Oct 10, 2012 at 10:38 PM, JAY <[EMAIL PROTECTED]> wrote:
>
>         How can you store Solr shards in hadoop? Is each data node
> running a Solr server?  If so - is the reducer doing a trick to write
> to a local fs?
>
>         Sent from my iPad
>
>         On Oct 11, 2012, at 12:04 AM, "M. C. Srivas" <[EMAIL PROTECTED]>
> wrote:
>
>             Interestingly, a few MapR customers have gone the other
> way, deliberately having the indexer put the Solr shards directly
> into MapR and letting it distribute them. Has made index-management a
> cinch.
>
>             Otherwise they do run into what Tim alludes to.
>
>             On Wed, Oct 10, 2012 at 7:27 PM, Tim Williams
> <[EMAIL PROTECTED]> wrote:
>
>                 On Wed, Oct 10, 2012 at 10:15 PM, Lance Norskog
> <[EMAIL PROTECTED]> wrote:
>                 > In the LucidWorks Big Data product, we handle this
> with a reducer that sends documents to a SolrCloud cluster. This way
> the index files are not managed by Hadoop.
>
>                 Hi Lance,
>                 I'm curious if you've gotten that to work with a
> decent-sized (e.g. > 250 node) cluster?  Even a trivial cluster seems to
> crush SolrCloud from a few months ago at least...
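
The "differential copy" step in Ted Dunning's quoted message (copy from HDFS to a staging directory, then update the live directory while leaving unchanged files alone) could be sketched roughly as follows. This assumes the index has already been fetched to a local staging directory. The function name `differential_copy` and the `delete_stale` flag are hypothetical, and the sketch compares files by name and content only:

```python
import filecmp
import shutil
from pathlib import Path


def differential_copy(staging: Path, live: Path, delete_stale: bool = False) -> list:
    """Bring `live` in line with `staging`, touching only files that differ.

    Returns the names of the files that were copied or deleted.
    """
    live.mkdir(parents=True, exist_ok=True)
    changed = []
    staged_names = set()
    for src in sorted(staging.iterdir()):
        if not src.is_file():
            continue
        staged_names.add(src.name)
        dst = live / src.name
        # Byte-for-byte comparison; identical files are left untouched so
        # whatever the OS has cached for them stays valid.
        if dst.exists() and filecmp.cmp(src, dst, shallow=False):
            continue
        shutil.copy2(src, dst)
        changed.append(src.name)
    if delete_stale:
        # Optional: remove live files that no longer exist in the new index.
        for dst in sorted(live.iterdir()):
            if dst.is_file() and dst.name not in staged_names:
                dst.unlink()
                changed.append(dst.name)
    return changed
```

A real deployment would also need to coordinate with the running searcher (for example, reopening it only after the copy completes), which this sketch does not attempt.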