Re: Hadoop/Lucene + Solr architecture suggestions?
Ted Dunning 2012-10-12, 05:23
Your mileage will vary.  I find that this exact data motion is what makes
reduce-side indexing very attractive, since it allows precise sharding
control.
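
(For concreteness, a minimal sketch of the kind of partitioner that gives
that control, assuming the Hadoop mapreduce API and one reducer per shard.
The hash is only a stand-in for whatever distribution function the search
cluster actually uses, i.e. the "distribution function" Lance mentions
below; class and field types are illustrative.)

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // One reducer builds one shard, so choosing the partition number
    // chooses exactly which shard a document lands in.
    public class ShardPartitioner extends Partitioner<Text, Text> {
      @Override
      public int getPartition(Text docId, Text doc, int numReducers) {
        return (docId.hashCode() & Integer.MAX_VALUE) % numReducers;
      }
    }

(Wired up with job.setPartitionerClass(ShardPartitioner.class) and
job.setNumReduceTasks(numShards), where numShards matches the search
cluster's layout.)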

On Thu, Oct 11, 2012 at 10:20 PM, Lance Norskog <[EMAIL PROTECTED]> wrote:

> The problem is finding the documents in the search cluster. Making the
> indexes in the reducer means the final m/r pass has to segregate the
> documents according to the distribution function. This uses network
> bandwidth for Hadoop coordination, while using SolrCloud only pushes
> each document once. It's a disk bandwidth vs. network bandwidth
> problem, although a combiner in the final m/r pass would really speed
> up the Hadoop shuffle.
>
> Lance
>
>     From: "Ted Dunning" <[EMAIL PROTECTED]>
>     To: [EMAIL PROTECTED]
>     Sent: Wednesday, October 10, 2012 11:13:36 PM
>     Subject: Re: Hadoop/Lucene + Solr architecture suggestions?
>
>     Having SolrCloud do the indexing instead of having map-reduce do
> the indexing causes substantial write amplification.
>
>     The issue is that cores are replicated in SolrCloud.  To keep the
> cores in sync, all of the replicas index the same documents, leading to
> amplification.  Further amplification occurs when each replica's
> transaction log gets a copy of the document as well.
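>
>     (As a rough illustration: with a replication factor of three, each
> document is analyzed and indexed three times and written to three
> transaction logs, on the order of six writes per document, versus a
> single indexing pass when the shard is built once in a reducer.)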
>
>     Indexing in map-reduce provides considerably higher latency, but
> it drops the write amplification dramatically because only one copy of
> the index needs to be created.  For MapR, this index is created
> directly in the distributed file system; in vanilla Hadoop you would
> typically create the index on local disk and copy it to HDFS.  The
> major expense, however, is the indexing itself, which is nice to do
> only once.  Reduce-side indexing has the advantage that your indexing
> bandwidth naturally increases as your cluster grows.
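>
>     (A minimal sketch of that reduce-side pattern, assuming roughly
> Lucene 4.x and the Hadoop mapreduce API; the paths, field names and
> class names here are illustrative, not anything from this thread.)
>
> import java.io.File;
> import java.io.IOException;
>
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Reducer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.StringField;
> import org.apache.lucene.document.TextField;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.util.Version;
>
> // Each reducer builds one shard on local disk, then ships it to HDFS,
> // so every document is analyzed and indexed exactly once.
> public class IndexingReducer extends Reducer<Text, Text, Text, Text> {
>   private IndexWriter writer;
>   private File localDir;
>   private int shard;
>
>   @Override
>   protected void setup(Context ctx) throws IOException {
>     shard = ctx.getTaskAttemptID().getTaskID().getId();
>     localDir = new File("/tmp/shard-" + shard);
>     IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_40,
>         new StandardAnalyzer(Version.LUCENE_40));
>     writer = new IndexWriter(FSDirectory.open(localDir), cfg);
>   }
>
>   @Override
>   protected void reduce(Text docId, Iterable<Text> bodies, Context ctx)
>       throws IOException {
>     for (Text body : bodies) {
>       Document doc = new Document();
>       doc.add(new StringField("id", docId.toString(), Field.Store.YES));
>       doc.add(new TextField("body", body.toString(), Field.Store.NO));
>       writer.addDocument(doc);
>     }
>   }
>
>   @Override
>   protected void cleanup(Context ctx) throws IOException {
>     writer.close();
>     // the only copy of the shard crosses the network once, here
>     FileSystem.get(ctx.getConfiguration()).copyFromLocalFile(
>         new Path(localDir.getPath()),
>         new Path("/indexes/shard-" + shard));
>   }
> }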
>
>     Deployment of indexes can be done by copying from HDFS, or by
> deploying directly using NFS from MapR.  Either way, all of the shard
> replicas appear under the live Solr.  Obviously, if you copy from
> HDFS, you have some issues with making sure that things appear
> correctly.  One way to deal with this is to double the copy steps:
> copy from HDFS to a staging area, then do a differential copy into the
> live index directory so that as many files as possible are left
> unchanged.  You want that so that as much of the memory image as
> possible stays live.  With NFS and transactional updates, that isn't
> an issue, of course.
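>
>     (The differential copy works nicely here because Lucene segment
> files are write-once and their names are not reused, so a tool like
> rsync leaves the segments that have not changed untouched and their
> pages stay warm in the OS cache.)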
>
>     On the extreme side, you can host all of the searchers on
> NFS-hosted index shards.  You can have one Solr instance devoted to
> indexing each shard.  This will cause the shards to update, and each
> search-side Solr instance will detect these changes after a few
> seconds.  This gives near real-time search with high isolation between
> indexing and search loads.
>
>
>     On Wed, Oct 10, 2012 at 10:38 PM, JAY <[EMAIL PROTECTED]> wrote:
>
>         How can you store Solr shards in Hadoop? Is each data node
> running a Solr server?  If so, is the reducer doing a trick to write
> to a local fs?
>
>         Sent from my iPad
>
>         On Oct 11, 2012, at 12:04 AM, "M. C. Srivas" <[EMAIL PROTECTED]>
> wrote:
>
>             Interestingly, a few MapR customers have gone the other
> way, deliberately having the indexer put the Solr shards directly
> into MapR and letting it distribute them.  That has made index
> management a cinch.
>
>             Otherwise they do run into what Tim alludes to.
>
>             On Wed, Oct 10, 2012 at 7:27 PM, Tim Williams
> <[EMAIL PROTECTED]> wrote:
>
>                 On Wed, Oct 10, 2012 at 10:15 PM, Lance Norskog
> <[EMAIL PROTECTED]> wrote:
>                 > In the LucidWorks Big Data product, we handle this
> with a reducer that sends documents to a SolrCloud cluster. This way
> the index files are not managed by Hadoop.
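>
>                 (Not LucidWorks' actual implementation, just a sketch
> of the pattern Lance describes, assuming 2012-era SolrJ; the ZooKeeper
> addresses, collection name and field names are made up.)
>
> import java.io.IOException;
>
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Reducer;
> import org.apache.solr.client.solrj.SolrServerException;
> import org.apache.solr.client.solrj.impl.CloudSolrServer;
> import org.apache.solr.common.SolrInputDocument;
>
> // Pushes documents straight to SolrCloud; no index files are ever
> // written to or managed by Hadoop.
> public class SolrCloudReducer extends Reducer<Text, Text, Text, Text> {
>   private CloudSolrServer solr;
>
>   @Override
>   protected void setup(Context ctx) throws IOException {
>     try {
>       solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
>       solr.setDefaultCollection("docs");
>     } catch (Exception e) {
>       throw new IOException(e);
>     }
>   }
>
>   @Override
>   protected void reduce(Text docId, Iterable<Text> bodies, Context ctx)
>       throws IOException {
>     try {
>       for (Text body : bodies) {
>         SolrInputDocument doc = new SolrInputDocument();
>         doc.addField("id", docId.toString());
>         doc.addField("body", body.toString());
>         solr.add(doc);  // SolrCloud routes it to a shard and replicas
>       }
>     } catch (SolrServerException e) {
>       throw new IOException(e);
>     }
>   }
>
>   @Override
>   protected void cleanup(Context ctx) throws IOException {
>     try {
>       solr.commit();
>     } catch (SolrServerException e) {
>       throw new IOException(e);
>     } finally {
>       solr.shutdown();
>     }
>   }
> }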
>
>                 Hi Lance,
>                 I'm curious whether you've gotten that to work with a
> decent-sized (e.g., >250 node) cluster?  Even a trivial cluster seems
> to crush SolrCloud, as of a few months ago at least...