Mark Kerzner 2012-10-10, 02:47
Ted Dunning 2012-10-10, 14:58
Lance Norskog 2012-10-11, 02:15
Tim Williams 2012-10-11, 02:27
M. C. Srivas 2012-10-11, 05:04
JAY 2012-10-11, 05:38
Ted Dunning 2012-10-11, 06:13
Lance Norskog 2012-10-12, 05:20
Your mileage will vary. I find that this exact motion is what makes
reduce-side indexing very attractive since it allows precise sharding control.
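[Editor's note: a minimal sketch of the "precise sharding control" idea above. This is the routing computation a custom Hadoop Partitioner would perform in reduce-side indexing, so that each reducer builds exactly one shard. The class and method names are illustrative, not from any code in this thread.]

```java
// Deterministic document-to-shard routing, the core of a custom
// Partitioner for reduce-side indexing: every document with the same
// id always lands on the same reducer, hence in the same shard.
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    // Mask off the sign bit so the hash is non-negative, then take it
    // modulo the shard count -- the same shape as Partitioner.getPartition().
    public int shardFor(String docId) {
        return (docId.hashCode() & Integer.MAX_VALUE) % numShards;
    }
}
```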
On Thu, Oct 11, 2012 at 10:20 PM, Lance Norskog <[EMAIL PROTECTED]> wrote:
> The problem is finding the documents in the search cluster. Making the
> indexes in the reducer means the final m/r pass has to segregate the
> documents according to the distribution function. This uses network
> bandwidth for Hadoop coordination, while using SolrCloud only pushes
> each document once. It's a disk bandwidth vs. network bandwidth
> problem, although a combiner in the final m/r pass would really speed
> up the Hadoop shuffle.
> From: "Ted Dunning" <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Wednesday, October 10, 2012 11:13:36 PM
> Subject: Re: Hadoop/Lucene + Solr architecture suggestions?
> Having SolrCloud do the indexing instead of having map-reduce do
> the indexing causes substantial write amplification.
> The issue is that cores are replicated in SolrCloud. To keep the
> cores in sync, all of the replicas index the same documents, leading to
> amplification. Further amplification occurs when all of the logs get
> a copy of the document as well.
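[Editor's note: a back-of-envelope reading of the amplification argument above, assuming each replica performs one index write plus one transaction-log write per document; the figures and names are illustrative, not measurements.]

```java
// Write amplification per document: in SolrCloud every replica both
// indexes the document and logs it, while reduce-side indexing builds
// one index and copies it once into the DFS.
public class WriteAmplification {
    public static int solrCloudWrites(int replicas) {
        return replicas * 2;   // index write + transaction-log write per replica
    }

    public static int reduceSideWrites() {
        return 2;              // one local index build + one copy to HDFS
    }
}
```

With a typical replication factor of 3, that is 6 document writes versus 2, which is the "substantial" amplification the paragraph describes.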
> Indexing in map-reduce incurs considerably higher latency, but
> it drops the write amplification dramatically because only one copy of
> the index needs to be created. For MapR, this index is created
> directly in the DFS; in vanilla Hadoop you would typically create the
> index on local disk and copy it to HDFS. The major expense, however, is
> the indexing itself, which is nice to do only once. Reduce-side indexing has
> the advantage that your indexing bandwidth naturally increases with
> your increasing cluster size.
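[Editor's note: a sketch of the vanilla-Hadoop flow just described -- build the shard index on local disk in the reducer, then copy it out to shared storage. A plain directory stands in for HDFS here (real code would use `FileSystem.copyFromLocalFile()`); all names are illustrative.]

```java
import java.io.IOException;
import java.nio.file.*;

// Publish a locally-built Lucene shard index to a DFS directory by
// copying every segment file across, replacing any stale copies.
public class ShardPublisher {
    public static Path publish(Path localIndexDir, Path dfsShardDir) throws IOException {
        Files.createDirectories(dfsShardDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(localIndexDir)) {
            for (Path f : files) {
                Files.copy(f, dfsShardDir.resolve(f.getFileName().toString()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return dfsShardDir;
    }
}
```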
> Deployment of indexes can be done by copying from HDFS, or
> directly deploying using NFS from MapR. Either way, all of the shard
> replicas appear under the live Solr. Obviously, if you copy from
> HDFS, you have some issues with making sure that things appear
> correctly. One way to deal with this is to double the copy steps by
> copying from HDFS and then using a differential copy to leave files as
> unchanged as possible. You want to do that to allow as much of the
> memory image to stay live as possible. With NFS and transactional
> updates, that isn't an issue, of course.
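[Editor's note: a sketch of the "differential copy" step above -- after pulling a fresh index from HDFS into a staging directory, only overwrite the live files that actually differ, so the OS page cache for unchanged segment files stays warm. A real deployment would compare checksums rather than full contents; all names are illustrative.]

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Arrays;

// Copy only changed files from a staging dir into the live index dir,
// leaving identical files untouched. Returns how many were rewritten.
public class DifferentialCopy {
    public static int sync(Path staging, Path live) throws IOException {
        Files.createDirectories(live);
        int copied = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(staging)) {
            for (Path src : files) {
                Path dst = live.resolve(src.getFileName().toString());
                if (Files.exists(dst)
                        && Arrays.equals(Files.readAllBytes(src), Files.readAllBytes(dst))) {
                    continue;   // identical: leave the live file (and its page cache) alone
                }
                Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                copied++;
            }
        }
        return copied;
    }
}
```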
> On the extreme side, you can host all of the searchers on NFS
> hosted index shards. You can have one Solr instance devoted to
> indexing each shard. This will cause the shards to update and each
> Solr index will detect these changes after a few seconds. This gives
> near real-time search with high isolation between indexing and search.
> On Wed, Oct 10, 2012 at 10:38 PM, JAY <[EMAIL PROTECTED]> wrote:
> How can you store Solr shards in Hadoop? Is each data node
> running a Solr server? If so - is the reducer doing a trick to write
> to a local fs?
> Sent from my iPad
> On Oct 11, 2012, at 12:04 AM, "M. C. Srivas" <[EMAIL PROTECTED]> wrote:
> Interestingly, a few MapR customers have gone the other
> way, deliberately having the indexer put the Solr shards directly
> into MapR and letting it distribute them. Has made index-management a
> Otherwise they do run into what Tim alludes to.
> On Wed, Oct 10, 2012 at 7:27 PM, Tim Williams
> <[EMAIL PROTECTED]> wrote:
> On Wed, Oct 10, 2012 at 10:15 PM, Lance Norskog
> <[EMAIL PROTECTED]> wrote:
> > In the LucidWorks Big Data product, we handle this
> with a reducer that sends documents to a SolrCloud cluster. This way
> the index files are not managed by Hadoop.
> Hi Lance,
> I'm curious if you've gotten that to work with a
> decent-sized (e.g. > 250 node) cluster? Even a trivial cluster
> seems to crush SolrCloud from a few months ago at least...