The problem is finding the documents in the search cluster. Building the
indexes in the reducer means the final m/r pass has to segregate the
documents according to the distribution function. This uses network
bandwidth for Hadoop coordination, while using SolrCloud only pushes
each document once. It's a disk bandwidth vs. network bandwidth
problem, although a combiner in the final m/r pass would really speed
up the Hadoop shuffle.
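To make the "distribution function" point above concrete, here is a minimal sketch, assuming documents are assigned to shards by a stable hash of their id. The names shard_for and segregate and the choice of CRC32 are illustrative assumptions only; this is not SolrCloud's actual routing code.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id: str, num_shards: int) -> int:
    """Hypothetical distribution function: stable hash of the id, modulo shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def segregate(doc_ids, num_shards):
    """Group document ids by target shard, as the final m/r pass would have to."""
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(shards)
```

Whatever function is used, the reducer-side indexer and the search cluster would have to agree on it, otherwise a document cannot be found in the shard where a query expects it.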
From: "Ted Dunning" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Wednesday, October 10, 2012 11:13:36 PM
Subject: Re: Hadoop/Lucene + Solr architecture suggestions?
Having SolrCloud do the indexing instead of having map-reduce do
the indexing causes substantial write amplification.
The issue is that cores are replicated in SolrCloud. To keep the
cores in sync, all of the replicas index the same documents, leading to
amplification. Further amplification occurs when all of the logs get
a copy of the document as well.
Indexing in map-reduce incurs considerably higher latency, but
it drops the write amplification dramatically because only one copy of
the index needs to be created. For MapR, this index is created
directly in the dfs; in vanilla Hadoop you would typically create the
index on local disk and copy it to HDFS. The major expense, however, is
the indexing, which is nice to do only once. Reduce-side indexing has
the advantage that your indexing bandwidth naturally increases with
your increasing cluster size.
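As a back-of-the-envelope illustration of the amplification argument above, here is a minimal sketch. It assumes every replica indexes every document once and also writes it to a log log_copies times. The function names and the log_copies parameter are hypothetical, and the numbers are not measurements.

```python
def solrcloud_doc_writes(num_docs: int, replicas: int, log_copies: int = 1) -> int:
    """Per-document work when every replica indexes and logs each document."""
    # One index write plus log_copies log writes, repeated on every replica.
    return num_docs * replicas * (1 + log_copies)


def mapreduce_doc_writes(num_docs: int) -> int:
    """Per-document work when the index is built once in the reducers."""
    return num_docs


def amplification(num_docs: int, replicas: int, log_copies: int = 1) -> float:
    """Ratio of the two approaches under the assumptions stated above."""
    return solrcloud_doc_writes(num_docs, replicas, log_copies) / mapreduce_doc_writes(num_docs)
```

Under these assumptions the ratio depends only on the replica count and the number of log copies, not on the number of documents. The model deliberately ignores the cost of copying the finished index files to each replica in the map-reduce case.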
Deployment of indexes can be done by copying from HDFS, or by
deploying directly using NFS from MapR. Either way, all of the shard
replicas appear under the live Solr. Obviously, if you copy from
HDFS, you have some issues with making sure that things appear
correctly. One way to deal with this is to double the copy steps by
copying from HDFS and then using a differential copy to leave files as
unchanged as possible. You want to do that to allow as much of the
memory image to stay live as possible. With NFS and transactional
updates, that isn't an issue, of course.
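The differential copy step described above could look roughly like the following. This is a minimal sketch, assuming the index has already been copied out of HDFS into a local staging directory and that both directories are flat. The function name differential_copy and both directory arguments are hypothetical.

```python
import filecmp
import shutil
from pathlib import Path
from typing import List


def differential_copy(staging: Path, live: Path) -> List[str]:
    """Copy only files that are new or differ; leave identical live files untouched."""
    live.mkdir(parents=True, exist_ok=True)
    changed = []
    for src in sorted(staging.iterdir()):
        if not src.is_file():
            continue
        dst = live / src.name
        # Byte-for-byte comparison; identical files are skipped so the
        # copy already being served is not rewritten.
        if dst.exists() and filecmp.cmp(src, dst, shallow=False):
            continue
        shutil.copy2(src, dst)
        changed.append(src.name)
    return changed
```

The sketch does not delete files that have disappeared from the staging copy, and it does nothing to make the update appear atomically to a running searcher. Both would need handling in a real deployment.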
On the extreme side, you can host all of the searchers on NFS-hosted
index shards. You can have one Solr instance devoted to
indexing each shard. This will cause the shards to update, and each
Solr index will detect these changes after a few seconds. This gives
near real-time search with high isolation between indexing and search.
On Wed, Oct 10, 2012 at 10:38 PM, JAY <[EMAIL PROTECTED]> wrote:
How can you store Solr shards in hadoop? Is each data node
running a Solr server? If so - is the reducer doing a trick to write
to a local fs?
Sent from my iPad
On Oct 11, 2012, at 12:04 AM, "M. C. Srivas" <[EMAIL PROTECTED]> wrote:
Interestingly, a few MapR customers have gone the other
way, deliberately having the indexer put the Solr shards directly
into MapR and letting it distribute it. Has made index-management a
Otherwise they do run into what Tim alludes to.
On Wed, Oct 10, 2012 at 7:27 PM, Tim Williams
<[EMAIL PROTECTED]> wrote:
On Wed, Oct 10, 2012 at 10:15 PM, Lance Norskog
<[EMAIL PROTECTED]> wrote:
> In the LucidWorks Big Data product, we handle this with a reducer
> that sends documents to a SolrCloud cluster. This way the index
> files are not managed by Hadoop.
I'm curious if you've gotten that to work with a decent-sized
(e.g. > 250 node) cluster? Even a trivial cluster seems to
from a few months ago at least...
> ----- Original Message -----
> | From: "Ted Dunning" <[EMAIL PROTECTED]>
> | To: [EMAIL PROTECTED]
> | Cc: "Hadoop User" <[EMAIL PROTECTED]>
> | Sent: Wednesday, October 10, 2012 7:58:57 AM
> | Subject: Re: Hadoop/Lucene + Solr architecture suggestions?
> | I prefer to create indexes in the reducer personally.
> | Also you can avoid the copies if you use an
> | distro. Email me off list for details.
> | Sent from my iPhone
> | On Oct 9, 2012, at 7:47 PM, Mark Kerzner
> | wrote:
> | > Hi,
> | >
> | > if I create a Lucene index in each mapper, locally, then copy them
> | > to under /jobid/mapid1, /jobid/mapid2, and then in the reducers
> | > copy them to some Solr machine (perhaps even merging), does such
> | > architecture make sense, to create a searchable
> | > Hadoop?
> | >
> | > Are there links for similar architectures and questions?
> | >
> | > Thank you. Sincerely,
> | > Mark