Having SolrCloud do the indexing instead of having map-reduce do it
causes substantial write amplification.
The issue is that cores are replicated in SolrCloud. To keep the cores in
sync, every replica indexes the same documents, which amplifies the
writes. Further amplification occurs because each of the logs gets a copy
of the document as well.
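
As a rough illustration of the amplification described above, here is a minimal counting sketch. It assumes each replica indexes a document once and keeps one log that also receives a copy; the function names and the one-log-per-replica figure are assumptions made for illustration, not measurements. Segment merges and file system replication are ignored.

```python
def solrcloud_writes_per_doc(replicas: int, logs_per_replica: int = 1) -> int:
    """Count writes of one document when every replica indexes it itself.

    Assumption: each replica performs one index write plus one write per log.
    """
    if replicas < 1 or logs_per_replica < 0:
        raise ValueError("need at least one replica and a non-negative log count")
    return replicas * (1 + logs_per_replica)


def mapreduce_index_builds_per_doc() -> int:
    """In the map-reduce approach the index is built once, then copied as files."""
    return 1


if __name__ == "__main__":
    for replicas in (1, 2, 3):
        print(replicas, "replica(s):",
              solrcloud_writes_per_doc(replicas), "writes vs",
              mapreduce_index_builds_per_doc(), "index build")
```

Under these assumptions the cost of the replicated approach grows linearly with the replica count, while the map-reduce approach pays the indexing cost once no matter how many copies of the finished index are deployed.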
Indexing in map-reduce has considerably higher latency, but it drops
the write amplification dramatically because only one copy of the index
needs to be created. For MapR, this index is created directly in the dfs;
in vanilla Hadoop you would typically create the index on local disk and
copy it to HDFS. The major expense, however, is the indexing, which is nice
to do only once. Reduce-side indexing has the advantage that your indexing
bandwidth naturally increases as your cluster grows.
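
To make the scaling point concrete, here is a small sketch of how documents might be routed so that each reducer builds exactly one shard. It only models the partitioning step in plain Python; `shard_for` and `group_by_shard` are hypothetical names, and the CRC32 hash is an assumption chosen because it is stable across runs. A real job would do this routing in the partitioner and build the index inside the reducer.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard (one reducer per shard) with a stable hash."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def group_by_shard(doc_ids: Iterable[str], num_shards: int) -> Dict[int, List[str]]:
    """Group document ids by the shard whose reducer would index them."""
    shards: Dict[int, List[str]] = defaultdict(list)
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(shards)


if __name__ == "__main__":
    ids = [f"doc-{i}" for i in range(20)]
    for shard, members in sorted(group_by_shard(ids, 4).items()):
        print(shard, len(members))
```

Because each shard is built independently, adding nodes lets more reducers run at once, which is where the extra indexing bandwidth comes from.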
Deployment of indexes can be done by copying from HDFS, or by deploying
directly using NFS from MapR. Either way, all of the shard replicas
appear under the live Solr. Obviously, if you copy from HDFS, you have to
take some care that the files appear correctly. One way to deal with this
is to double the copy steps: copy from HDFS first, and then use a
differential copy that leaves files as unchanged as possible. That allows
as much of the memory image as possible to stay live. With NFS and
transactional updates, that isn't an issue, of course.
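
The differential copy step could look roughly like the sketch below. It assumes the index has already been copied from HDFS into a local staging directory and that the index is a flat directory of files; `differential_copy` and both directory arguments are hypothetical names for illustration. Files whose contents are identical are never touched, so whatever the OS has cached for them stays valid.

```python
import filecmp
import os
import shutil
from pathlib import Path
from typing import List, Tuple


def differential_copy(staging: Path, live: Path) -> Tuple[List[str], List[str]]:
    """Bring `live` in line with `staging`, rewriting only files that differ.

    Returns (copied, removed) file names. Subdirectories are ignored.
    """
    live.mkdir(parents=True, exist_ok=True)
    copied: List[str] = []
    for src in sorted(staging.iterdir()):
        if not src.is_file():
            continue
        dst = live / src.name
        # Leave byte-identical files alone so their cached pages stay live.
        if dst.exists() and filecmp.cmp(src, dst, shallow=False):
            continue
        # Copy to a temporary name, then rename, so readers never see a partial file.
        tmp = live / (src.name + ".tmp")
        shutil.copyfile(src, tmp)
        os.replace(tmp, dst)
        copied.append(src.name)
    removed: List[str] = []
    for old in sorted(live.iterdir()):
        if old.is_file() and not (staging / old.name).exists():
            old.unlink()
            removed.append(old.name)
    return copied, removed
```

A full byte comparison is the simplest correct check but reads every file; comparing sizes or checksums would be a cheaper variation. Whether it is safe to remove old files while a searcher still has them open depends on the deployment, so treat that last loop as an assumption too.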
At the extreme, you can host all of the searchers on NFS-hosted index
shards and devote one Solr instance to indexing each shard. As the
indexers update the shards, each searcher will detect the changes after a
few seconds. This gives near real-time search with high isolation between
indexing and search loads.
On Wed, Oct 10, 2012 at 10:38 PM, JAY <[EMAIL PROTECTED]> wrote:
> How can you store Solr shards in hadoop? Is each data node running a Solr
> server? If so - is the reducer doing a trick to write to a local fs?
> On Oct 11, 2012, at 12:04 AM, "M. C. Srivas" <[EMAIL PROTECTED]> wrote:
> Interestingly, a few MapR customers have gone the other way, deliberately
> having the indexer put the Solr shards directly into MapR and letting it
> distribute it. Has made index-management a cinch.
> Otherwise they do run into what Tim alludes to.
> On Wed, Oct 10, 2012 at 7:27 PM, Tim Williams <[EMAIL PROTECTED]> wrote:
>> On Wed, Oct 10, 2012 at 10:15 PM, Lance Norskog <[EMAIL PROTECTED]> wrote:
>> > In the LucidWorks Big Data product, we handle this with a reducer that
>> > sends documents to a SolrCloud cluster. This way the index files are not
>> > managed by Hadoop.
>> Hi Lance,
>> I'm curious if you've gotten that to work with a decent-sized (e.g. >
>> 250 node) cluster? Even a trivial cluster seems to crush SolrCloud
>> from a few months ago at least...
>> > ----- Original Message -----
>> > | From: "Ted Dunning" <[EMAIL PROTECTED]>
>> > | To: [EMAIL PROTECTED]
>> > | Cc: "Hadoop User" <[EMAIL PROTECTED]>
>> > | Sent: Wednesday, October 10, 2012 7:58:57 AM
>> > | Subject: Re: Hadoop/Lucene + Solr architecture suggestions?
>> > |
>> > | I prefer to create indexes in the reducer personally.
>> > |
>> > | Also you can avoid the copies if you use an advanced hadoop-derived
>> > | distro. Email me off list for details.
>> > |
>> > | On Oct 9, 2012, at 7:47 PM, Mark Kerzner <[EMAIL PROTECTED]> wrote:
>> > |
>> > | > Hi,
>> > | >
>> > | > if I create a Lucene index in each mapper, locally, then copy them
>> > | > under /jobid/mapid1, /jobid/mapid2, and then in the reducers
>> > | > copy them to some Solr machine (perhaps even merging), does such an
>> > | > architecture make sense, to create a searchable index with