-
Hadoop/Lucene + Solr architecture suggestions?
Mark Kerzner 2012-10-10, 02:47
Hi,

if I create a Lucene index in each mapper, locally, then copy the indexes under /jobid/mapid1, /jobid/mapid2, and so on, and then in the reducers copy them to some Solr machine (perhaps even merging them), does such an architecture make sense for creating a searchable index with Hadoop?

Are there links for similar architectures and questions?

Thank you. Sincerely,
Mark
-
Re: Hadoop/Lucene + Solr architecture suggestions?
Ted Dunning 2012-10-10, 14:58
Personally, I prefer to create indexes in the reducer.

Also, you can avoid the copies if you use an advanced Hadoop-derived distro. Email me off list for details.
-
Re: Hadoop/Lucene + Solr architecture suggestions?
Lance Norskog 2012-10-11, 02:15
In the LucidWorks Big Data product, we handle this with a reducer that sends documents to a SolrCloud cluster. This way the index files are not managed by Hadoop.
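As a rough illustration of the "reducer sends documents to the search cluster" pattern, here is a minimal sketch that batches documents and hands each batch, serialized as a JSON array, to a sender. This is not the LucidWorks implementation: the names `batches` and `push_documents`, the `send` callback (a hypothetical stand-in for whatever client posts to the cluster's update endpoint) and the batch size of 500 are all assumptions made for the example.

```python
import json
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def batches(docs: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def push_documents(docs: Iterable[Dict],
                   send: Callable[[bytes], None],
                   batch_size: int = 500) -> int:
    """Serialize each batch as a JSON array and pass it to `send`.

    `send` is a hypothetical callback; in a real job it would post the
    payload to the search cluster. Returns the number of documents pushed.
    """
    count = 0
    for chunk in batches(docs, batch_size):
        send(json.dumps(chunk).encode("utf-8"))
        count += len(chunk)
    return count
```

Batching keeps the number of round trips from each reducer small; the right batch size depends on document size and on how much load the search cluster tolerates.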
-
Re: Hadoop/Lucene + Solr architecture suggestions?
Tim Williams 2012-10-11, 02:27
On Wed, Oct 10, 2012 at 10:15 PM, Lance Norskog wrote:
> In the LucidWorks Big Data product, we handle this with a reducer that sends documents to a SolrCloud cluster. This way the index files are not managed by Hadoop.

Hi Lance,
I'm curious whether you've gotten that to work with a decent-sized (e.g. > 250 node) cluster? Even a trivial cluster seems to crush SolrCloud, at least as of a few months ago...

Thanks,
--tim
-
Re: Hadoop/Lucene + Solr architecture suggestions?
M. C. Srivas 2012-10-11, 05:04
Interestingly, a few MapR customers have gone the other way, deliberately having the indexer put the Solr shards directly into MapR and letting it distribute them. This has made index management a cinch.

Otherwise they do run into what Tim alludes to.
-
Re: Hadoop/Lucene + Solr architecture suggestions?
JAY 2012-10-11, 05:38
How can you store Solr shards in Hadoop? Is each data node running a Solr server? If so, is the reducer doing a trick to write to a local fs?
-
Re: Hadoop/Lucene + Solr architecture suggestions?
Ted Dunning 2012-10-11, 06:13
Having SolrCloud do the indexing instead of having map-reduce do the indexing causes substantial write amplification.

The issue is that cores are replicated in SolrCloud. To keep the cores in sync, all of the replicas index the same documents, leading to amplification. Further amplification occurs when all of the logs get a copy of the document as well.

Indexing in map-reduce incurs considerably higher latency, but it drops the write amplification dramatically because only one copy of the index needs to be created. For MapR, this index is created directly in the dfs; in vanilla Hadoop you would typically create the index on local disk and copy it to HDFS. The major expense, however, is the indexing, which is nice to do only once. Reduce-side indexing has the advantage that your indexing bandwidth naturally increases as your cluster grows.

Deployment of indexes can be done by copying from HDFS, or by deploying directly using NFS from MapR. Either way, all of the shard replicas appear under the live Solr. Obviously, if you copy from HDFS, you have some issues with making sure that things appear correctly. One way to deal with this is to double the copy steps: copy from HDFS, then use a differential copy to leave files as unchanged as possible. You want to do that to allow as much of the memory image to stay live as possible. With NFS and transactional updates, that isn't an issue, of course.

On the extreme side, you can host all of the searchers on NFS-hosted index shards. You can have one Solr instance devoted to indexing each shard. This will cause the shards to update, and each Solr index will detect these changes after a few seconds. This gives near real-time search with high isolation between indexing and search loads.
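The "differential copy" step described above can be sketched roughly as follows, assuming the new index has already been copied out of HDFS into a local staging directory. The function name `differential_copy` and the flat, one-directory layout are assumptions made for the example, not anything taken from the thread. The idea is to overwrite only files whose contents differ, so files the live searcher already has cached are left alone.

```python
import filecmp
import shutil
from pathlib import Path
from typing import List


def differential_copy(staging: Path, live: Path) -> List[str]:
    """Make `live` match `staging`, touching only files that differ.

    Returns the sorted names of files that were copied or deleted.
    """
    live.mkdir(parents=True, exist_ok=True)
    changed = []
    staged_names = {p.name for p in staging.iterdir() if p.is_file()}

    for name in sorted(staged_names):
        src, dst = staging / name, live / name
        # shallow=False compares file contents, not just size and mtime.
        if not dst.exists() or not filecmp.cmp(src, dst, shallow=False):
            shutil.copy2(src, dst)
            changed.append(name)

    # Remove files that are no longer part of the new index.
    for p in list(live.iterdir()):
        if p.is_file() and p.name not in staged_names:
            p.unlink()
            changed.append(p.name)

    return sorted(changed)
```

This sketch does not address making the switch appear atomic to a running searcher, which is the "making sure that things appear correctly" concern; a real deployment would need to handle that separately.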
-
Re: Hadoop/Lucene + Solr architecture suggestions?
Lance Norskog 2012-10-12, 05:20
The problem is finding the documents in the search cluster. Making the indexes in the reducer means the final m/r pass has to segregate the documents according to the distribution function. This uses network bandwidth for Hadoop coordination, while using SolrCloud only pushes each document once. It's a disk bandwidth vs. network bandwidth problem. That said, a combiner in the final m/r pass would really speed up the Hadoop shuffle.

Lance
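To make "segregate the documents according to the distribution function" concrete, here is a minimal sketch of a stable shard-assignment function of the kind a final m/r pass would partition on. The names `shard_for` and `group_by_shard` are hypothetical, and CRC32 is only an illustrative choice; it is not claimed to be the hash SolrCloud or any Hadoop partitioner actually uses. What matters is that the indexing job and the search cluster agree on the same function.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(doc_id: str, num_shards: int) -> int:
    """Return a stable shard number for a document id.

    CRC32 is used because Python's built-in hash() is salted per
    process and would assign shards inconsistently across workers.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def group_by_shard(doc_ids: Iterable[str],
                   num_shards: int) -> Dict[int, List[str]]:
    """Group document ids by shard, as a shuffle keyed on shard would."""
    groups = defaultdict(list)
    for doc_id in doc_ids:
        groups[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(groups)
```

With one reducer per shard, each reducer sees exactly the documents belonging to its shard and can build that shard's index in a single pass.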
-
Re: Hadoop/Lucene + Solr architecture suggestions?
Ted Dunning 2012-10-12, 05:23
Your mileage will vary. I find that this exact motion is what makes reduce-side indexing very attractive, since it allows precise sharding control.
-
Re: Hadoop/Lucene + Solr architecture suggestions?
Mark Kerzner 2012-10-11, 02:26
That is very interesting, Lance, thank you.

Mark
-
Re: Hadoop/Lucene + Solr architecture suggestions?
Ivan Frain 2012-10-10, 05:20
Hi Mark,

I don't know Lucene/Solr very well, but your question reminded me of the Lily project: http://www.lilyproject.org/lily/index.html. They use Hadoop/HBase and Solr to provide a searchable data management platform. Maybe you will find ideas in their documentation.

BR,
Ivan