-
Using external indexes in an HBase Map/Reduce job...Michael Segel 2010-10-12, 12:36
Hi,

Now I realize that most everyone is sitting in NY, while some of us can't leave our respective cities....

Came across this problem and I was wondering how others solved it.

Suppose you have a really large table with 1 billion rows of data. Since HBase really doesn't have any indexes built in (Don't get me started about the contrib/transactional stuff...), you're forced to use some sort of external index, or roll your own index table.

The net result is that you end up with a list object that contains your result set.

So the question is... what's the best way to feed the list object in?

One option I thought about is writing the object to a file, using that file as the job input, and then controlling the splits. Not the most efficient but it would work.

Was trying to find a more 'elegant' solution and I'm sure that anyone using SOLR or LUCENE or whatever... had come across this problem too.

Any suggestions?

Thx
-
Re: Using external indexes in an HBase Map/Reduce job...Andrey Stepachev 2010-10-12, 12:54
Hi Michael Segel.
If I understand your question correctly, you are looking for the optimal way to scan index search results? If not, my answer below is not relevant :).

1. For m/r joins or scans over large index results, bloom filters can be used, as described here: http://blog.rapleaf.com/dev/2009/09/25/batch-querying-with-cascading/
2. Another option: denormalize the data in the same or a separate table (depends on the nature of the object relations).
3. Random gets. For each row from solr, issue a random get (for really small result sets or paging).
4. Put compacted data (latest data, a small subset of data, etc.) into the solr index.
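To make option 1 above a little more concrete, here is a minimal sketch of a Bloom filter over row keys. It is an illustration only, not the Hadoop, HBase, or Cascading implementation: the class name `RowKeyBloomFilter`, the sizes, and the hashing scheme are all assumptions chosen for brevity. The idea is that the filter is built once from the index result, shipped to the tasks, and used to discard rows that are definitely not in the result set.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Minimal Bloom filter over row keys (hypothetical helper, for illustration only). */
class RowKeyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    RowKeyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** FNV-1a hash, used as the second base hash. */
    private static int fnv1a(byte[] data) {
        int hash = 0x811c9dc5;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    /** Derives the i-th bit position from two base hashes (double hashing). */
    private int position(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = fnv1a(key);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Records a row key in the filter. */
    void add(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    /** Returns false only if the key was definitely never added. */
    boolean mightContain(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Row keys as they might come back from the external index (made-up values).
        List<String> wanted = Arrays.asList("row-0007", "row-0421", "row-9000");
        RowKeyBloomFilter filter = new RowKeyBloomFilter(1 << 16, 3);
        for (String key : wanted) {
            filter.add(key);
        }
        System.out.println(filter.mightContain("row-0421")); // true: added keys always match
        System.out.println(filter.mightContain("row-1234")); // usually false; false positives are possible
    }
}
```

A Bloom filter never gives false negatives, so no wanted row is dropped, but a few unwanted rows can slip through and need an exact check against the key list afterwards.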
-
RE: Using external indexes in an HBase Map/Reduce job...Buttler, David 2010-10-12, 15:35
Sorry, I am not clear on exactly what you are trying to accomplish here. I have a table roughly of that size, and it doesn't seem to cause me any trouble. I also have a few separate solr indexes for data in the table for query -- the solr query syntax is sufficient for my current needs. This setup allows me to do two things efficiently:
1) batch processing of all records (e.g. tagging records that match particular criteria)
2) search/lookup from a UI in an online manner
3) it is also fairly easy to insert a bunch of records (keeping track of their keys), and then run various batch processes only over those new records -- essentially doing what you suggest: create a file of keys and split the map task over that file.

Dave
-
RE: Using external indexes in an HBase Map/Reduce job...Michael Segel 2010-10-12, 16:13
Thanks for the reply...

That's not exactly what I'm looking for...

Suppose you have an external system which provides you the list of row keys you want. Whatever that system is. So you have a Java List object and you want to do a M/R based on input from that Java List.

What's the best way to do it?
-
RE: Using external indexes in an HBase Map/Reduce job...Michael Segel 2010-10-12, 16:20
Dave,

It's a bit more complicated than that.

What I can say is that I have a billion rows of data. I want to pull a specific 100K rows from the table.

The row keys are not contiguous and you could say they are 'random' such that if I were to do a table scan, I'd have to scan the entire table (all regions).

Now suppose I had a list of the 100K rows. From a single client I could just create 100 threads and grab rows from HBase one at a time in each thread.

But in a m/r, I can't really do that. (I want to do processing on the data I get returned.)

So given a List object with the row keys, how do I do a map reduce with this list as the starting point?

Sure, I could write it to HDFS and then do a m/r reading from the file and setting my own splits to control parallelism. But I'm hoping for a more elegant solution.

I know that it's possible, but I haven't thought it out... Was hoping someone else had this solved.

thx
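The "write the keys to HDFS and set my own splits" idea can be sketched without any Hadoop classes. The snippet below only shows the partitioning step: it cuts the key list into consecutive chunks, one chunk per map task. The class name `KeyListSplitter` and the chunk size are assumptions for illustration. In a real job, Hadoop's NLineInputFormat does roughly the same thing for a text file with one key per line, and each map task would then issue gets for the keys in its chunk.

```java
import java.util.ArrayList;
import java.util.List;

/** Cuts a list of row keys into fixed-size chunks (hypothetical helper, for illustration only). */
class KeyListSplitter {

    /** Returns consecutive chunks of at most keysPerSplit keys each. */
    static <T> List<List<T>> split(List<T> keys, int keysPerSplit) {
        if (keysPerSplit <= 0) {
            throw new IllegalArgumentException("keysPerSplit must be positive");
        }
        List<List<T>> splits = new ArrayList<>();
        for (int start = 0; start < keys.size(); start += keysPerSplit) {
            int end = Math.min(start + keysPerSplit, keys.size());
            // Copy the sublist so each split is independent of the source list.
            splits.add(new ArrayList<>(keys.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // 100K made-up row keys, as in the scenario described in the thread.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            keys.add("row-" + i);
        }
        List<List<String>> splits = split(keys, 1_000);
        System.out.println(splits.size() + " splits, first split has "
                + splits.get(0).size() + " keys");
    }
}
```

The chunk size is the knob that controls parallelism: smaller chunks mean more map tasks, each doing fewer gets.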
-
Re: Using external indexes in an HBase Map/Reduce job...Steven Noels 2010-10-12, 17:21
Did you have a look at Lily? A billion items will be interesting, but we offer M/R index rebuild (against SOLR) and incremental updates as well. You could take a look at the RowLog library we did to do this in a robust way - which has no Lily dependencies.

www.lilyproject.org

Cheers,
Steven.

--
Steven Noels
http://outerthought.org/
Open Source Content Applications
Makers of Kauri, Daisy CMS and Lily
-
Re: Using external indexes in an HBase Map/Reduce job...jason 2010-10-12, 17:55
> What I can say is that I have a billion rows of data.
> I want to pull a specific 100K rows from the table.

Michael, I think I have exactly the same use case. Even the numbers are the same.

I posted a similar question a couple of weeks ago, but unfortunately did not get a definite answer:

http://mail-archives.apache.org/mod_mbox/hbase-user/201009.mbox/%[EMAIL PROTECTED]%3E

So far, I have decided to put HBase aside and experiment with Hadoop directly, using its BloomMapFile and its ability to quickly discard files that do not contain the requested keys. This implies that I need a custom InputFormat for that, many input map files, and a sorted list of input keys.

I do not have any performance numbers yet to compare this approach to the full scan, but I am writing tests as we speak.

Please keep me posted if you find a good solution for this problem in general (M/R scanning through a random key subset, based on either HBase or Hadoop).
-
Re: Using external indexes in an HBase Map/Reduce job...Matthew LeMieux 2010-10-12, 18:53
I've been reading this thread, and I'm still not clear on what the problem is. I saw your original post, but was unclear then as well.
Please correct me if I'm wrong. It sounds like you want to run a M/R job on some data that resides in a table in HBase. But since the table is so large, the M/R job would take a long time to process the entire table, so you want to process only the relevant subset. It also sounds like, since you need M/R, the relevant subset is too large to fit in memory and needs a distributed solution. Is this correct so far?

A solution exists: scan filters. The individual region servers filter the data.

When setting up the M/R job, I use TableMapReduceUtil.initTableMapperJob. That method takes a Scan object as an input. The Scan object can have a filter which is run on the individual region server to limit the data that gets sent to the job. I've written my own filters as well, which are quite simple. But it is a bit of a pain because you have to make sure the custom filter is in the classpath of the servers. I've used it to randomly select a subset of data from HBase for quick test runs of new M/R jobs.

You might be able to use existing filters. I recommend taking a look at the RowFilter as a starting point. I haven't used it, but it takes a WritableByteArrayComparable which could possibly be extended to be based on a bloom filter or a list.

-Matthew
-
RE: Using external indexes in an HBase Map/Reduce job...Michael Segel 2010-10-12, 19:53
All,

Let me clarify ...

The ultimate data we want to process is in HBase.

The data qualifiers are not part of the row key so you would have to do a full table scan to get the data. (A full table scan of 1 billion rows just to find a subset of 100K rows?)

So the idea is: what if I got the set of row keys that I want to process from an external source? I don't mention the source, because it's not important.

What I am looking at is that at the start of my program, I have this Java List object that contains the 100K row keys for the records I want to fetch.

So how can I write a m/r that allows me to split and fetch based on an object, and not a file or an HFile, for input?

Let me give you a concrete but imaginary example...

I have combined all of the DMV vehicle registrations for all of the US.

I want to find all of the cars that are registered to somebody with the last name Smith.

Since the owner's last name isn't part of the row key, I have to do a full table scan. (Not really efficient.)

Suppose I have an external index. I get the list of row keys in a List object.

Now I want to process the list in a m/r job.

So what's the best way to do it?

Can you use an object to feed in to a m/r job? (And that's the key point I'm trying to solve.)

Does that make sense?

-Mike
-
Re: Using external indexes in an HBase Map/Reduce job...Jack Levin 2010-10-12, 20:06
What I would do is load the 100K key values into Hive and do a simple join to produce your output to a local file or another Hive table, which can be auto-loaded into HBase. This process will run M/R, will be quick, and won't do any scans.

-Jack
-
Re: Using external indexes in an HBase Map/Reduce job...Matthew LeMieux 2010-10-12, 20:57
Michael,
This is really more of an M/R question than an HBase question...

The problem is that the other nodes in the cluster don't have access to the memory of the node that has the Java Object. You'll need to copy it to some store that other nodes can read (or create your own infrastructure that lets other nodes get the data from the object node - not recommended). If you are running HBase, then you have at least 3 available to you: DFS, HBase, and ZooKeeper. In order for M/R to use it, there needs to be an InputFormat that knows how to read the data. I know of existing input formats that can support 2 out of 3 of the above: DFS and HBase. You could write your own, but it will be more trouble than it is worth. It is probably best to write the data to one of the two, and have the M/R job read that.

You've probably seen examples that let you pass objects to mappers and reducers using the job configuration (org.apache.hadoop.conf.Configuration). This is meant for configuration items (hence the name) and not large data objects. You could pass the object this way, but there still needs to be some input data for mappers to be started up. So, it is possible to have a dummy file that sends data to the mappers. Once the mapper is started, it can disregard the input data, read the object from the configuration, and then self-select which items in the list to process based on its own identity, or perhaps even the input data. While it is possible, I don't recommend it.

Good luck,

Matthew
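A minimal sketch of the "write the data to DFS and have the job read that" suggestion, under loud assumptions: a local temp file stands in for DFS, the class and method names (KeyListStager, stageToTempFile, load) are hypothetical, and no Hadoop API is used. It only shows the shape of the idea: the in-memory key list becomes a line-per-key file that any node could read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Hadoop or HBase): stages an in-memory
// list of row keys as a text file with one key per line.
final class KeyListStager {

    // Writes one row key per line to a fresh temp file and returns its path.
    // Assumption: keys contain no line breaks; on a cluster the target
    // would be a DFS path rather than a local temp file.
    static Path stageToTempFile(List<String> rowKeys) {
        try {
            Path target = Files.createTempFile("rowkeys", ".txt");
            return Files.write(target, rowKeys, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the keys back in file order, the way a line-oriented reader
    // would hand them to map tasks.
    static List<String> load(Path source) {
        try {
            return Files.readAllLines(source, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> rowKeys = Arrays.asList("row-0001", "row-0042", "row-0913");
        Path staged = stageToTempFile(rowKeys);
        System.out.println(load(staged));
    }
}
```

Once the keys live in a shared file, the job no longer depends on the memory of the node that built the list, which is the constraint described above.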
-
RE: Using external indexes in an HBase Map/Reduce job...Michael Segel 2010-10-13, 00:45
Matthew,

You've finally figured out the problem.

And since the data I ultimately want to get resides in HBase... it's an HBase problem.

Were the list of keys in a file sitting on HDFS, it would be a simple m/r problem: you have a file reader and you set the number of splits.

If the index were an HBase table, you would just scan the index and use the HTable input to drive the map/reduce.

My point was that there isn't a way to take in an object and use that to drive the m/r.

And yes, you've come to the same conclusion I came to before writing the question.

As to whether it is worth it: yes, because right now there is no good indexing solution for HBase when it comes to map/reduce.

I don't think I'm the first one to think about it....

Thx

-Mike
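To illustrate the "file reader plus a chosen number of splits" case, here is a rough sketch of cutting a key list into a fixed number of contiguous chunks, one per map task. KeyChunker and chunk are hypothetical names and nothing here calls Hadoop; a line-count-based input format would presumably do a comparable cut on the staged file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: divides a key list into contiguous chunks.
final class KeyChunker {

    // Cuts keys into at most numSplits contiguous chunks whose sizes
    // differ by at most one; no empty chunks are produced.
    static <T> List<List<T>> chunk(List<T> keys, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        List<List<T>> chunks = new ArrayList<List<T>>();
        int base = keys.size() / numSplits;   // minimum chunk size
        int extra = keys.size() % numSplits;  // first 'extra' chunks get one more
        int start = 0;
        for (int i = 0; i < numSplits && start < keys.size(); i++) {
            int size = base + (i < extra ? 1 : 0);
            chunks.add(new ArrayList<T>(keys.subList(start, start + size)));
            start += size;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("k1", "k2", "k3", "k4", "k5");
        System.out.println(chunk(keys, 2)); // prints [[k1, k2, k3], [k4, k5]]
    }
}
```

With 100K keys and, say, 8 splits, each chunk would hold 12,500 keys, and each map task would issue gets for its own chunk only.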
-
Re: Using external indexes in an HBase Map/Reduce job...Andrey Stepachev 2010-10-13, 06:24
Still don't understand. Looks like you want to optimize scans in HBase.

Let's invent a method for you :).

1. Create your own custom input format that overrides the getSplits method, like this: http://pastebin.org/166201
2. Change each split's start and end to the min and max of your 100k keys that fall inside it. For example, with input keys 1 2 3 100 101 and splits [1:99] [100:500], you can narrow the splits to [1:3] [100:101], because no keys fall in the ranges (3:99] and (101:500].
3. Optionally, count the keys that fall into each range (for example [1:3]) and split once more, into [1:2] [3:3], to get more fine-grained scans.
4. Optionally, implement a bloom scan filter that uses a bloom filter built from the input keys and placed on HDFS to exclude unneeded keys.

All these steps should significantly reduce the number of rows scanned, and step 4 should reduce the number of rows returned.
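The split-narrowing idea in step 2 can be sketched in plain Java. This is only an illustration under stated assumptions: keys are modeled as long values and ranges as inclusive [start:end] pairs, whereas real HBase splits use byte-array row keys; SplitTightener, Range, and tighten are hypothetical names, and the code is not taken from the pastebin link.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of step 2: shrink each split to the wanted keys it holds.
final class SplitTightener {

    // Inclusive key range standing in for a split's start and end row.
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ":" + end + "]";
        }
    }

    // Shrinks every split to [smallest wanted key, largest wanted key]
    // inside it; splits that contain no wanted key are dropped.
    static List<Range> tighten(List<Range> splits, TreeSet<Long> wantedKeys) {
        List<Range> tightened = new ArrayList<Range>();
        for (Range split : splits) {
            Long first = wantedKeys.ceiling(split.start); // smallest key >= start
            Long last = wantedKeys.floor(split.end);      // largest key <= end
            if (first != null && last != null && first <= last) {
                tightened.add(new Range(first, last));
            }
        }
        return tightened;
    }

    public static void main(String[] args) {
        TreeSet<Long> keys = new TreeSet<Long>(Arrays.asList(1L, 2L, 3L, 100L, 101L));
        List<Range> splits = Arrays.asList(new Range(1, 99), new Range(100, 500));
        System.out.println(tighten(splits, keys)); // prints [[1:3], [100:101]]
    }
}
```

The sorted set makes each lookup logarithmic in the number of keys, so narrowing stays cheap even with 100K keys and many regions. Step 3 could reuse the same set to count the keys per range before splitting further.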