-Re: Using external indexes in an HBase Map/Reduce job...
Andrey Stepachev 2010-10-13, 06:24
Still don't understand. Looks like you want to optimize scans in hbase.
Lets invent method for you :).
1. Create you custom input format, which will override getSplits method.
like this http://pastebin.org/166201
2. Change splits.start and split.end to min and max keys in you 100k.
100k input: 1 2 3 100 101
splits: [1:99] [100:500].
You can fix you splits: [1:3] [100:101] becase to keys in ranges[3:99] and
3. Optionally you can count keys which fall into ranges (for example [1:3])
and split once
more: [1:2] [3:3] to get more fine grained scans.
4. Optionally implement bloom scan filter, which will use bloom produced
from input keys
and placed on hdfs to exclude unneeded keys.
All this steps should significally reduce number of scanning rows. and n4
number of returned rows.
2010/10/13 Michael Segel <[EMAIL PROTECTED]>:
> You've finally figured out the problem.
> And since the data resides in HBase which I ultimately want to get... its
an HBase problem.
> Were the list of keys in a file sitting on HDFS, its a simple m/r problem.
You have a file reader and you set the number of splits.
> If the index was an HBase table, you just scan the index and use the
HTable input to drive the map/reduce.
> My point was that there isn't a way to take in an object and use that to
drive the m/r.
> And yes, you've come to the same conclusion I came to before writing the
> As to is it worth it? Yes, because right now there is not a good indexing
solution to HBase when it comes to a map/reduce.
> I don't think I'm the first one to think about it....
>> Subject: Re: Using external indexes in an HBase Map/Reduce job...
>> From: [EMAIL PROTECTED]
>> Date: Tue, 12 Oct 2010 13:57:54 -0700
>> To: [EMAIL PROTECTED]
>> This is really more of an M/R question than an HBase question...
>> The problem is that the other nodes in the cluster don't have
access to the memory of the node that has the Java Object. You'll need to
copy it to some other thing that other nodes can read (or create your own
infrastructure that lets other nodes get the data from the object node - not
recommended). If you are running HBase, then you have at least 3 available
to you: DFS, HBase, and Zookeeper. In order for M/R to use it, there needs
to be an InputFormat that knows how to read the data. I know of existing
input formats that can support 2 out of 3 of the above: DFS and HBase. You
could write your own, but it will be more trouble than it is worth. It is
probably best to write the data to one of the two, and have the M/R job read
>> You've probably seen examples that let you pass objects to mappers
and reducers using the job configuration
(org.apache.hadoop.conf.Configuration). This is meant for configuration
items (hence the name) and not large data objects. You could pass the
object this way, but there still needs to be some input data for mappers to
be started up. So, it is possible to have a dummy file that sends data to
the mappers. Once the mapper is started, it can disregard the input data,
read the object from the configuration, and then self select which items in
the list to process based on its own identity, or perhaps even the input
data. While it is possible, I don't recommend it.
>> Good luck,
>> On Oct 12, 2010, at 12:53 PM, Michael Segel wrote:
>> > All,
>> > Let me clarify ...
>> > The ultimate data we want to process is in HBase.
>> > The data qualifiers are not part of the row key so you would have to do
a full table scan to get the data.
>> > (A full table scan of 1 billion rows just to find a subset of 100K
>> > So the idea is what if I got the set of row_keys that I want to process
from an external source.
>> > I don't mention the source, because its not important.
>> > What I am looking at is that at the start of my program, I have this
java List object that contains my 100K record keys for the records I want to
object and not a file or an hfile for input?
last name Smith.
full table scan. (Not really efficient.)
point I'm trying to solve.)
problem is. I saw your original post, but was unclear then as well.
job on some data that resides in a table in HBase. But, since the table is
so large the M/R job would take a long time to process the entire table, so
you want to only process the relevant subset. It also sounds like since you
need M/R, the relevant subset is too large to fit in memory and needs a
distributed solution. Is this correct so far?
TableMapReduceUtil.initTableMapperJob. That method takes a Scan object as
an input. The Scan object can have a filter which is run on the individual
region server to limit the data that gets sent to the job. I've written my
own filters as well, which are quite simple. But, it is a bit of a pain
because you have to make sure the custom filter is in the classpath of the
servers. I've used it to randomly select a subset of data from HBase for
quick test runs of new M/R jobs.
at the RowFilter as a starting point. I haven't used it, but it takes a
WritableByteArrayComparable which could possibly be extended to be based on
a bloom filter or a list.