On Fri, Apr 15, 2011 at 11:45 AM, W.P. McNeill <[EMAIL PROTECTED]> wrote:
> Thanks for your answer. After mulling over this problem for a few days, I
> believe there might be a clearer way for me to phrase to question, so let me
> try that before diving into the specifics of the linear algebra analysis you
> I need to share an inverted index of elements to sets as described above.
> And crucially this index is *immutable*: after it has been created it only
> has to be read from, never written to. So a clearer way to phrase this
> question is: how do I share a large read-only inverted index among multiple
> MapReduce jobs?
> I can think of two approaches.
> 1. Treat it as a database JOIN on elements operation between the
> original table of sets and the inverted index. This is the tack that Ted was
> suggesting in his response.
> 2. Put the inverted index into a MapFile<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html>.
> The individual jobs load the inverted index at setup() time and do random
> access reads from it as needed.
> A few questions:
> 1. Do others agree that these are the two big classes of solution?
> 1. Do people have a sense of what the pros and cons of each might be?
> (BTW quadratic runtime in the density of the set membership rows is probably
> not a problem; the sets I am dealing with are small relative to the
> vocabulary size and relatively disjoint.)
> If the intersection density is very low then the inverted index approach is
likely to be much better. One helpful tweak would be to put a Bloom filter
in front of the MapFile to avoid probing if you aren't going to get a hit.
Count on about 15-20 bits in the Bloom filter for each element in the
inverted index to get acceptable false positive rates. Can you afford 2-3
bytes of memory use per element in the MapFile?
Another tweak is to sort the input and do the lookups in the reduce. This
will make the accesses to the MapFiles much faster since the accesses will
be much more sequential. This can allow significant caching and might make
the Bloom filter a bit of an albatross.
Finally, this might even be an interesting job for Hbase (storing your
> 1. Is Pig or Hive a good tool to use for solution (1)? (I have a
> feeling the answer might be a 10-line Pig script, but I don't have enough
> SQL experience to just knock one out.)
> 2. For solution (2), will MapFile scale to a map with 10^9 entries?
> (Assuming I use the io.map.index.skip property to make the right
> search-speed/memory tradeoff for my configuration.)