There are no subtle ways to deal with quadratic problems like this. They
just don't scale.
Your suggestions are roughly on course. When matching 10GB against 50GB,
the choice of which input to use as input to the mapper depends a lot on how
much you can buffer in memory and how long such a buffer takes to build.
If you can't store the entire 10GB of data in memory at once, then consider
a program like this:
a) split the 50GB of data across as many mappers as you have using standard
b) in the mapper, emit each record several times with keys of the form (i,
j) where i cycles through [0,n) and is incremented once for each record read
and j cycles through [0, m) and is incremented each time you emit a record.
Choose m so that 1/m of your 10GB data will fit in your reducers memory.
Choose n so that n x m is as large as your desired number of reducers.
c) in the reducer, you will get some key (i,j) and an iterator for a number
of records. Read the i-th segment of your 10GB data and compare each of
the records that the iterator gives you to that data. If you made n = 1 in
step (b), then you will have at most m-way parallelism in this step. If n
is large, however, your reducer may need to read the same segment of your
10GB data more than once. In such conditions you may want to sort the
records and remember which segment you have already read.
In general, though, as I mentioned this is not a scalable process and as
your data grows it is likely to become untenable.
If you can split your data into pieces and estimate which piece each record
should be matched to then you might be able to make the process more
scalable. Consider indexing techniques to do this rough targeting. For
instance, if you are trying to find the closes few strings based on edit
distance, you might be able to use n-grams to get approximate matches via a
text retrieval index. This can substantially reduce the cost of your
On Sun, Apr 10, 2011 at 2:10 PM, oleksiy <[EMAIL PROTECTED]> wrote:
> ... Persistent data and input data don't have commons keys.
> In my cluster I have 5 data nodes.
> The app does simple match every line of input data with every line of
> persistent data.
> And may be there is more subtle way in hadoop to do this work?