The original posting said: "The app simply matches every line of input data
with every line of persistent data." Hence the "key" can be replaced by a
String from the 10 GB store, or by a hash of it, and matched against the
corresponding hash or String from the persistent store.
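For example (a rough sketch to illustrate the idea, not something from the
original thread), the key for a CSV line could be the line itself or an MD5
digest of it:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    // Sketch: derive a join key from a raw CSV line.  Using MD5 here is
    // just an assumption; the full line could be used directly as the key
    // if key size is not a concern.
    public class LineKey {
        public static String keyFor(String csvLine) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(csvLine.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }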
From: Ted Dunning <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Mon, 11 April, 2011 7:38:04 AM
Subject: Re: Architectural question
The original poster said that there was no common key. Your suggestion
presupposes that such a key exists.
On Sun, Apr 10, 2011 at 4:29 PM, Mehmet Tepedelenlioglu <
[EMAIL PROTECTED]> wrote:
> My understanding is that you have two sets of strings, S1 and S2, and you
> want to mark all strings that belong to both sets. If this is correct, then:
> Mapper: for all strings K in Si (i is 1 or 2) emit: key K and value i.
> Reducer: For key K, if the list of values includes both 1 and 2, you have a
> match, emit: K MATCH, else emit: K NO_MATCH (or nothing).
> I assume that the load is not terribly unbalanced. The same logic works for
> the intersection of any number of sets: mark the members with their sets,
> and reduce over them to see whether they belong to every set.
> Good luck.
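A rough Java sketch of that mapper/reducer pair (illustrative only; the class
names, and the assumption that the two sets can be told apart by their input
paths, are assumptions rather than something from the thread):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SetIntersection {

        // Tags each line with 1 or 2 depending on which set it came from,
        // assuming the two sets live under paths containing "set1"/"set2".
        public static class TagMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final IntWritable tag = new IntWritable();

            @Override
            protected void setup(Context context) {
                String path =
                    ((FileSplit) context.getInputSplit()).getPath().toString();
                tag.set(path.contains("set1") ? 1 : 2);
            }

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                context.write(line, tag);
            }
        }

        // Emits "MATCH" only for keys that were seen in both sets.
        public static class IntersectReducer
                extends Reducer<Text, IntWritable, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> tags,
                                  Context context)
                    throws IOException, InterruptedException {
                boolean inSet1 = false, inSet2 = false;
                for (IntWritable t : tags) {
                    if (t.get() == 1) inSet1 = true; else inSet2 = true;
                }
                if (inSet1 && inSet2) {
                    context.write(key, new Text("MATCH"));
                }
            }
        }
    }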
> On Apr 10, 2011, at 2:10 PM, oleksiy wrote:
> > Hi all,
> > I have an architectural question.
> > For my app I have 50 GB of persistent data stored in HDFS, as a simple
> > CSV-format file.
> > The app that runs over this (50 GB) data also takes 10 GB of input data,
> > again in CSV format.
> > The persistent data and the input data don't have common keys.
> > In my cluster I have 5 data nodes.
> > The app simply matches every line of input data with every line of
> > persistent data.
> > To solve this task I see two different approaches:
> > 1. Distribute the input file to every node using the -files option and run
> > the job (a sketch of this follows below). But in this case every map will
> > go through the 10 GB of input data.
> > 2. Divide the input file (10 GB) into 5 parts (for instance), run 5
> > independent jobs (one per data node, for instance), and give each job 2 GB
> > of data. In this case every map only goes through 2 GB of data; in other
> > words, I give every map node its own input data. But the drawback of this
> > approach is the extra work I have to do before the job starts and after it
> > finishes.
> > And maybe there is a more subtle way in Hadoop to do this?
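As a rough sketch of approach 1 (illustrative only; the local file name
"input.csv" and the choice to cache its lines in memory are assumptions, not
from the thread): a file shipped with -files appears under its own name in
each task's working directory, so a map task can load it once in setup() and
match the incoming persistent-data lines against it.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Each map task reads the whole shipped file once in setup() and keeps
    // its lines in a set; this assumes the shipped lines (or hashes of them)
    // fit in task memory.
    public class CrossMatchMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        private final Set<String> inputLines = new HashSet<String>();

        @Override
        protected void setup(Context context) throws IOException {
            // "input.csv" is the hypothetical local name of the file passed
            // via -files.
            BufferedReader reader =
                new BufferedReader(new FileReader("input.csv"));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    inputLines.add(line);
                }
            } finally {
                reader.close();
            }
        }

        // The map input is the 50 GB persistent data; emit every line that
        // also occurs in the shipped input file.
        @Override
        protected void map(LongWritable offset, Text persistentLine,
                           Context context)
                throws IOException, InterruptedException {
            if (inputLines.contains(persistentLine.toString())) {
                context.write(persistentLine, NullWritable.get());
            }
        }
    }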