oleksiy 2011-04-10, 21:10
Mehmet Tepedelenlioglu 2011-04-10, 23:29
Re: Architectural question
Ted Dunning 2011-04-11, 02:08

The original poster said that there was no common key. Your suggestion
presupposes that such a key exists.

On Sun, Apr 10, 2011 at 4:29 PM, Mehmet Tepedelenlioglu <[EMAIL PROTECTED]> wrote:

> My understanding is that you have two sets of strings, S1 and S2, and you
> want to mark all strings that belong to both sets. If this is correct, then:
>
> Mapper: for all strings K in Si (i is 1 or 2) emit: key K and value i.
> Reducer: For key K, if the list of values includes both 1 and 2, you have a
> match, emit: K MATCH, else emit: K NO_MATCH (or nothing).
>
> I assume that the load is not terribly unbalanced. The same logic works for
> the intersection of any number of sets: mark the members with their sets,
> then reduce over them to see whether they belong to every set.
>
> Good luck.
>
> On Apr 10, 2011, at 2:10 PM, oleksiy wrote:
>
> > Hi all,
> > I have an architectural question.
> > For my app I have 50 GB of persistent data, which is stored in HDFS as a
> > simple CSV file.
> > The app runs over this 50 GB of data, and for it I also have 10 GB of
> > input data, also in CSV format.
> > The persistent data and the input data don't have common keys.
> >
> > In my cluster I have 5 data nodes.
> > The app simply matches every line of the input data against every line of
> > the persistent data.
> >
> > For solving this task I see two different approaches:
> > 1. Distribute the input file to every node using the -files attribute,
> > and run the job. But in this case every map will go through the 10 GB of
> > input data.
> > 2. Divide the input file (10 GB) into 5 parts (for instance) and run 5
> > independent jobs (one per data node, for instance), giving every job 2 GB
> > of data. In this case every map should go through 2 GB of data. In other
> > words, I'll give every map node its own input data. But the drawback of
> > this approach is the work I have to do before the job starts and after
> > the job finishes.
> >
> > Maybe there is a more subtle way to do this work in Hadoop?
> >
> > --
> > View this message in context:
> > http://old.nabble.com/Architectural-question-tp31365863p31365863.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
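
For reference, the mapper/reducer scheme quoted above can be sketched in a few lines of plain Python. This is only a minimal illustration, not Hadoop code: the function names (`mapper`, `shuffle`, `reducer`, `intersect`) and the sample strings are hypothetical, and `shuffle` merely stands in for the grouping the framework performs between the map and reduce phases. It also assumes that lines match by exact equality, so the line itself can serve as the key, which is exactly the assumption questioned in the reply.

```python
from collections import defaultdict


def mapper(set_id, strings):
    """Emit (K, set_id) for every string K in the given set."""
    for k in strings:
        yield k, set_id


def shuffle(pairs):
    """Group the emitted set ids by key (stand-in for the framework's shuffle)."""
    grouped = defaultdict(set)
    for key, value in pairs:
        grouped[key].add(value)
    return grouped


def reducer(key, set_ids, n_sets=2):
    """A key is a match only if it was emitted from every one of the n_sets sets."""
    return key, "MATCH" if len(set_ids) == n_sets else "NO_MATCH"


def intersect(s1, s2):
    """Run the map, shuffle, and reduce steps over two in-memory sets."""
    pairs = list(mapper(1, s1)) + list(mapper(2, s2))
    return dict(reducer(k, ids) for k, ids in shuffle(pairs).items())


# Hypothetical sample data: "b" and "c" appear in both sets.
result = intersect(["a", "b", "c"], ["b", "c", "d"])
```

If the match between two lines is anything other than equality (a fuzzy or computed comparison, say), no such key exists and every input line still has to be compared against every persistent line.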
sumit ghosh 2011-04-11, 08:41
Mehmet Tepedelenlioglu 2011-04-11, 15:42
Ted Dunning 2011-04-11, 02:07
Daniel McEnnis 2011-04-11, 02:16