Hadoop >> mail # user >> Architectural question


oleksiy 2011-04-10, 21:10
Mehmet Tepedelenlioglu 2011-04-10, 23:29
Re: Architectural question
The original poster said that there was no common key.  Your suggestion
presupposes that such a key exists.
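
To make the difference concrete, here is a minimal sketch in plain Python (not Hadoop code) of the key-based scheme quoted below. The function names and the sample lines are made up for illustration. It groups lines by the whole string as the key, so it only reports lines that are exactly equal in both sets; any looser matching rule between two different lines is invisible to it.

```python
from collections import defaultdict

def map_phase(set_id, strings):
    # Emit (key, set id) for every string, mirroring the quoted Mapper step.
    for s in strings:
        yield s, set_id

def reduce_phase(pairs, num_sets=2):
    # Group the set ids by key, then mark keys that were seen in every set.
    groups = defaultdict(set)
    for key, set_id in pairs:
        groups[key].add(set_id)
    return {key: ("MATCH" if len(ids) == num_sets else "NO_MATCH")
            for key, ids in groups.items()}

# Hypothetical sample data: only "beta,2" appears verbatim in both sets.
s1 = ["alpha,1", "beta,2"]
s2 = ["beta,2", "beta,3"]

pairs = list(map_phase(1, s1)) + list(map_phase(2, s2))
result = reduce_phase(pairs)
```

Here "beta,3" comes out as NO_MATCH even though an application-specific rule might well pair it with "beta,2". That is the sense in which the scheme needs a common key.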

On Sun, Apr 10, 2011 at 4:29 PM, Mehmet Tepedelenlioglu <
[EMAIL PROTECTED]> wrote:

> My understanding is that you have two sets of strings, S1 and S2, and you
> want to mark all strings that belong to both sets. If this is correct, then:
>
> Mapper: for all strings K in Si (i is 1 or 2) emit: key K and value i.
> Reducer: For key K, if the list of values includes both 1 and 2, you have a
> match, emit: K MATCH, else emit: K NO_MATCH (or nothing).
>
> I assume that the load is not terribly unbalanced. The same logic works for
> the intersection of any number of sets: mark the members with their sets,
> then reduce over them to see whether they belong to every set.
>
> Good luck.
>
>
> On Apr 10, 2011, at 2:10 PM, oleksiy wrote:
>
> >
> > Hi all,
> > I have an architectural question.
> > For my app I have 50 GB of persistent data stored in HDFS as a simple CSV
> > file.
> > The app runs over this 50 GB of data and also takes 10 GB of input data,
> > also in CSV format.
> > The persistent data and the input data have no common keys.
> >
> > In my cluster I have 5 data nodes.
> > The app simply matches every line of the input data against every line of
> > the persistent data.
> >
> > To solve this task I see two different approaches:
> > 1. Distribute the input file to every node using the -files option, and
> > run the job. But in this case every map will go through the full 10 GB of
> > input data.
> > 2. Divide the input file (10 GB) into 5 parts (for instance) and run 5
> > independent jobs (one per data node, for instance), giving every job 2 GB
> > of data. In this case every map only has to go through 2 GB of data. In
> > other words, I'll give every map node its own input data. The drawback of
> > this approach is the work I have to do before the jobs start and after
> > they finish.
> >
> > Or maybe there is a more subtle way in Hadoop to do this work?
> >
> > --
> > View this message in context:
> > http://old.nabble.com/Architectural-question-tp31365863p31365863.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
>
>
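
For comparison, the two approaches from the original question can be modeled in a few lines of plain Python. This is only a toy sketch, not Hadoop code. The `matches` rule, the `run_job` helper and the sample lines are all hypothetical, since the thread does not say what a match is. Each inner list of `persistent_splits` stands for the slice of persistent data handled by one map task.

```python
def matches(input_line, persistent_line):
    # Hypothetical rule: the first CSV field of the input line occurs
    # somewhere in the persistent line. The real rule is not given.
    return input_line.split(",")[0] in persistent_line

def run_job(persistent_splits, input_lines):
    # One "job": every map task owns one persistent split and scans all of
    # the input lines handed to this job.
    found, comparisons = set(), 0
    for split in persistent_splits:        # one iteration = one map task
        for p in split:
            for i in input_lines:
                comparisons += 1
                if matches(i, p):
                    found.add((i, p))
    return found, comparisons

persistent_splits = [["x1,foo", "x2,bar"], ["x3,foo bar", "x4,baz"]]
input_lines = ["foo,a", "baz,b", "qux,c", "bar,d"]

# Approach 1: one job, every map task sees the whole input file.
all_found, all_cmp = run_job(persistent_splits, input_lines)

# Approach 2: split the input into parts and run one job per part.
parts = [input_lines[:2], input_lines[2:]]
part_found, part_cmp = set(), 0
for part in parts:
    f, c = run_job(persistent_splits, part)
    part_found |= f
    part_cmp += c
```

In this toy model both approaches find the same pairs and perform the same total number of comparisons. Splitting the input shortens the scan inside each map task but multiplies the number of tasks, so whether approach 2 pays off on a real cluster depends on factors the sketch does not capture, such as memory per task and job startup overhead.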
sumit ghosh 2011-04-11, 08:41
Mehmet Tepedelenlioglu 2011-04-11, 15:42
Ted Dunning 2011-04-11, 02:07
Daniel McEnnis 2011-04-11, 02:16