Re: Architectural question
The original poster said that there was no common key.  Your suggestion
presupposes that such a key exists.

On Sun, Apr 10, 2011 at 4:29 PM, Mehmet Tepedelenlioglu <
[EMAIL PROTECTED]> wrote:

> My understanding is that you have two sets of strings, S1 and S2, and you
> want to mark all strings that belong to both sets. If this is correct, then:
>
> Mapper: for all strings K in Si (i is 1 or 2) emit: key K and value i.
> Reducer: For key K, if the list of values includes both 1 and 2, you have a
> match, emit: K MATCH, else emit: K NO_MATCH (or nothing).
>
> I assume that the load is not terribly unbalanced. The same logic works for
> the intersection of any number of sets: mark the members with their sets, and
> reduce over them to see if they belong to every set.
>
> Good luck.
>
>
> On Apr 10, 2011, at 2:10 PM, oleksiy wrote:
>
> >
> > Hi all,
> > I have an architectural question.
> > For my app I have 50 GB of persistent data stored in HDFS; the data is a
> > simple CSV-format file.
> > The app that runs over this (50 GB) data also takes 10 GB of input data,
> > also in CSV format.
> > The persistent data and the input data don't have common keys.
> >
> > In my cluster I have 5 data nodes.
> > The app simply matches every line of the input data against every line of
> > the persistent data.
> >
> > To solve this task I see two different approaches:
> > 1. Distribute the input file to every node using the -files option and run
> > the job. But in this case every map will go through the 10 GB of input data.
> > 2. Divide the input file (10 GB) into 5 parts (for instance), run 5
> > independent jobs (one per data node, for instance), and give every job 2 GB
> > of data. In this case every map would go through only 2 GB of data. In other
> > words, I'd give every map node its own input data. But the drawback of this
> > approach is the work I would have to do before the job starts and after it
> > finishes.
> >
> > And maybe there is a more subtle way in Hadoop to do this work?
> >
> > --
> > View this message in context:
> > http://old.nabble.com/Architectural-question-tp31365863p31365863.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
>
>
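
For reference, here is a minimal sketch of the tagging-and-intersection job Mehmet describes above, assuming Hadoop's Java MapReduce API. The /set1/ and /set2/ input directory layout and all class names are illustrative, not from the thread, and as noted at the top of this message the approach only applies when the strings themselves can serve as keys.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SetIntersection {

  // Tag every string with the set it came from (1 or 2), derived here from a
  // hypothetical /set1/ vs. /set2/ input directory layout.
  public static class TagMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final IntWritable setId = new IntWritable();

    @Override
    protected void setup(Context context) {
      String path = ((FileSplit) context.getInputSplit()).getPath().toString();
      setId.set(path.contains("/set1/") ? 1 : 2);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      context.write(line, setId);   // emit: key K, value i
    }
  }

  // A string K is a match if its list of values includes both 1 and 2.
  public static class IntersectReducer
      extends Reducer<Text, IntWritable, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      boolean inSet1 = false, inSet2 = false;
      for (IntWritable v : values) {
        if (v.get() == 1) inSet1 = true;
        else inSet2 = true;
      }
      context.write(key, new Text(inSet1 && inSet2 ? "MATCH" : "NO_MATCH"));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "set-intersection");
    job.setJarByClass(SetIntersection.class);
    job.setMapperClass(TagMapper.class);
    job.setReducerClass(IntersectReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // .../set1/
    FileInputFormat.addInputPath(job, new Path(args[1]));   // .../set2/
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Tagging each record with its source in the mapper and checking for both tags in the reducer is the usual reduce-side pattern for set operations; the same reducer generalizes to the intersection of any number of sets.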