Re: Hadoop job using multiple input files
You put the files into a common directory and use that as the input to the
MapReduce job. You write a single Mapper class that has an "if" statement
examining the map.input.file property, outputting "number" as the key for
both files, but "address" as the value for one and "name" for the other. By
using a common key ("number"), you ensure that the name and address make it
to the same reducer after the shuffle. In the reducer, you'll then have the
relevant information (in the values) you need to create the name, address
pair.
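
For concreteness, here's a minimal sketch of that approach using the old
org.apache.hadoop.mapred API (current at the time of this thread). The
details are illustrative assumptions, not anything from your setup: it
guesses which file a record came from by looking for "names" in
map.input.file, assumes comma-separated records, and tags values with "N:" /
"A:" prefixes so the reducer can tell them apart. Adjust to your actual file
names and record format.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class NameAddressJoin {

  public static class JoinMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private boolean isNameFile;

    public void configure(JobConf conf) {
      // "map.input.file" holds the path of the split this task is reading.
      // Assumption: the Name,Number file has "names" in its file name.
      String inputFile = conf.get("map.input.file");
      isNameFile = inputFile != null && inputFile.contains("names");
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      String[] fields = value.toString().split(",");
      if (isNameFile) {
        // File 1: Name, Number  ->  emit (Number, "N:Name")
        output.collect(new Text(fields[1].trim()),
                       new Text("N:" + fields[0].trim()));
      } else {
        // File 2: Number, Address  ->  emit (Number, "A:Address")
        output.collect(new Text(fields[0].trim()),
                       new Text("A:" + fields[1].trim()));
      }
    }
  }

  public static class JoinReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      String name = null;
      String address = null;
      // All records sharing a number arrive at the same reduce call.
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("N:")) {
          name = v.substring(2);
        } else if (v.startsWith("A:")) {
          address = v.substring(2);
        }
      }
      if (name != null && address != null) {
        // Emit the joined record: Name, Address
        output.collect(new Text(name), new Text(address));
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(NameAddressJoin.class);
    conf.setJobName("name-address-join");
    conf.setMapperClass(JoinMapper.class);
    conf.setReducerClass(JoinReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    // Both input files live in one directory, which is the job's single input.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Point the job at the directory holding both files as its single input path,
and each reduce call will see the name and the address that share a number.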

On Fri, Feb 6, 2009 at 2:17 AM, Amandeep Khurana <[EMAIL PROTECTED]> wrote:

> Thanks Jeff...
> I am not 100% clear about the first solution you have given. How do I get
> the multiple files to be read and then fed into a single reducer? Should I
> have multiple mappers in the same class with different job configs for
> them, or run two separate jobs, with one outputting (name, number) and the
> other outputting (number, address) into the reducer?
> I'm not clear on what I'd be doing with map.input.file here...
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher <[EMAIL PROTECTED]> wrote:
>
> > Hey Amandeep,
> >
> > You can get the file name for a task via the "map.input.file" property.
> > For the join you're doing, you could inspect this property and output
> > (number, name) and (number, address) as your (key, value) pairs, depending
> > on the file you're working with. Then you can do the combination in your
> > reducer.
> >
> > You could also check out the join package in contrib/utils
> > (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html),
> > but I'd say your job is simple enough that you'll get it done faster with
> > the above method.
> >
> > This task would be a simple join in Hive, so you could consider using
> > Hive to manage the data and perform the join.
> >
> > Later,
> > Jeff
> >
> > On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana <[EMAIL PROTECTED]> wrote:
> >
> > > Is it possible to write a map reduce job using multiple input files?
> > >
> > > For example:
> > > File 1 has data like - Name, Number
> > > File 2 has data like - Number, Address
> > >
> > > Using these, I want to create a third file which has something like -
> > > Name, Address
> > >
> > > How can a map reduce job be written to do this?
> > >
> > > Amandeep
> > >
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> >
>