Hadoop >> mail # user >> Hadoop job using multiple input files


Amandeep Khurana 2009-02-06, 09:34
Re: Hadoop job using multiple input files
Hey Amandeep,

You can get the name of the file a map task is reading via the
"map.input.file" property. For the join you're doing, you could inspect this
property and output either (number, name) or (number, address) as your
(key, value) pair, depending on the file you're working with. Since both
kinds of record are keyed on number, you can then do the combination in your
reducer.
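
To make the idea concrete, here is a minimal sketch of that reduce-side join in plain Java. It deliberately does not use the Hadoop API: the `inputFile` argument stands in for the value you would read from "map.input.file", a `TreeMap` stands in for the shuffle, and the file names (`names.txt`, `addresses.txt`) and the "N:" / "A:" tags are made up for illustration. In a real job the same logic would live in your Mapper and Reducer classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceSideJoinSketch {

    // Map step: emit (number, tagged value). The tag records which file the
    // value came from. inputFile is a stand-in for "map.input.file".
    static String[] map(String inputFile, String line) {
        String[] fields = line.split(",", 2);
        String first = fields[0].trim();
        String second = fields[1].trim();
        if (inputFile.endsWith("names.txt")) {       // hypothetical file name
            // File 1 is "Name, Number": key on the number, tag the name.
            return new String[] { second, "N:" + first };
        }
        // File 2 is "Number, Address": key on the number, tag the address.
        return new String[] { first, "A:" + second };
    }

    // Reduce step: all values for one number arrive together. Separate them
    // by tag and pair every name with every address.
    static List<String> reduce(List<String> taggedValues) {
        List<String> names = new ArrayList<>();
        List<String> addresses = new ArrayList<>();
        for (String value : taggedValues) {
            if (value.startsWith("N:")) {
                names.add(value.substring(2));
            } else {
                addresses.add(value.substring(2));
            }
        }
        List<String> joined = new ArrayList<>();
        for (String name : names) {
            for (String address : addresses) {
                joined.add(name + ", " + address);
            }
        }
        return joined;
    }

    // Driver: runs map over every line, groups by key (the "shuffle"),
    // then runs reduce once per key.
    static List<String> join(Map<String, List<String>> linesByFile) {
        Map<String, List<String>> shuffled = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : linesByFile.entrySet()) {
            for (String line : file.getValue()) {
                String[] kv = map(file.getKey(), line);
                shuffled.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
            }
        }
        List<String> result = new ArrayList<>();
        for (List<String> values : shuffled.values()) {
            result.addAll(reduce(values));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> input = new LinkedHashMap<>();
        input.put("names.txt", Arrays.asList("Alice, 1", "Bob, 2"));
        input.put("addresses.txt", Arrays.asList("2, 5 Oak St", "1, 9 Elm Rd"));
        for (String row : join(input)) {
            System.out.println(row);
        }
    }
}
```

Numbers that appear in only one of the two files produce no output here (an inner join); if you need those rows, the reducer is the place to handle an empty names or addresses list.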

You could also check out the join package in contrib/utils (
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html),
but I'd say your job is simple enough that you'll get it done faster with
the above method.

This task would be a simple join in Hive, so you could consider using Hive
to manage the data and perform the join.

Later,
Jeff

On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana <[EMAIL PROTECTED]> wrote:

> Is it possible to write a map reduce job using multiple input files?
>
> For example:
> File 1 has data like - Name, Number
> File 2 has data like - Number, Address
>
> Using these, I want to create a third file which has something like - Name,
> Address
>
> How can a map reduce job be written to do this?
>
> Amandeep
>
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>