Re: Reading multiple input files.
Yong, this is very helpful! Thanks.

Still trying to wrap my head around all this :)

Let's stick to this hypothetical scenario where my data files are located on
different servers, for example,

machine-1:/foo/bar.txt
machine-2:/foo/bar.txt
machine-3:/foo/bar.txt
machine-4:/foo/bar.txt
machine-5:/foo/bar.txt
...........

So how does Hadoop determine how many mappers it needs? Can I run my job
like this?

hadoop MyJob -input /foo -output output

Kim
On Fri, Jan 10, 2014 at 8:04 AM, java8964 <[EMAIL PROTECTED]> wrote:

> Yes.
>
> Hadoop is very flexible about the underlying storage system. It is in your
> control how to utilize the cluster's resources, including CPU, memory, IO,
> and network bandwidth.
>
> Check out Hadoop's NLineInputFormat; it may be the right choice for your
> case.
>
> You can put all the metadata of your files (data) into one text file, and
> send this text file to your MR job.
>
> Each mapper will get one line of text from the above file, and start to
> process the data represented by that one line of text.
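>
> Here is a minimal, illustrative sketch of that approach (the class names and
> the map body are made up; how you actually fetch and process each remote
> location is up to you):
>
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
> public class FileListDriver {
>
>   // Each map() call gets one line of the metadata file, e.g.
>   // "machine-1:/foo/bar.txt". Fetching and processing that file is your code.
>   public static class FileListMapper
>       extends Mapper<LongWritable, Text, Text, NullWritable> {
>     @Override
>     protected void map(LongWritable offset, Text line, Context context)
>         throws IOException, InterruptedException {
>       String location = line.toString().trim();
>       // ... open `location` remotely, process it, emit real results here ...
>       context.write(new Text(location), NullWritable.get());
>     }
>   }
>
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     Job job = Job.getInstance(conf, "process file list");
>     job.setJarByClass(FileListDriver.class);
>     job.setMapperClass(FileListMapper.class);
>     job.setNumReduceTasks(0);                      // map-only job
>     job.setInputFormatClass(NLineInputFormat.class);
>     NLineInputFormat.setNumLinesPerSplit(job, 1);  // one line per mapper
>     FileInputFormat.addInputPath(job, new Path(args[0]));  // metadata text file
>     FileOutputFormat.setOutputPath(job, new Path(args[1]));
>     job.setOutputKeyClass(Text.class);
>     job.setOutputValueClass(NullWritable.class);
>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>   }
> }
>
> With one line per split, a metadata file listing 100 locations gives you 100
> map tasks, one per location.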
>
> Is it a good solution for you? You have to judge that yourself. Keep the
> following in mind:
>
> 1) Normally, the above approach is good for an MR job that loads data from
> a third-party system, especially for CPU-intensive jobs.
> 2) You do utilize the cluster: if you have 100 mapper tasks and 100 files
> to be processed, you get pretty good concurrency.
>
> But:
>
> 1) Are your files (or data) evenly split across the third-party system?
> In the above example, for 100 files (or chunks of data), if one file is
> 10G and the rest are only 100M, then one mapper will take MUCH longer than
> the rest. You will have a long-tail problem, which hurts overall performance.
> 2) There is NO data-locality advantage compared to HDFS. All the mappers
> need to load the data from the third-party system remotely.
> 3) If each file (or chunk of data) is very large, what about failover? For
> example, if you have 100 mapper task slots but only 20 files, with 10G of
> data each, then you under-utilize your cluster resources: only 20 mappers
> will handle them, and the remaining 80 mapper tasks will just sit idle. More
> importantly, if one mapper fails, all the data it has already processed has
> to be discarded, and another mapper has to restart that chunk of data from
> the beginning. Your overall performance suffers.
>
> As you can see, you get a lot of benefits from HDFS, and here you lose all
> of them. Sometimes you have no other choice but to load the data on the fly
> from some 3rd-party system. But you need to think about the points above,
> and try to get from the 3rd-party system, if you can, the same benefits that
> HDFS would provide.
>
> Yong
>
> ------------------------------
> Date: Fri, 10 Jan 2014 01:21:19 -0800
> Subject: Reading multiple input files.
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
>
>
> How does a MR job read multiple input files from different locations?
>
> What if the input files are not in HDFS and are located on different
> servers? Do I have to copy them to HDFS first and instruct my MR job to read
> from there? Or can I instruct my MR job to read directly from those servers?
>
> Thanks.
>
> Kim
>