MapReduce, mail # user - Reading multiple input files.


Re: Reading multiple input files.
Kim Chew 2014-01-10, 22:18
Yong, this is very helpful! Thanks.

Still trying to wrap my head around all this :)

Let's stick with the hypothetical scenario where my data files are located on
different servers, for example:

machine-1:/foo/bar.txt
machine-2:/foo/bar.txt
machine-3:/foo/bar.txt
machine-4:/foo/bar.txt
machine-5:/foo/bar.txt
...........

So how does Hadoop determine how many mappers it needs? Can I run my job
like this?

hadoop MyJob -input /foo -output output

Kim
On Fri, Jan 10, 2014 at 8:04 AM, java8964 <[EMAIL PROTECTED]> wrote:

> Yes.
>
> Hadoop is very flexible about the underlying storage system. It is in your
> control how to utilize the cluster's resources, including CPU, memory, IO
> and network bandwidth.
>
> Check out Hadoop's NLineInputFormat; it may be the right choice for your
> case.
>
> You can put the metadata of all your files (data) into one text file, and
> feed this text file to your MR job.
>
> Each mapper will get one line of text from that file, and start to process
> the data represented by that line.
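>
> For example, a minimal (hypothetical) driver sketch of this approach could
> look like the following; the metadata file path /user/kim/file-list.txt and
> the class names are made up for illustration:
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.NullWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Job;
>     import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
>     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>     import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
>
>     public class RemoteFileJob {
>       public static void main(String[] args) throws Exception {
>         Job job = Job.getInstance(new Configuration(), "remote-file-job");
>         job.setJarByClass(RemoteFileJob.class);
>
>         // Each map task gets one line of the metadata file.
>         job.setInputFormatClass(NLineInputFormat.class);
>         NLineInputFormat.addInputPath(job, new Path("/user/kim/file-list.txt"));
>         NLineInputFormat.setNumLinesPerSplit(job, 1);
>
>         job.setMapperClass(RemoteFileMapper.class);  // sketched further below
>         job.setNumReduceTasks(0);                    // map-only job
>         job.setOutputKeyClass(NullWritable.class);
>         job.setOutputValueClass(Text.class);
>         job.setOutputFormatClass(TextOutputFormat.class);
>         FileOutputFormat.setOutputPath(job, new Path(args[0]));
>
>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>       }
>     }
>
> With one line per split, the number of mappers equals the number of lines
> in the metadata file, regardless of where the actual data lives.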
>
> Is it a good solution for you? You have to judge that yourself. Keep the
> following in mind:
>
> 1) Normally, this approach works well for an MR job that loads data from a
> third-party system, especially for CPU-intensive jobs.
> 2) You do utilize the cluster: if you have 100 mapper tasks and 100 files
> to be processed, you get pretty good concurrency (see the mapper sketch
> just below).
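>
> A rough sketch of a matching mapper, assuming each line of the metadata
> file looks like "machine-1:/foo/bar.txt" and that the remote servers expose
> the files over HTTP; the fetch is only a placeholder, so use whatever
> protocol your third-party system actually speaks:
>
>     import java.io.BufferedReader;
>     import java.io.IOException;
>     import java.io.InputStreamReader;
>     import java.net.URL;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.NullWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Mapper;
>
>     public class RemoteFileMapper
>         extends Mapper<LongWritable, Text, NullWritable, Text> {
>
>       @Override
>       protected void map(LongWritable key, Text value, Context context)
>           throws IOException, InterruptedException {
>         // With NLineInputFormat, 'value' is one line of the metadata file,
>         // e.g. "machine-1:/foo/bar.txt".
>         String location = value.toString().trim();
>         if (location.isEmpty()) {
>           return;
>         }
>
>         // Placeholder fetch: assumes http://<host>/<path>. Replace with the
>         // real access method (NFS mount, REST API, scp, ...).
>         String[] parts = location.split(":", 2);
>         URL url = new URL("http://" + parts[0] + parts[1]);
>         try (BufferedReader reader =
>                  new BufferedReader(new InputStreamReader(url.openStream()))) {
>           String line;
>           while ((line = reader.readLine()) != null) {
>             // Real per-record processing would go here; this just echoes.
>             context.write(NullWritable.get(), new Text(line));
>           }
>         }
>       }
>     }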
>
> But:
>
> 1) Is your data evenly split across the third-party system? In the above
> example, with 100 files (or chunks of data), if one file is 10G and the
> rest are only 100M, then one mapper will take MUCH longer than the rest.
> You will have a long-tail problem, which hurts overall performance.
> 2) There is NO data locality advantage compared to HDFS. All the mappers
> need to load the data from the third-party system remotely.
> 3) If each file (or chunk of data) is very large, what about failover? For
> example, if you have 100 mapper task slots but only 20 files of 10G each,
> you under-utilize your cluster resources, as only 20 mappers handle them
> while the remaining 80 mapper slots sit idle. More importantly, if one
> mapper fails, all of its already-processed data has to be discarded, and
> another mapper has to restart that chunk from the beginning. Your overall
> performance suffers.
>
> As you can see, HDFS gives you a lot of benefits, and here you lose all of
> them. Sometimes you have no other choice and must load the data on the fly
> from some 3rd-party system. But you need to think about the points above,
> and try to get from the 3rd-party system whatever benefits HDFS would
> otherwise give you, if you can.
>
> Yong
>
> ------------------------------
> Date: Fri, 10 Jan 2014 01:21:19 -0800
> Subject: Reading multiple input files.
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
>
>
> How does an MR job read multiple input files from different locations?
>
> What if the input files are not in HDFS and are located on different
> servers? Do I have to copy them to HDFS first and instruct my MR job to
> read from them? Can I instruct my MR job to read directly from those
> servers?
>
> Thanks.
>
> Kim
>