-Re: Reading multiple input files.
Kim Chew 2014-01-10, 22:18
Yong, this is very helpful! Thanks.
Still trying to wrap my head around all this :)
Let's stick with the hypothetical scenario where my data files are located on
different servers. For example, can I run my job like this:

hadoop MyJob -input /foo -output output

And how does Hadoop determine how many mappers it needs?
On Fri, Jan 10, 2014 at 8:04 AM, java8964 <[EMAIL PROTECTED]> wrote:
> Hadoop is very flexible about its underlying storage system. How you
> utilize the cluster's resources, including CPU, memory, IO and network
> bandwidth, is in your control.
> Check out Hadoop's NLineInputFormat; it may be the right choice for your
> use case.
> You can put the metadata of all your files (data) into one text file, and
> send this text file to your MR job.
> Each mapper will get one line of text from that file, and start to
> process the data represented by that line.
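To make the NLineInputFormat idea concrete, here is a minimal sketch in plain Java (no Hadoop dependency; the URLs are made up) of how a metadata file gets split into per-mapper inputs. With one line per split, you get exactly one mapper per remote file:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NLineSplitDemo {
    // Mimics what NLineInputFormat does: group the lines of the metadata
    // file into splits of N lines each; each split feeds one mapper.
    static List<List<String>> splitIntoMapperInputs(List<String> lines, int n) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            splits.add(lines.subList(i, Math.min(i + n, lines.size())));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Each line is the metadata of one remote file (hypothetical URLs).
        List<String> metadata = Arrays.asList(
            "http://server1/data/part-001",
            "http://server2/data/part-002",
            "http://server3/data/part-003");
        // With N = 1, the framework launches one mapper per line; each
        // mapper then fetches and processes the file its line describes.
        List<List<String>> splits = splitIntoMapperInputs(metadata, 1);
        System.out.println(splits.size() + " mappers");
    }
}
```

In a real job the same behavior comes from the Hadoop API itself, roughly `job.setInputFormatClass(NLineInputFormat.class)` plus `NLineInputFormat.setNumLinesPerSplit(job, 1)`, so the number of mappers is driven by the number of lines in the metadata file rather than by HDFS block boundaries.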
> Is it a good solution for you? You have to judge that yourself. Keep in
> mind the following:
> 1) Normally, the above approach is good for an MR job that loads data from
> a third party system, for CPU-intensive jobs.
> 2) You do utilize the cluster: if you have 100 mapper tasks and 100
> files to be processed, you get pretty good concurrency.
> But also consider the drawbacks:
> 1) Is your data equally split across the third party system?
> In the above example, for 100 files (or chunks of data), if one file is
> 10G and the rest are only 100M, then one mapper will take MUCH longer than
> the rest. You will have a long-tail problem, and it will hurt overall
> performance.
> 2) There is NO data locality advantage compared to HDFS. All the mappers
> need to load the data from the third party system remotely.
> 3) If each file (or chunk of data) is very large, what about failover? For
> example, if you have 100 mapper task slots but only 20 files, with 10G of
> data each, then you under-utilize your cluster resources: only 20 mappers
> will handle them, and the other 80 mapper slots will just sit idle. More
> importantly, if one mapper fails, all the data it already processed has to
> be discarded; another mapper has to restart that chunk of data from the
> beginning. Your overall performance suffers.
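To put rough numbers on the long-tail point, here is a toy calculation. The file sizes come from the example above; the per-mapper throughput of 100 MB/min is an assumption purely for illustration:

```java
public class LongTailDemo {
    // With one mapper per file and all mappers running concurrently, the
    // job's wall-clock time is set by the slowest mapper, assuming each
    // mapper's runtime is proportional to its input size.
    static double wallClockMinutes(long[] fileSizesMb, double mbPerMinute) {
        double slowest = 0;
        for (long size : fileSizesMb) {
            slowest = Math.max(slowest, size / mbPerMinute);
        }
        return slowest;
    }

    public static void main(String[] args) {
        // 100 files: one of 10 GB (10240 MB), 99 of 100 MB each.
        long[] sizes = new long[100];
        sizes[0] = 10240;
        for (int i = 1; i < 100; i++) sizes[i] = 100;

        double mbPerMinute = 100; // assumed per-mapper throughput
        // 99 mappers finish in about 1 minute, but the job still waits
        // roughly 102 minutes for the one mapper stuck with the 10 GB file.
        System.out.printf("job takes %.1f minutes%n",
            wallClockMinutes(sizes, mbPerMinute));
    }
}
```

The imbalance, not the total data volume, dominates: the same 20 GB spread evenly over 100 files would finish in about 2 minutes under the same assumptions.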
> As you can see, HDFS gives you a lot of benefits, and here you lose all of
> them. Sometimes you have no other choice but to load the data on the fly
> from some 3rd party system. But think through the points above, and try to
> recover as many of the benefits HDFS would give you from the 3rd party
> system, if you can.
> Date: Fri, 10 Jan 2014 01:21:19 -0800
> Subject: Reading multiple input files.
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> How does an MR job read multiple input files from different locations?
> What if the input files are not in HDFS and are located on different
> servers? Do I have to copy them to HDFS first and instruct my MR job to
> read from there? Or can I instruct my MR job to read directly from those
> servers?