MapReduce, mail # user - question on FileInputFormat.addInputPath and data access

Kartashov, Andy 2012-10-24, 14:23
Two questions:

1. Say you have 5 folders with input data (fold1, fold2, fold3, ..., fold5) in your HDFS on a pseudo-distributed cluster.
You would write your MR job to access the files by listing the folders in:
FileInputFormat.addInputPaths(job, "fold1, fold2, fold3...,fold5");
Q: Is there a way to move the above folders into a parent folder, say "the_folder", so that the directory structure becomes the_folder/fold1, the_folder/fold2, ...? Would it then be possible to access the files with something like FileInputFormat.addInputPaths(job, "the_folder/*"); or similar?
I am asking in case the list of input folders grows too long. How can that be curbed?
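
To make the two options in question 1 concrete, here is a minimal sketch in plain Java. The InputPaths class and its methods are hypothetical helpers, not part of Hadoop. They only build the string that would be handed to FileInputFormat.addInputPaths. The glob variant assumes that the input format expands glob patterns in input paths, which as far as I know it does through FileSystem.globStatus.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: builds the path string that
// would be passed to FileInputFormat.addInputPaths(job, ...).
final class InputPaths {
    private InputPaths() {}

    // Returns parent/prefix1,parent/prefix2,...,parent/prefixN.
    // The entries are joined with commas only, so no stray spaces
    // end up inside the path names.
    static String numbered(String parent, String prefix, int count) {
        List<String> paths = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            paths.add(parent + "/" + prefix + i);
        }
        return String.join(",", paths);
    }

    // Returns a glob pattern covering every direct child of the parent
    // folder (assumption: Hadoop expands the glob when listing input).
    static String allChildren(String parent) {
        return parent + "/*";
    }
}
```

With this, the call would look like FileInputFormat.addInputPaths(job, InputPaths.numbered("the_folder", "fold", 5)); or, for the glob form, FileInputFormat.addInputPaths(job, InputPaths.allChildren("the_folder"));. The glob form keeps the call the same length no matter how many folders are added under the_folder.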

2. Hypothetically speaking, in a fully distributed cluster, suppose your data folders are located as follows: Node1: (fold1, fold2, fold3) and Node2: (fold4, fold5).

Q: Do we need to change the command below, or will the NN and JT take care of locating those files?
FileInputFormat.addInputPaths(job, "fold1, fold2, fold3...,fold5");
2a. When using the data balancer, which splits the input and moves data across the additional DNs listed in conf/slaves, is it possible to run the "hdfs dfs -ls -r" command on a slave node that runs a DN on a separate machine? I have



