> I have some basic doubt on hadoop Input Data placement...
> Like, If i input some 30GB of data to hadoop program , it will place the
> 30gb into HDFS into some set of files based on some input formats..
Conceptually, it would be more accurate to say that it splits the data
into 'blocks' that are managed in HDFS. Of course,
implementation-wise, these blocks do get stored as physical files on
the local filesystems of the DataNodes.
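To get a feel for the scale, here's a back-of-the-envelope sketch (assuming the old default block size of 64 MB; the real value is configurable per cluster, so treat the numbers as illustrative):

```python
# Rough sketch: how many HDFS blocks a 30 GB input occupies.
# Assumes a 64 MB block size; real clusters may configure
# 128 MB or more, which would shrink the count accordingly.
BLOCK_SIZE_MB = 64
input_size_mb = 30 * 1024  # 30 GB expressed in MB

# Ceiling division: a partially filled last block still counts.
num_blocks = -(-input_size_mb // BLOCK_SIZE_MB)
print(num_blocks)  # 480 blocks for 30 GB at 64 MB each
```

Each of those blocks is replicated (3x by default) across DataNodes, which is why the NameNode tracks block locations rather than whole files.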
> I have 2 doubts here ..
> 1. Each time i run a program 30GB is placed into HDFS or how its going to
> run..
What program? Are you talking about this 30GB as input to the program
or as output from it? Assuming Map/Reduce input, the answer is, in
general, no. A typical M/R program takes an input path on DFS, and this
can point to data that has already been copied to DFS, independent of
any particular program run.
> 2. Again if i want to run some other program on another 100Gb of data, where
> the above stated data and program is different. Then the previous 30GB is
> erased in HDFS or how its going to run..
Given that the program and its input are independent, the program will
not modify any existing input data. In fact, most Map/Reduce applications
do not overwrite existing output either; rather, they will refuse to
start if the output directory already exists.
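That refusal is easy to mimic. The sketch below is plain Python, not the actual Hadoop source; it just illustrates the kind of guard the framework applies to a job's output path before starting:

```python
import os
import tempfile

def check_output_spec(output_dir: str) -> None:
    """Refuse to start if the output directory already exists,
    mirroring the guard Map/Reduce applies to its output path."""
    if os.path.exists(output_dir):
        raise FileExistsError(f"Output directory {output_dir} already exists")

# Demo: the first "run" passes; a rerun against the same path fails.
with tempfile.TemporaryDirectory() as base:
    out = os.path.join(base, "job-output")
    check_output_spec(out)       # fine: directory not there yet
    os.makedirs(out)             # pretend the first job wrote here
    try:
        check_output_spec(out)   # second run refuses to start
    except FileExistsError as e:
        print("refused:", e)
```

In practice this means you either delete the old output directory before rerunning a job, or point the rerun at a fresh path.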