Hadoop >> mail # dev >> Basic Hadoop Doubt


Re: Basic Hadoop Doubt
Vamsi,

>
> I have a basic doubt about Hadoop input data placement...
>
> For example, if I input some 30GB of data to a Hadoop program, it will
> place the 30GB into HDFS as some set of files based on some input format..

Conceptually, it would be more accurate to say that HDFS splits the data
into 'blocks' that it manages. Implementation-wise, of course, these
blocks are stored as physical files on the datanodes.
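As an aside, the block arithmetic is easy to sketch. A minimal Python example, assuming the old 64 MB default HDFS block size (real clusters are often configured with 128 MB or more), so this is illustrative arithmetic rather than an HDFS API call:

```python
# Sketch: how many HDFS blocks a 30 GB file occupies.
# Assumes a 64 MB block size (the old Hadoop default); this is
# illustrative arithmetic, not actual Hadoop code.
import math

BLOCK_SIZE = 64 * 1024 * 1024          # 64 MB in bytes
file_size = 30 * 1024 * 1024 * 1024    # 30 GB in bytes

# The last block may be partially filled, hence the ceiling.
num_blocks = math.ceil(file_size / BLOCK_SIZE)
print(num_blocks)  # 480 blocks for exactly 30 GB at 64 MB each
```

Each of those blocks is then replicated (three copies by default) across the datanodes.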

>
> I have two doubts here:
>
> 1. Is the 30GB placed into HDFS each time I run the program, or how does
> it work?

What program? Are you talking about this 30GB as input to the program
or as output from it? Assuming you mean Map/Reduce input, the answer
is, in general, no. A typical M/R program takes an input path on DFS,
and this can point to data that has already been copied to DFS,
independent of the program itself.

> 2. Again, if I want to run some other program on another 100GB of data,
> where both the data and the program are different from the above, is the
> previous 30GB erased from HDFS, or how does it run?
>

Given that a program and its input are independent, the program will
not modify any existing data. In fact, most Map/Reduce applications do
not overwrite output data either; rather, they will refuse to start
if the output directory already exists.
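That "refuse to start" behaviour (in Hadoop, an output check that throws a FileAlreadyExistsException) can be mimicked in a few lines. This is a hypothetical Python stand-in for illustration, not actual Hadoop code:

```python
# Sketch of the "refuse to start if the output directory exists"
# behaviour typical of Map/Reduce jobs. Hypothetical stand-in using
# the local filesystem, not actual Hadoop code.
import os

def check_output_spec(output_dir: str) -> None:
    """Raise if the output directory already exists, mirroring the
    pre-flight check a Map/Reduce job performs before starting."""
    if os.path.exists(output_dir):
        raise FileExistsError(
            f"Output directory {output_dir} already exists")

# A job pointed at a fresh directory passes the check silently:
check_output_spec("/tmp/job-output-does-not-exist-12345")
```

The upshot is that rerunning a job with the same output path fails fast, so existing results are never silently clobbered; you either delete the old output or pick a new path.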

Thanks
Hemanth