

Re: Hadoop setup doubts
Hi,

> 2.       How does log aggregation work?
>
http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/
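
In short: with log aggregation enabled, the NodeManagers upload the
per-container logs to HDFS once an application finishes, and you read them
through the yarn CLI instead of digging through each node. A minimal
yarn-site.xml sketch (the remote log dir shown is just the usual default,
adjust to your setup):

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/tmp/logs</value>
</property>

After a job finishes, something like "yarn logs -applicationId <application id>"
prints the aggregated logs.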

> 4.       What is the purpose of the webproxy? Is it really required?
>
http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html
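The proxy sits in front of the ApplicationMaster web UIs to reduce the risk of
web attacks through the AM (it strips cookies and warns users about the
untrusted page). It is not a hard requirement to run it separately: by default
it runs embedded in the ResourceManager. If you do want it on its own host,
give it an address in yarn-site.xml (hostname and port below are placeholders)
and start the proxyserver daemon with something like
"$HADOOP_YARN_HOME/sbin/yarn-daemon.sh start proxyserver":

<property>
  <name>yarn.web-proxy.address</name>
  <value>proxyhost.example.com:8089</value>
</property>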
> 5.       Is there any documentation on how to decide which scheduler type
> based on certain parameters?
>
I am not sure if I fully understand the question.
You can use only one scheduler at a time. At run time, you can
decide which pool or queue your job should be submitted to, if you use
the Fair or Capacity Scheduler.
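
In config terms: the scheduler is picked once, cluster-wide, in yarn-site.xml,
e.g. for the Fair Scheduler (the Capacity Scheduler is the default in Apache
Hadoop 2.x):

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

and the queue/pool is then chosen per job, e.g. by passing
-D mapreduce.job.queuename=myqueue at submit time (the queue name is made up
here; it has to exist in your scheduler configuration).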

> 6.       What is the recommended way of pushing data into a Hadoop cluster
> & submitting MapReduce jobs, i.e. should we use another client node, and if
> so is there any client daemon to run on it?
>
> ---- Do you have experience with UNIX? If so, Hadoop commands are similar
> to UNIX commands. E.g. the command below works fine for me.
>
> hdfs dfs -copyFromLocal <localfiledir> <hdfs file directory>
>
Usually, we push data to the cluster and submit MapReduce jobs from machines
called "edge nodes". In Hadoop, an edge node is a machine where the Hadoop
client libraries are installed (plus Pig, Hive, Sqoop etc., if you want to use
them), but no Hadoop daemon is running.
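
For example, on such an edge node the whole workflow is just client commands
(the paths and jar name below are made up):

# push input data from the edge node's local disk into HDFS
hdfs dfs -mkdir -p /user/indranil/input
hdfs dfs -copyFromLocal /data/local/file.txt /user/indranil/input/

# submit a MapReduce job from the same machine
hadoop jar my-mr-job.jar com.example.MyJob /user/indranil/input /user/indranil/output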

Hope this helps a bit!
> On Sat, Dec 14, 2013 at 4:03 PM, Indranil Majumder (imajumde) <
> [EMAIL PROTECTED]> wrote:
>
>>  I started with Hadoop a few days ago; I do have a few doubts about the setup:
>>
>>
>>
>> 1.       For the name node I format the name directory; is it recommended
>> to do the same for the data node directories too?
>>
>> 2.       How does log aggregation work?
>>
>> 3.       Does the resource manager run on every node (both Name and Data) or
>> can it run as a separate node?
>>
>> 4.       What is the purpose of the webproxy? Is it really required?
>>
>> 5.       Is there any documentation on how to decide which scheduler
>> type based on certain parameters?
>>
>> 6.       What is the recommended way of pushing data into a Hadoop
>> cluster & submitting MapReduce jobs, i.e. should we use another client node,
>> and if so is there any client daemon to run on it?
>>
>> 7.       For the following nodes in clustered mode
>>
>> A.      NameNode
>>
>> B.      Secondary NameNode
>>
>> C.      DataNode (2)
>>
>> D.      Resource Manager
>>
>> E.       WebProxy
>>
>> F.       History Server( Map Reduce )
>>
>> I want to write a PID monitor. Does anybody have the list of processes
>> that would run on these clusters when fully operational? [maybe the output of
>> ps -ef | grep "somekeyword" will do]
>>
>>
>>
>> Thanks & Regards,
>>
>> Indranil
>>
>
>
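
Regarding question 7 (the process list): I don't have an exact list handy, but
for the layout you describe you would roughly expect one JVM per daemon, i.e.
NameNode, SecondaryNameNode, DataNode and NodeManager (on each worker),
ResourceManager, WebAppProxyServer and JobHistoryServer. The quickest check is
jps rather than ps, e.g.:

# run jps as the user the daemons run as; sample output (PIDs made up):
#   12345 DataNode
#   12346 NodeManager
jps

# or with ps, the daemon's main class shows up in the java command line:
ps -ef | grep java | grep -i -e namenode -e datanode -e resourcemanager -e nodemanager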