HDFS, mail # user - RE: spawn maps without any input data - hadoop streaming


RE: spawn maps without any input data - hadoop streaming
Devaraj k 2013-07-17, 03:30
Hi Austin,

The number of maps for a job depends on the splits returned by the InputFormat.getSplits() API. You can write an input format that decides the number of maps for a job (through the number of splits it returns) according to your needs.

If you use FileInputFormat, the number of splits depends on the input data for the job, which is why the number of mappers you see is proportional to the job's input size.

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/InputFormat.html#getSplits(org.apache.hadoop.mapreduce.JobContext)
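For a streaming job, one possible way to act on this without writing a custom InputFormat is to feed the job a small control file with one line per desired map and let a line-based input format turn each line into its own split. The following is only a minimal sketch under that assumption: the helper names (make_control_lines, parse_seed, generate_records, run_mapper) are hypothetical, the control-file approach is not something proposed in this thread, and the record format (one random integer per line) is just an illustration.

```python
import random


def make_control_lines(num_maps, base_seed=0):
    """Return one line per desired map task; each line carries that task's seed."""
    return [str(base_seed + i) for i in range(num_maps)]


def parse_seed(line):
    """Extract the seed from one mapper input line.

    Assumption: depending on the input format, streaming may hand the
    mapper "key<TAB>value" or just the value, so take the last field.
    """
    return int(line.rstrip("\n").split("\t")[-1])


def generate_records(seed, count):
    """Yield `count` pseudo-random integers, reproducible for a given seed."""
    rng = random.Random(seed)
    for _ in range(count):
        yield rng.randint(0, 10**9)


def run_mapper(lines, write, records_per_line=1000):
    """Mapper body: emit random records for every control line received."""
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines in the control file
        for value in generate_records(parse_seed(line), records_per_line):
            write(f"{value}\n")


# In the actual mapper script one would call, for example:
#   import sys
#   run_mapper(sys.stdin, sys.stdout.write)
```

The control file would be written once from make_control_lines(N) and uploaded as the job input. If the job is then run with a line-based input format such as org.apache.hadoop.mapred.lib.NLineInputFormat (passed through the streaming -inputformat option), each line should become one split and therefore one map. Please check this against the Hadoop version in use, since the number of lines per split is configurable.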

Thanks
Devaraj k

From: Austin Chungath [mailto:[EMAIL PROTECTED]]
Sent: 16 July 2013 14:40
To: [EMAIL PROTECTED]
Subject: spawn maps without any input data - hadoop streaming

Hi,

I am trying to generate random data using Hadoop streaming and Python. It's a map-only job and I need to run a number of maps. There is no input to the maps, as they are just going to generate random data.

How do I specify the number of maps to run? (I am confused here because, if I am not wrong, the number of maps spawned is related to the input data size.)
Any ideas as to how this can be done?

Warm regards,
Austin