Re: Input split for a streaming job!
Hi Raj
          Is your streaming job using WholeFileInputFormat or some custom
InputFormat that reads each file as a whole? If that is the case, this is
the expected behavior.
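Something along these lines (just a rough sketch; the class name is made up,
and it assumes the old org.apache.hadoop.mapred API that streaming jobs use)
is how such a format suppresses splitting, which is why you end up with
exactly one mapper per input file:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// A text input format that never splits its files: each input file becomes
// exactly one InputSplit, and therefore one map task, regardless of block size.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;   // never split, one mapper per file
    }
}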
        Also, you mentioned you changed dfs.block.size to 32 MB. AFAIK this
value applies only to files newly written into HDFS; the existing files keep
the block size they were written with. Also, to test such scenarios you don't
need to change the block size at the cluster level; you can specify it per
file while copying into HDFS, e.g.
hadoop dfs -D dfs.block.size=16777216 -copyFromLocal /src/file /dest/file
Did you copy the files that way, and does the number of mappers still stay the same?
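You can confirm what block size the re-copied files actually got, either with
hadoop fsck /dest/file -files -blocks or programmatically. A rough sketch
(standard FileSystem API; the class name and path argument are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Print the block size and length HDFS reports for a given file, to verify
// whether the re-uploaded copies really picked up the smaller block size.
public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        System.out.println(args[0] + " blockSize=" + status.getBlockSize()
                + " length=" + status.getLen());
    }
}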

AFAIK bzip2 is splittable. Please correct me if I'm wrong.
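If you want to double-check what the framework thinks of a given file, here is
a rough sketch (it assumes a Hadoop version that ships
SplittableCompressionCodec, e.g. 0.21+; the class name and path argument are
just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

// Resolve the compression codec for a file by its extension and report
// whether it can be split. Gzip fails this check; bzip2 passes it.
public class SplittableCheck {
    public static void main(String[] args) {
        CompressionCodecFactory factory =
                new CompressionCodecFactory(new Configuration());
        Path path = new Path(args[0]);   // e.g. /data/part-000.bz2
        CompressionCodec codec = factory.getCodec(path);
        if (codec == null) {
            System.out.println(path + ": no codec (uncompressed, splittable)");
        } else {
            System.out.println(path + ": " + codec.getClass().getSimpleName()
                    + " splittable=" + (codec instanceof SplittableCompressionCodec));
        }
    }
}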

On Fri, Nov 11, 2011 at 2:07 PM, Anirudh Jhina <[EMAIL PROTECTED]> wrote:

> Raj,
>
> What InputFormat are you using? Gzip-compressed files are not splittable, so
> if you have 73 gzip files, there will be 73 corresponding mappers, one per
> file. Look at the TextInputFormat.isSplitable() description.
>
> Thanks,
> ~Anirudh
>
> On Thu, Nov 10, 2011 at 2:40 PM, Raj V <[EMAIL PROTECTED]> wrote:
>
> > All
> >
> > I assumed that the input splits for a streaming job would follow the same
> > logic as a Java MapReduce job, but I seem to be wrong.
> >
> > I started out with 73 gzipped files that vary between 23 MB and 255 MB in
> > size. My default block size was 128 MB. 8 of the 73 files are larger than
> > 128 MB.
> >
> > When I ran my streaming job, it ran, as expected, with 73 mappers (no
> > reducers for this job).
> >
> > Since I have 128 nodes in my cluster, I thought I would use more systems
> > in the cluster by increasing the number of mappers. I changed all the
> > gzip files into bzip2 files. I expected the number of mappers to increase
> > to 81. The mappers remained at 73.
> >
> > I tried a second experiment: I changed my dfs.block.size to 32 MB. That
> > should have increased my mappers to about 250. It remained steadfast at
> > 73.
> >
> > Is my understanding wrong? With a smaller block size and bzipped files,
> > should I not get more mappers?
> >
> > Raj
>