Re: How to create Output files of about fixed size
Hi JJ
       If you use the default TextInputFormat, it won't do the job, as it
generates at least one split for each file. In your case there would be a
minimum of 78 splits, since there are that many input files, hence 78
mappers and the same 78 output files. You need to use CombineFileInputFormat
to pack multiple files into a single split. You also need to set
mapred.max.split.size to the required size of the output files.

So in short, if you require 1 GB output files, your aggregation map-only job
should contain the following arguments:
-D mapred.input.format.class=org.apache. .... .CombineFileInputFormat
-D mapred.max.split.size=1073741824
-D mapred.reduce.tasks=0
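For reference, the same idea expressed as a driver against the newer mapreduce API might look like the sketch below. This is untested wiring, not a drop-in program: it assumes Hadoop 2.x, where the concrete CombineTextInputFormat subclass is available, and the class name, mapper name, and paths are placeholders.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompactFiles {

  // Pass each line through unchanged, dropping the byte-offset key so
  // TextOutputFormat writes the line only (NullWritable keys are omitted).
  public static class PassThrough
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(NullWritable.get(), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "compact-small-files");
    job.setJarByClass(CompactFiles.class);

    // Pack many small files into ~1 GB splits; since the job is map-only,
    // each split becomes one output file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    FileInputFormat.setMaxInputSplitSize(job, 1024L * 1024 * 1024);
    job.setMapperClass(PassThrough.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```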

Hope it helps!

Regards
Bejoy.K.S

On Wed, Dec 21, 2011 at 7:15 AM, Mapred Learn <[EMAIL PROTECTED]> wrote:

> Hi Shevek/others,
>
> I tried this.
>
> First job created about 78 files of each 15 MB size.
>
> I tried a second map-only job with IdentityMapper and
> -Dmapred.min.split.size=1073741824, but it did not cause the output files
> to be 1 GB each; I got the same output as above, i.e. 78 files of 15 MB each.
>
> Is there a way to combine these files into files of about 1 GB each?
>
> Thanks,
> -JJ
>
> On Fri, Oct 28, 2011 at 9:53 AM, Shevek <[EMAIL PROTECTED]> wrote:
>
> > If you run it as a pure map job, it will do it per split. If you run it
> > as a single reducer job, it will do it overall. However, one starts to
> > suspect that by the time you've paid that extra cost, you might as well
> > reconsider your downstream process and the reason for this subdivision.
> >
> > S.
> >
> > On 27 October 2011 23:07, Mapred Learn <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Shevek,
> > > Thanks for the explanation!
> > >
> > > Can you point me to some documentation for specifying size in an
> > > output format?
> > >
> > > If I say the size is 200 MB, then after 200 MB, would it do this per
> > > split or overall? I mean, would I end up with a 200 MB and a 50 MB file
> > > from the 1st mapper and then, say, 200 MB and 10 MB from the 2nd mapper,
> > > and so on? Or will I get 200 MB files only?
> > >
> > >
> > >
> > > On Wed, Oct 26, 2011 at 10:48 AM, Shevek <[EMAIL PROTECTED]> wrote:
> > >
> > > > You can control the input to a computer program, but not
> > > > (arbitrarily) how much output it generates. The only way to generate
> > > > output files of a fixed size is to write a custom output format which
> > > > shifts to a new filename every time that size is exceeded, but you
> > > > will still get some small bits left over. The plumbing in this is
> > > > pretty ugly, and I would not recommend it casually.
> > > >
> > > > You may be able to write a second map-only job which reprocesses the
> > > > output from the first job in chunks of X bytes, and just writes them
> > > > out. Use an IdentityMapper and set the split size. I have not tried
> > > > this at home.
> > > >
> > > > S.
> > > >
> > > > On 26 October 2011 07:03, Mapred Learn <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > > Hi,
> > > > > > I am trying to create output files of fixed size by using:
> > > > > > -Dmapred.max.split.size=6442450812 (~6 GB)
> > > > > >
> > > > > > But the problem is that the input data size and metadata vary,
> > > > > > and I have to adjust the above value manually to achieve a fixed
> > > > > > size.
> > > > > >
> > > > > > Is there a way I can programmatically determine a split size that
> > > > > > would yield fixed-size output files, e.g. 200 MB each?
> > > > > >
> > > > > > Thanks,
> > > > > > JJ
> > > > >
> > > >
> > >
> >
>
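On the "programmatically determine split size" question raised in the thread: one simple approach is to sum the total input size up front and derive mapred.max.split.size from the desired output-file size, rounding so the splits come out roughly equal. A sketch of the arithmetic in plain Java, with illustrative class and method names (the 78 × 15 MB figures come from earlier in this thread):

```java
public class SplitSizeCalc {

  /**
   * Given the total input size and a target output-file size, return a
   * max split size that spreads the data into roughly equal splits, none
   * much larger than the target.
   */
  public static long splitSizeFor(long totalBytes, long targetBytes) {
    long files = (totalBytes + targetBytes - 1) / targetBytes; // ceil division
    return (totalBytes + files - 1) / files;                   // ceil division
  }

  public static void main(String[] args) {
    long total = 78L * 15 * 1024 * 1024;  // 78 files of 15 MB each, as above
    long target = 200L * 1024 * 1024;     // desired ~200 MB output files
    // 1170 MB splits into ceil(1170/200) = 6 files of exactly 195 MB each
    System.out.println(splitSizeFor(total, target));
  }
}
```

In a real driver, the total could come from FileSystem.getContentSummary on the input directory, and the result would be set as mapred.max.split.size before submitting the job.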