Merging files (Pig user mailing list)


Something Something 2013-07-31, 05:26
Ben Juhn 2013-07-31, 05:34
Something Something 2013-07-31, 06:40
John Meagher 2013-07-31, 13:28
Something Something 2013-07-31, 16:21
j.barrett Strausser 2013-07-31, 16:42
John Meagher 2013-07-31, 17:28
Something Something 2013-07-31, 20:39

That is what I was suggesting yes.
On Wed, Jul 31, 2013 at 4:39 PM, Something Something <[EMAIL PROTECTED]> wrote:

> So you are saying, we will first do a 'hadoop count' to get the total # of
> bytes for all files.  Let's say that comes to:  1538684305
>
> Default Block Size is:  128M
>
> So, total # of blocks needed:  1538684305 / 131072 = 11740
>
> Max file blocks = 11740 / 50 (# of output files) = 234
>
> Does this calculation look right?
>
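
For reference, a minimal sketch of the arithmetic above in plain Java (the class and variable names are illustrative, not from the thread). One caveat: a 128M block is 134217728 bytes, while the quoted figures divide by 131072, which is 128K.

    public class MaxFileBlocksEstimate {
        public static void main(String[] args) {
            long totalBytes = 1538684305L;             // total bytes reported by 'hadoop count'
            long blockSizeBytes = 128L * 1024 * 1024;  // 128M block = 134217728 bytes
            int outputFiles = 50;                      // desired number of merged files

            // Ceiling division: how many HDFS blocks the data occupies in total.
            long totalBlocks = (totalBytes + blockSizeBytes - 1) / blockSizeBytes;

            // Blocks allowed per merged file, i.e. the kind of value fed to a
            // max-file-blocks style setting (see the reply below).
            long maxFileBlocks = Math.max(1, totalBlocks / outputFiles);

            System.out.println("total blocks:    " + totalBlocks);
            System.out.println("max file blocks: " + maxFileBlocks);
        }
    }

With these inputs the data is only about 1.5 GB, so 50 merged files come out to roughly 30 MB each, well under one 128M block per file.
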
> On Wed, Jul 31, 2013 at 10:28 AM, John Meagher <[EMAIL PROTECTED]> wrote:
>
> > It is file size based, not file count based.  For fewer files up the
> > max-file-blocks setting.
> >
> > On Wed, Jul 31, 2013 at 12:21 PM, Something Something
> > <[EMAIL PROTECTED]> wrote:
> > > Thanks, John.  But I don't see an option to specify the # of output files.
> > > How does Crush decide how many files to create?  Is it only based on file
> > > sizes?
> > >
> > > On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <[EMAIL PROTECTED]> wrote:
> > >
> > >> Here's a great tool for handling exactly that case:
> > >> https://github.com/edwardcapriolo/filecrush
> > >>
> > >> On Wed, Jul 31, 2013 at 2:40 AM, Something Something
> > >> <[EMAIL PROTECTED]> wrote:
> > >> > Each bz2 file after merging is about 50Megs.  The reducers take about 9
> > >> > minutes.
> > >> >
> > >> > Note:  'getmerge' is not an option.  There isn't enough disk space to do a
> > >> > getmerge on the local production box.  Plus we need a scalable solution as
> > >> > these files will get a lot bigger soon.
> > >> >
> > >> > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <[EMAIL PROTECTED]> wrote:
> > >> >
> > >> >> How big are your 50 files?  How long are the reducers taking?
> > >> >>
> > >> >> On Jul 30, 2013, at 10:26 PM, Something Something <[EMAIL PROTECTED]> wrote:
> > >> >>
> > >> >> > Hello,
> > >> >> >
> > >> >> > One of our pig scripts creates over 500 small part files.  To save on
> > >> >> > namespace, we need to cut down the # of files, so instead of saving 500
> > >> >> > small files we need to merge them into 50.  We tried the following:
> > >> >> >
> > >> >> > 1)  When we set parallel number to 50, the Pig script takes a long time -
> > >> >> > for obvious reasons.
> > >> >> > 2)  If we use Hadoop Streaming, it puts some garbage values into the key
> > >> >> > field.
> > >> >> > 3)  We wrote our own Map Reduce program that reads these 500 small part
> > >> >> > files & uses 50 reducers.  Basically, the Mappers simply write the line &
> > >> >> > reducers loop thru values & write them out.  We set
> > >> >> > job.setOutputKeyClass(NullWritable.class) so that the key is not written
> > >> >> > to the output file.  This is performing better than Pig.  Actually Mappers
> > >> >> > run very fast, but Reducers take some time to complete, but this approach
> > >> >> > seems to be working well.
> > >> >> >
> > >> >> > Is there a better way to do this?  What strategy can you think of to
> > >> >> > increase speed of reducers?
> > >> >> >
> > >> >> > Any help in this regard will be greatly appreciated.  Thanks.

--
https://github.com/bearrito
@deepbearrito
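
A minimal sketch of the merge job described in approach 3 of the quoted post, assuming plain-text input: an identity-style mapper, 50 reducers, and a NullWritable output key so only the line text is written. The class names and paths are illustrative, not taken from the thread.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MergeSmallFiles {

        // Mapper: pass each input line through unchanged, using the line itself
        // as the shuffle key so the data gets spread across the reducers.
        public static class LineMapper
                extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                context.write(line, NullWritable.get());
            }
        }

        // Reducer: loop through the values and write the line once per
        // occurrence so duplicate lines are preserved.
        public static class LineReducer
                extends Reducer<Text, NullWritable, NullWritable, Text> {
            @Override
            protected void reduce(Text line, Iterable<NullWritable> values, Context context)
                    throws IOException, InterruptedException {
                for (NullWritable ignored : values) {
                    context.write(NullWritable.get(), line);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "merge small part files");
            job.setJarByClass(MergeSmallFiles.class);

            job.setMapperClass(LineMapper.class);
            job.setReducerClass(LineReducer.class);
            job.setNumReduceTasks(50);                   // one merged output file per reducer

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(NullWritable.class);
            job.setOutputKeyClass(NullWritable.class);   // key is not written to the output
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. the 500 part files
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // 50 merged files

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

One design note: because each line is used as the shuffle key, lines within each merged file come out sorted rather than in their original order; if ordering matters, a different key (for example, a hashed or random integer) would be needed.
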
Hailey Charlie 2013-07-31, 05:32