Pig user mailing list - Merging files


Something Something 2013-07-31, 05:26
Hailey Charlie 2013-07-31, 05:32
Ben Juhn 2013-07-31, 05:34
Something Something 2013-07-31, 06:40
John Meagher 2013-07-31, 13:28
Something Something 2013-07-31, 16:21
j.barrett Strausser 2013-07-31, 16:42
John Meagher 2013-07-31, 17:28
Something Something 2013-07-31, 20:39

Re: Merging files
j.barrett Strausser 2013-07-31, 21:01
That is what I was suggesting yes.
On Wed, Jul 31, 2013 at 4:39 PM, Something Something <[EMAIL PROTECTED]> wrote:

> So you are saying, we will first do a 'hadoop count' to get the total # of
> bytes for all files.  Let's say that comes to:  1538684305
>
> Default Block Size is:  128M
>
> So, total # of blocks needed:  1538684305 / 131072 = 11740
>
> Max file blocks = 11740 / 50 (# of output files) = 234
>
> Does this calculation look right?
>
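A minimal Java sketch of this sizing arithmetic (not from the original thread), assuming the "128M" above means a 128 MB HDFS block, which is 134,217,728 bytes; the 131,072 used as the divisor in the quoted calculation is 128 KB, which is why it yields 11740 blocks rather than about 12. The numbers, names, and the resulting max-file-blocks value below are illustrative only:

    // Editorial sketch: sizing a target of 50 merged output files.
    public class CrushSizing {
      public static void main(String[] args) {
        long totalBytes  = 1538684305L;      // total size reported by 'hadoop count'
        long blockSize   = 134217728L;       // 128 MB HDFS block; 131072 bytes is only 128 KB
        int  targetFiles = 50;               // desired number of merged files

        long totalBlocks   = (totalBytes + blockSize - 1) / blockSize;  // ceiling division -> 12
        long maxFileBlocks = Math.max(1, totalBlocks / targetFiles);    // -> 1 with these numbers

        // With these inputs each of the 50 output files would be roughly 30 MB,
        // i.e. well under one 128 MB block.
        System.out.println("blocks=" + totalBlocks + ", max-file-blocks=" + maxFileBlocks);
      }
    }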
> On Wed, Jul 31, 2013 at 10:28 AM, John Meagher <[EMAIL PROTECTED]> wrote:
>
> > It is file size based, not file count based.  For fewer files up the
> > max-file-blocks setting.
> >
> > On Wed, Jul 31, 2013 at 12:21 PM, Something Something
> > <[EMAIL PROTECTED]> wrote:
> > > Thanks, John.  But I don't see an option to specify the # of output files.
> > > How does Crush decide how many files to create?  Is it only based on file
> > > sizes?
> > >
> > > On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <[EMAIL PROTECTED]> wrote:
> > >
> > >> Here's a great tool for handling exactly that case:
> > >> https://github.com/edwardcapriolo/filecrush
> > >>
> > >> On Wed, Jul 31, 2013 at 2:40 AM, Something Something
> > >> <[EMAIL PROTECTED]> wrote:
> > >> > Each bz2 file after merging is about 50Megs.  The reducers take about 9
> > >> > minutes.
> > >> >
> > >> > Note:  'getmerge' is not an option.  There isn't enough disk space to do a
> > >> > getmerge on the local production box.  Plus we need a scalable solution as
> > >> > these files will get a lot bigger soon.
> > >> >
> > >> > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <[EMAIL PROTECTED]> wrote:
> > >> >
> > >> >> How big are your 50 files?  How long are the reducers taking?
> > >> >>
> > >> >> On Jul 30, 2013, at 10:26 PM, Something Something <[EMAIL PROTECTED]> wrote:
> > >> >>
> > >> >> > Hello,
> > >> >> >
> > >> >> > One of our pig scripts creates over 500 small part files.  To save on
> > >> >> > namespace, we need to cut down the # of files, so instead of saving 500
> > >> >> > small files we need to merge them into 50.  We tried the following:
> > >> >> >
> > >> >> > 1)  When we set parallel number to 50, the Pig script takes a long
> > >> >> > time - for obvious reasons.
> > >> >> > 2)  If we use Hadoop Streaming, it puts some garbage values into the
> > >> >> > key field.
> > >> >> > 3)  We wrote our own Map Reducer program that reads these 500 small
> > >> >> > part files & uses 50 reducers.  Basically, the Mappers simply write the
> > >> >> > line & reducers loop thru values & write them out.  We set
> > >> >> > job.setOutputKeyClass(NullWritable.class) so that the key is not
> > >> >> > written to the output file.  This is performing better than Pig.
> > >> >> > Actually Mappers run very fast, but Reducers take some time to
> > >> >> > complete, but this approach seems to be working well.
> > >> >> >
> > >> >> > Is there a better way to do this?  What strategy can you think of to
> > >> >> > increase speed of reducers.
> > >> >> >
> > >> >> > Any help in this regard will be greatly appreciated.  Thanks.
> > >> >>
> > >> >>
> > >>
> >
>

--
https://github.com/bearrito
@deepbearrito
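
Below is a minimal sketch, not from the thread, of the kind of merge job described in point 3) of the original message: mappers that emit each input line unchanged, 50 reducers that write the lines back out, and NullWritable as the output key so only the line text is written. It assumes the Hadoop 2.x org.apache.hadoop.mapreduce API and bzip2-compressed output as mentioned above; the class and job names are illustrative. Note that keying on the line itself groups and sorts identical lines, so input order is not preserved.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MergeSmallFiles {

      // Mapper: pass every input line through unchanged, keyed on the line itself
      // so the shuffle spreads records across the 50 reducers.
      public static class LineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(line, NullWritable.get());
        }
      }

      // Reducer: write one output record per occurrence of each line, with a
      // NullWritable key so only the line text appears in the output files.
      public static class LineReducer extends Reducer<Text, NullWritable, NullWritable, Text> {
        @Override
        protected void reduce(Text line, Iterable<NullWritable> occurrences, Context ctx)
            throws IOException, InterruptedException {
          for (NullWritable ignored : occurrences) {
            ctx.write(NullWritable.get(), line);
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "merge-small-part-files");
        job.setJarByClass(MergeSmallFiles.class);

        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(NullWritable.class);   // as in the thread: key is not written
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(50);                   // 50 merged output files

        // bzip2-compressed output, matching the bz2 part files mentioned in the thread.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // directory with the ~500 small files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

It would be submitted in the usual way, e.g. hadoop jar merge.jar MergeSmallFiles <input dir> <output dir> (the jar name here is an assumption).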