Pig >> mail # user >> Merging files


Something Something 2013-07-31, 05:26
Ben Juhn 2013-07-31, 05:34
Something Something 2013-07-31, 06:40
John Meagher 2013-07-31, 13:28
Something Something 2013-07-31, 16:21
Can't you solve for the --max-file-blocks option, given that you know the
sizes of the input files and the desired number of output files?
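
Back-of-the-envelope, assuming a 128 MB HDFS block size (an assumption;
check dfs.blocksize on your cluster):

  total input        ~ 50 files x 50 MB  = 2.5 GB
  target output size = 2.5 GB / 50 files = 50 MB
  max-file-blocks    = ceil(50 MB / 128 MB) = 1

Keep in mind the option bounds file size rather than fixing an exact count:
with --max-file-blocks=1, crush packs inputs up to one block (~128 MB) per
output file, so 2.5 GB would land nearer 20 files than 50, which still cuts
the namespace load.
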
On Wed, Jul 31, 2013 at 12:21 PM, Something Something <[EMAIL PROTECTED]> wrote:

> Thanks, John.  But I don't see an option to specify the # of output files.
>  How does Crush decide how many files to create?  Is it only based on file
> sizes?
>
> On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <[EMAIL PROTECTED]> wrote:
>
> > Here's a great tool for handling exactly that case:
> > https://github.com/edwardcapriolo/filecrush
> >
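> > A from-memory sketch of an invocation (the jar name, driver class, and
> > the trailing timestamp argument are as I recall the README, so verify
> > against the repo; only --max-file-blocks is confirmed in this thread):
> >
> >   hadoop jar filecrush.jar com.m6d.filecrush.crush.Crush \
> >     --max-file-blocks=1 \
> >     /input/dir /output/dir 20130731000000
> >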
> > On Wed, Jul 31, 2013 at 2:40 AM, Something Something
> > <[EMAIL PROTECTED]> wrote:
> > > Each bz2 file after merging is about 50Megs.  The reducers take about 9
> > > minutes.
> > >
> > > Note:  'getmerge' is not an option.  There isn't enough disk space to do a
> > > getmerge on the local production box.  Plus we need a scalable solution as
> > > these files will get a lot bigger soon.
> > >
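> > > (That would have been "hadoop fs -getmerge <hdfs-dir> <local-file>",
> > > which streams every part file into one file on the local disk, hence
> > > the space problem.)
> > >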
> > > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <[EMAIL PROTECTED]> wrote:
> > >
> > >> How big are your 50 files?  How long are the reducers taking?
> > >>
> > >> On Jul 30, 2013, at 10:26 PM, Something Something <[EMAIL PROTECTED]> wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > One of our pig scripts creates over 500 small part files.  To save on
> > >> > namespace, we need to cut down the # of files, so instead of saving 500
> > >> > small files we need to merge them into 50.  We tried the following:
> > >> >
> > >> > 1)  When we set the parallel number to 50, the Pig script takes a long
> > >> > time - for obvious reasons.
> > >> > 2)  If we use Hadoop Streaming, it puts some garbage values into the key
> > >> > field.
> > >> > 3)  We wrote our own MapReduce program that reads these 500 small part
> > >> > files & uses 50 reducers.  Basically, the mappers simply write the line &
> > >> > the reducers loop thru the values & write them out.  We set
> > >> > job.setOutputKeyClass(NullWritable.class) so that the key is not written
> > >> > to the output file.  This is performing better than Pig.  The mappers run
> > >> > very fast, but the reducers take some time to complete; still, this
> > >> > approach seems to be working well.
> > >> >
> > >> > Is there a better way to do this?  What strategy can you think of to
> > >> > increase the speed of the reducers?
> > >> >
> > >> > Any help in this regard will be greatly appreciated.  Thanks.
> > >>
> > >>
> >
>
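
For what it's worth, the hand-rolled merge job described at the bottom of
the thread might look roughly like the sketch below.  This is my
reconstruction, not the original code: it assumes plain text input and that
re-ordering lines is acceptable, since the shuffle sorts the lines it is
keyed on.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class MergeSmallFiles {

    // Map: pass each line through as the key; the NullWritable value
    // carries nothing, so the shuffle moves only the line itself.
    public static class PassThroughMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        ctx.write(line, NullWritable.get());
      }
    }

    // Reduce: write the line once per occurrence so duplicate lines
    // survive; the NullWritable output key keeps the key (and the tab
    // separator) out of the part files.
    public static class WriteLineReducer
        extends Reducer<Text, NullWritable, NullWritable, Text> {
      @Override
      protected void reduce(Text line, Iterable<NullWritable> occurrences,
          Context ctx) throws IOException, InterruptedException {
        for (NullWritable ignored : occurrences) {
          ctx.write(NullWritable.get(), line);
        }
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "merge small files");
      job.setJarByClass(MergeSmallFiles.class);
      job.setMapperClass(PassThroughMapper.class);
      job.setReducerClass(WriteLineReducer.class);
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(NullWritable.class);
      job.setOutputKeyClass(NullWritable.class); // key not written to output
      job.setOutputValueClass(Text.class);
      job.setNumReduceTasks(50);                 // 50 merged part files out
      // Compression settings omitted; the thread's files are bz2.
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

The sort-by-line shuffle is also the likely reason the reducers are the slow
part: every byte of input gets sorted by line content on its way to them.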

--
https://github.com/bearrito
@deepbearrito
John Meagher 2013-07-31, 17:28
Something Something 2013-07-31, 20:39
j.barrett Strausser 2013-07-31, 21:01
Hailey Charlie 2013-07-31, 05:32