Re: save several 64MB files in Pig Latin
Hi Pedro,
Basically, how many output files you get depends on how many reducers your
Pig job has. So if the total result data size is 100MB and you have 10
reducers, you will get 10 files, each about 10MB. Bertrand's pointer is
about specifying the number of reducers for your Pig job.
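
For example, a minimal sketch (untested; it reuses the file names from your
earlier example and assumes ~500MB of input, so 8 reducers give ~64MB per
file):

A = LOAD 'myfile.txt' USING PigStorage() AS (t:chararray);
-- A plain LOAD/STORE is map-only, so PARALLEL needs a reduce-side
-- operator; ORDER forces a reduce phase without dropping any records.
B = ORDER A BY t PARALLEL 8;  -- 8 reducers -> 8 part files of ~64MB
STORE B INTO 'multiplefiles' USING PigStorage();

Each reducer writes one part file under 'multiplefiles'.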

Johnny
On Sun, Jun 9, 2013 at 10:42 PM, Pedro Sá da Costa <[EMAIL PROTECTED]> wrote:

> I don't understand why my purpose is not clear. The previous e-mails
> explain it very clearly. I want to split a single 500MB txt file in HDFS
> into multiple files using Pig Latin. Is it possible? E.g.,
>
> A = LOAD 'myfile.txt' USING PigStorage() AS (t);
> STORE A INTO 'multiplefiles' USING PigStorage();
> -- and here it should create multiple files of a specific size
>
>
>
>
> On 10 June 2013 07:29, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
>
> > The purpose is not really clear. But if you are looking for how to
> > specify multiple reducer tasks, it is well explained in the
> > documentation.
> > http://pig.apache.org/docs/r0.11.1/perf.html#parallel
> >
> > You will get one file per reducer. It is up to you to specify the
> > right number, but be careful not to fall into the small files problem
> > in the end.
> > http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
> >
> > If you have a specific question on HDFS itself or Pig optimisation,
> > you should provide more explanation.
> > (64MB is the default block size for HDFS)
> >
> > Regards,
> >
> > Bertrand
> >
> >
> > On Mon, Jun 10, 2013 at 6:53 AM, Pedro Sá da Costa
> > <[EMAIL PROTECTED]> wrote:
> >
> > > I said 64MB, but it can be 128MB or 5KB; the number doesn't matter. I
> > > just want to extract the data and put it into several files of a
> > > specific size. Basically, I am doing a cat on a big txt file, and I
> > > want to split the content into multiple files with a fixed size.
> > >
> > >
> > > On 7 June 2013 10:14, Johnny Zhang <[EMAIL PROTECTED]> wrote:
> > >
> > > > Pedro, you can try Piggybank MultiStorage, which splits results into
> > > > different dirs/files by a specific index attribute. But I am not sure
> > > > how it can ensure the file size is 64MB. Why 64MB specifically? What's
> > > > the connection between your data and 64MB?
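> > > >
> > > > A minimal sketch (untested; assumes piggybank.jar is available and
> > > > that the first field of A is the attribute to split on):
> > > >
> > > > REGISTER /path/to/piggybank.jar;  -- hypothetical path to the jar
> > > > A = LOAD 'myfile.txt' USING PigStorage() AS (key, t);
> > > > -- writes one subdirectory under 'multiplefiles' per distinct 'key'
> > > > STORE A INTO 'multiplefiles' USING
> > > >   org.apache.pig.piggybank.storage.MultiStorage('multiplefiles', '0');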
> > > >
> > > > Johnny
> > > >
> > > >
> > > > On Fri, Jun 7, 2013 at 12:56 AM, Pedro Sá da Costa
> > > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > I am using the statement:
> > > > >
> > > > > store A into 'result-australia-0' using PigStorage('\t');
> > > > >
> > > > > to store the data in HDFS. But the problem is that this creates
> > > > > one 500MB file. Instead, I want to save several 64MB files. How do
> > > > > I do this?
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> >
> >
> >
> > --
> > Bertrand Dechoux
> >
>
>
>
> --
> Best regards,
>