Pig, mail # user - save several 64MB files in Pig Latin


Pedro Sá da Costa 2013-06-07, 07:56
Johnny Zhang 2013-06-07, 08:14
Pedro Sá da Costa 2013-06-10, 04:53
Bertrand Dechoux 2013-06-10, 05:29
Pedro Sá da Costa 2013-06-10, 05:42
Johnny Zhang 2013-06-10, 06:58
Bertrand Dechoux 2013-06-10, 09:21
Pedro Sá da Costa 2013-06-10, 09:36
Re: save several 64MB files in Pig Latin
Ruslan Al-Fakikh 2013-06-10, 11:34
Hi Pedro,

Yes, Pig Latin is always compiled to MapReduce.
Usually you don't have to specify the number of mappers (I am not sure
whether you really can). If you have a 500MB file and it is splittable,
then the number of mappers is automatically equal to 500MB / 64MB (the
block size), which is about 8. Here I assume that you have the default
block size of 64MB. If your file is not splittable, then the whole file
will go to one mapper :(
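
A minimal sketch of the arithmetic above, not taken from the thread; the file
and output names are placeholders, and mapred.max.split.size is the standard
(pre-YARN) Hadoop property for capping split size, shown only to make the
64MB assumption explicit:

SET mapred.max.split.size '67108864';  -- 64MB in bytes, matching the default block size
A = LOAD 'myfile.txt' USING PigStorage() AS (t:chararray);
STORE A INTO 'copy_of_myfile' USING PigStorage();
-- a plain LOAD/STORE compiles to a map-only MapReduce job, so a splittable
-- 500MB input yields about ceil(500/64) = 8 splits, 8 map tasks, and 8
-- part-m-NNNNN output files of roughly 64MB each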

Let me know if you have further questions.

Thanks
On Mon, Jun 10, 2013 at 1:36 PM, Pedro Sá da Costa <[EMAIL PROTECTED]> wrote:

> Yes, I understand the previous answers now. The reason for my question is
> that I was trying to "split" a file with Pig Latin by loading the file
> and writing portions of it back to HDFS. From both replies, it seems
> that Pig Latin uses MapReduce to run the scripts, correct?
>
> And in MapReduce, if I have one 500MB file and I run a job with 10 maps
> (forgetting the reducers for now), does that mean each map will read
> more or less 50MB?
>
>
>
> On 10 June 2013 11:21, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
>
> > I wasn't clear. Specifying the size of the files is not your real aim, I
> > guess, but you think that's what is needed to solve your problem, which
> > we don't know about. 500MB is not a really big file in itself and is not
> > an issue for HDFS and MapReduce.
> >
> > There is no absolute way to know how much data a reducer will produce
> > given its input, because it depends on the implementation. In order to
> > have a simple life cycle, each Reducer writes its own file. So if you
> > want smaller files, you will need to increase the number of Reducers.
> > (Same size + more files -> smaller files.) However, there is no way to
> > get files with an exact size. One obvious reason is that you would need
> > to break a record (key/value) across two files, and there is no
> > reconciliation strategy for that. It does happen between the blocks of a
> > file, but the blocks of a file are ordered, so the RecordReader knows
> > how to deal with it.
> >
> > Bertrand
> >
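
Below is a minimal Pig Latin sketch of Bertrand's point, not taken from the
thread; the relation, field, and path names are placeholders, and ORDER BY is
used only because a plain LOAD/STORE stays map-only and never reaches a
reduce phase:

A = LOAD 'myfile.txt' USING PigStorage() AS (line:chararray);
-- force a reduce phase; ORDER BY keeps every record and honours PARALLEL
B = ORDER A BY line PARALLEL 10;   -- 10 reducers -> 10 part-r-NNNNN output files
STORE B INTO 'multiplefiles' USING PigStorage();
-- with roughly 500MB of input, each output file is around 50MB, but only
-- approximately: sizes depend on how evenly the sort key spreads records
-- across the reducers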
> > On Mon, Jun 10, 2013 at 8:58 AM, Johnny Zhang <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hi, Pedro:
> > > Basically, how many files the output is split into depends on how many
> > > reducers you have in your Pig job. So if the total result data size is
> > > 100MB and you have 10 reducers, you will get 10 files of 10MB each.
> > > Bertrand's pointer is about specifying the number of reducers for your
> > > Pig job.
> > >
> > > Johnny
> > >
> > >
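
A minimal sketch of Johnny's 10-reducers-to-10-files example, not taken from
the thread; SET default_parallel is Pig's script-wide reducer count, and the
GROUP key and names are placeholders:

SET default_parallel 10;            -- use 10 reducers wherever a reduce phase runs
A = LOAD 'results.txt' USING PigStorage() AS (k:chararray, v:chararray);
B = GROUP A BY k;                   -- any reduce-side operator picks up default_parallel
C = FOREACH B GENERATE FLATTEN(A);  -- flatten the bags back into the original records
STORE C INTO 'ten_files' USING PigStorage();
-- with about 100MB of stored output and 10 reducers, expect 10 part files
-- of roughly 10MB each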
> > > On Sun, Jun 9, 2013 at 10:42 PM, Pedro Sá da Costa <[EMAIL PROTECTED]
> > > >wrote:
> > >
> > > > I don't understand why my purpose is not clear; the previous e-mails
> > > > explain it very clearly. I want to split a single 500MB txt file in
> > > > HDFS into multiple files using Pig Latin. Is it possible? E.g.,
> > > >
> > > > A = LOAD 'myfile.txt' USING PigStorage() AS (t);
> > > > STORE A INTO 'multiplefiles' USING PigStorage(); -- and here it creates
> > > > multiple files with a specific size
> > > >
> > > >
> > > >
> > > >
> > > > On 10 June 2013 07:29, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > The purpose is not really clear. But if you are looking for how to
> > > > > specify multiple Reducer tasks, it is well explained in the
> > > > > documentation:
> > > > > http://pig.apache.org/docs/r0.11.1/perf.html#parallel
> > > > >
> > > > > You will get one file per reducer. It is up to you to specify the
> > > > > right number, but be careful not to fall into the small files
> > > > > problem in the end.
> > > > > http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
> > > > >
> > > > > If you have a specific question on HDFS itself or Pig optimisation,
> > > > > you should provide more explanation.
> > > > > (64MB is the default block size for HDFS)
> > > > >
> > > > > Regards
> > > > >
> > > > > Bertrand
> > > > >
> > > > >
> > > > > On Mon, Jun 10, 2013 at 6:53 AM, Pedro Sá da Costa <
Bertrand Dechoux 2013-06-10, 11:42