Pig user mailing list: save several 64MB files in Pig Latin


Pedro Sá da Costa 2013-06-07, 07:56
Johnny Zhang 2013-06-07, 08:14
Pedro Sá da Costa 2013-06-10, 04:53
Bertrand Dechoux 2013-06-10, 05:29
Pedro Sá da Costa 2013-06-10, 05:42
Johnny Zhang 2013-06-10, 06:58
Bertrand Dechoux 2013-06-10, 09:21
Pedro Sá da Costa 2013-06-10, 09:36
Re: save several 64MB files in Pig Latin
Hi Pedro,

Yes, Pig Latin is always compiled to MapReduce.
Usually you don't have to specify the number of mappers (I am not sure
whether you really can). If you have a 500MB file and it is splittable,
then the number of mappers is automatically 500MB / 64MB (the block size),
which is around 8. Here I assume that you have the default block size
of 64MB. If your file is not splittable, then the whole file will go to one
mapper :(
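
(To make the arithmetic concrete, a minimal hypothetical map-only script:
with a splittable 500MB input and the default 64MB block size, the LOAD
below would get about 500/64 ~ 8 map tasks, and since there is no reduce
phase, STORE writes one part file per mapper.)

A = LOAD 'myfile.txt' USING PigStorage() AS (t:chararray); -- ~8 map tasks for a 500MB splittable file
STORE A INTO 'out' USING PigStorage();                     -- map-only job: one part file per map task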

Let me know if you have further questions.

Thanks
On Mon, Jun 10, 2013 at 1:36 PM, Pedro Sá da Costa <[EMAIL PROTECTED]> wrote:

> Yes, I understand the previous answers now. The reason for my question is
> that I was trying to "split" a file with Pig Latin by loading the file and
> writing portions of it back to HDFS. From both replies, it seems that Pig
> Latin uses MapReduce to compute the scripts, correct?
>
> And in MapReduce, if I have one file of 500MB and I run an example with
> 10 maps (forgetting the reducers for now), does it mean that each map will
> read more or less 50MB?
>
>
>
> On 10 June 2013 11:21, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
>
> > I wasn't clear. Specifying the size of the files is not your real aim, I
> > guess, but you think that's what is needed in order to solve your problem,
> > which we don't know about. 500MB is not a really big file in itself and is
> > not an issue for HDFS and MapReduce.
> >
> > There is no absolute way to know how much data a reducer will produce
> > given its input, because it depends on the implementation. In order to
> > have a simple life cycle, each reducer will write its own file. So if you
> > want to have smaller files, you will need to increase the number of
> > reducers. (Same size + more files -> smaller files.) However, there is no
> > way to have files with an exact size. One obvious reason is that you would
> > need to break a record (key/value) across two files, and there is no
> > reconciliation strategy for that. It does happen between blocks of a file,
> > but the blocks of a file are ordered, so the RecordReader knows how to
> > deal with it.
> >
> > Bertrand
> >
> > On Mon, Jun 10, 2013 at 8:58 AM, Johnny Zhang <[EMAIL PROTECTED]> wrote:
> >
> > > Hi, Pedro:
> > > Basically, how many files your output is split into depends on how many
> > > reducers you have in your Pig job. So if the total result data size is
> > > 100MB and you have 10 reducers, you will get 10 files, each with 10MB.
> > > Bertrand's pointer is about specifying the number of reducers for your
> > > Pig job.
> > >
> > > Johnny
> > >
> > >
> > > On Sun, Jun 9, 2013 at 10:42 PM, Pedro Sá da Costa <[EMAIL PROTECTED]> wrote:
> > >
> > > > I don't understand why my purpose is not clear; the previous e-mails
> > > > explain it very clearly. I want to split a single 500MB txt file in
> > > > HDFS into multiple files using Pig Latin. Is it possible? E.g.,
> > > >
> > > > A = LOAD 'myfile.txt' USING PigStorage() AS (t);
> > > > STORE A INTO 'multiplefiles' USING PigStorage(); -- and here it
> > > > creates multiple files with a specific size
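
(A minimal sketch of one way to do this, assuming Pig 0.11 and the thread's
hypothetical file names: STORE takes no size parameter, but forcing a reduce
phase lets the reducer count determine how many part files are written, one
per reducer, so 500MB across 10 reducers gives files of roughly 50MB each.)

SET default_parallel 10;                         -- 10 reducers -> 10 output part files
A = LOAD 'myfile.txt' USING PigStorage() AS (t:chararray);
B = ORDER A BY t;                                -- reduce-side operator; honours default_parallel
STORE B INTO 'multiplefiles' USING PigStorage(); -- ~50MB per file on average, never exact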
> > > >
> > > >
> > > >
> > > >
> > > > On 10 June 2013 07:29, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > The purpose is not really clear. But if you are looking for how to
> > > > > specify multiple reducer tasks, it is well explained in the
> > > > > documentation:
> > > > > http://pig.apache.org/docs/r0.11.1/perf.html#parallel
> > > > >
> > > > > You will get one file per reducer. It is up to you to specify the
> > > > > right number, but be careful not to fall into the small files
> > > > > problem in the end:
> > > > > http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
> > > > >
> > > > > If you have a specific question on HDFS itself or Pig optimisation,
> > > > > you should provide more explanation.
> > > > > (64MB is the default block size for HDFS.)
> > > > >
> > > > > Regards,
> > > > >
> > > > > Bertrand
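
(Per the parallel documentation linked above, the reducer count can also be
set per operator instead of globally; a sketch with hypothetical relation
names:)

B = GROUP A BY t PARALLEL 10;              -- this job's reduce phase runs with 10 reducers
STORE B INTO 'grouped' USING PigStorage(); -- one part file per reducer -> 10 files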
> > > > >
> > > > >
> > > > > On Mon, Jun 10, 2013 at 6:53 AM, Pedro Sá da Costa <
Bertrand Dechoux 2013-06-10, 11:42