Pig user mailing list: Some optimization advices


Thread:
- Jerome Pierson 2013-01-31, 17:19
- Cheolsoo Park 2013-01-31, 19:45
- Jonathan Coveney 2013-01-31, 23:27
- Jerome Pierson 2013-02-05, 16:06
- Cheolsoo Park 2013-02-05, 22:13
- Jerome Person 2013-02-05, 22:57
- Prashant Kommireddi 2013-02-05, 23:10
- Jerome Person 2013-02-06, 10:00
Re: Some optimization advices
Hi Jerome,

It's not Pig but Hadoop that splits input files. Pig Load/Store UDFs are
associated with InputFormat, OutputFormat, and RecordReader classes, and
Hadoop uses them to decide how to create splits. Here are more explanations:
http://www.quora.com/How-does-Hadoop-handle-split-input-records
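To make the split logic above concrete: Hadoop's FileInputFormat (new mapreduce API) picks a split size by clamping the HDFS block size between the configured minimum and maximum split sizes, and launches roughly one mapper per split. A minimal sketch of that rule, with illustrative sizes (the 50 GB figure and 64 MB block are taken from this thread's scenario; this mirrors, not reproduces, the real FileInputFormat code):

```java
// Sketch of how Hadoop's FileInputFormat sizes splits (new mapreduce API).
// One mapper runs per split, so a splittable 50 GB file on 64 MB blocks
// yields ~800 mappers; a non-splittable format yields a single mapper.
public class SplitSizeSketch {
    // Mirrors the clamp in FileInputFormat.computeSplitSize:
    // max(minSplitSize, min(maxSplitSize, blockSize)).
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;        // 64 MB HDFS block
        long fileSize  = 50L * 1024 * 1024 * 1024; // the 50 GB input from the thread
        // Defaults: min = 0, max = unbounded -> split size equals the block size.
        long split = computeSplitSize(0L, Long.MAX_VALUE, blockSize);
        System.out.println("approx mappers = " + (fileSize / split));
    }
}
```

With the defaults shown, the split size collapses to the block size, which is why the number of mappers tracks the number of blocks rather than the number of slave nodes.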

Thanks,
Cheolsoo
On Wed, Feb 6, 2013 at 2:00 AM, Jerome Person <[EMAIL PROTECTED]> wrote:

> It is not a gzip file. It is an XML file, which is loaded with a UDF.
> When does Pig split the input file?
> I guess my loader is wrong?
>
> Jérôme.
>
>
> On Tue, 5 Feb 2013 15:10:14 -0800,
> Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>
> > Is this a gzip file? You have to make sure the compression scheme you
> > use is splittable for more mappers to be spawned.
> >
> > -Prashant
> >
> > On Tue, Feb 5, 2013 at 2:57 PM, Jerome Person
> > <[EMAIL PROTECTED]> wrote:
> >
> > > As it is a single 50 GB file, I believe this job needs more than one
> > > mapper.
> > >
> > > I do not find any mapred.max.split.size parameter in the job
> > > configuration xml file (only mapred.min.split.size = 0).
> > >
> > > Is there any "key word" to activate parallelism in the pig
> > > script?
> > >
> > > Jérôme.
> > >
> > > On Tue, 5 Feb 2013 14:13:32 -0800,
> > > Cheolsoo Park <[EMAIL PROTECTED]> wrote:
> > >
> > > > >> But one more point, I have only one mapper running with this
> > > > >> pig job as my cluster has 4 slaves. How could it be different?
> > > >
> > > > Are you asking why only a single mapper runs even though there
> > > > are 3 more slaves available? 4 slaves doesn't mean that you will
> > > > always have 4 mappers/reducers. Hadoop launches a mapper per file
> > > > split.
> > > >
> > > > How many input files do you have?
> > > >
> > > > - If you have just one small file, Pig will launch a single
> > > > mapper. You can increase parallelism by splitting that file into
> > > > smaller splits:
> > > >
> > >
> http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop
> > > >
> > > > - If you have many small files, Pig will combine them into a
> > > > single split and launch a single mapper. In this case, you might
> > > > want to change pig.maxCombinedSplitSize:
> > > > http://pig.apache.org/docs/r0.10.0/perf.html#combine-files
> > > >
> > > > Thanks,
> > > > Cheolsoo
> > > >
> > > > On Tue, Feb 5, 2013 at 8:06 AM, Jerome Pierson
> > > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Thanks a lot. It works fine.
> > > > >
> > > > > But one more point, I have only one mapper running with this pig
> > > > > job although my cluster has 4 slaves.
> > > > > How could it be different?
> > > > >
> > > > > Regards,
> > > > > Jérôme
> > > > >
> > > > >
> > > > > On 31/01/2013 20:45, Cheolsoo Park wrote:
> > > > >
> > > > >> Hi Jerome,
> > > > >>
> > > > >> Try this:
> > > > >>
> > > > >> XmlTag = FOREACH xmlToTuple GENERATE FLATTEN($0);
> > > > >> XmlTag2 = FOREACH XmlTag {
> > > > >>     tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
> > > > >>     GENERATE *, COUNT(tag_with_amenity) AS count;
> > > > >> };
> > > > >> XmlTag3 = FOREACH (FILTER XmlTag2 BY count > 0) GENERATE
> > > > >> node_attr_id, node_attr_lon, node_attr_lat, tag;
> > > > >>
> > > > >> Thanks,
> > > > >> Cheolsoo
> > > > >>
> > > > >>
> > > > >> On Thu, Jan 31, 2013 at 9:19 AM, Jerome Pierson
> > > > >> <[EMAIL PROTECTED]> wrote:
> > > > >>
> > > > >>> Hi There,
> > > > >>>
> > > > >>> I am a beginner. I achieved something, but I guess I could
> > > > >>> have done better. Let me explain.
> > > > >>> (Pig 0.10)
> > > > >>>
> > > > >>> My data is DESCRIBEd as:
> > > > >>>
> > > > >>>   xmlToTuple: {(node_attr_id: int,node_attr_lon:
> > > > >>> chararray,node_attr_lat: chararray,tag: {(tag_attr_k:
> > > > >>> chararray,tag_attr_v: chararray)})}
> > > > >>>
> > > > >>>
> > > > >>> and DUMPs like this:
> > > > >>>
> > > > >>> ((100312088,45.2745669,-12.7776222,{(created_by,JOSM)}))
> > > > >>> ((100948454,45.2620946,-12.7849171,))
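The two parallelism knobs discussed in this thread can be set at the top of a Pig 0.10 script. A minimal sketch, assuming the loader's InputFormat is actually splittable (the property names are the ones cited above; the 128 MB value is illustrative, not a recommendation):

```pig
-- Cap split size so one large file is cut into more splits (more mappers).
SET mapred.max.split.size 134217728;

-- For many small files: bound how much Pig combines into a single split,
-- so the combined input does not collapse into one mapper.
SET pig.maxCombinedSplitSize 134217728;
```

Neither setting helps if the format itself is not splittable (e.g. gzip, or an XML loader whose RecordReader insists on reading the whole file), which is the case Prashant raises above.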
- Jerome Person 2013-02-06, 16:55
- psic 2013-02-05, 22:57