Pig >> mail # user >> Some optimization advices


Jerome Pierson 2013-01-31, 17:19
Cheolsoo Park 2013-01-31, 19:45
Jonathan Coveney 2013-01-31, 23:27
Jerome Pierson 2013-02-05, 16:06
Cheolsoo Park 2013-02-05, 22:13
Jerome Person 2013-02-05, 22:57
Re: Some optimization advices
Is this a gzip file? You have to make sure the compression scheme you use
is splittable for more mappers to be spawned.

-Prashant
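
A minimal sketch of the distinction (file names are hypothetical; recompressing the input as bzip2, which Hadoop can split, is one way out):

    -- gzip is not splittable: a single .gz input always yields one mapper
    gz_nodes  = LOAD 'data/nodes.xml.gz';
    -- bzip2 is splittable: the .bz2 file can be cut into several splits,
    -- so several mappers can run in parallel
    bz2_nodes = LOAD 'data/nodes.xml.bz2';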

On Tue, Feb 5, 2013 at 2:57 PM, Jerome Person <[EMAIL PROTECTED]> wrote:

> As it is a single 50 GB file, I believe this job needs more than one
> mapper.
>
> I do not find any mapred.max.split.size parameter in the job
> configuration xml file (only mapred.min.split.size = 0).
>
> Is there any "key word" to activate parallelism in the Pig script?
>
> Jérôme.
>
> On Tue, 5 Feb 2013 14:13:32 -0800,
> Cheolsoo Park <[EMAIL PROTECTED]> wrote:
>
> > >> But one more point, I have only one mapper running with this pig
> > >> job as my cluster has 4 slaves. How could it be different?
> >
> > Are you asking why only a single mapper runs even though there are 3
> > more slaves available? Having 4 slaves doesn't mean that you will
> > always have 4 mappers/reducers. Hadoop launches a mapper per file split.
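> >
> > (For completeness: the reduce-side keyword you may be thinking of is
> > PARALLEL, or "set default_parallel". It controls the number of
> > reducers only; the mapper count always comes from the input splits.
> > A sketch with illustrative names and values:)
> >
> >     SET default_parallel 4;                  -- reducers for all jobs
> >     grouped = GROUP nodes BY $0 PARALLEL 4;  -- or per operator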
> >
> > How many input files do you have?
> >
> > - If you have just one small file, Pig will launch a single mapper.
> > You can increase parallelism by splitting that file into smaller
> > splits:
> >
> > http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop
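> >
> > A sketch of doing this from inside the Pig script (the value is
> > illustrative; Pig forwards such properties to the Hadoop job conf):
> >
> >     -- cap each split at 128 MB so a large file yields several mappers
> >     SET mapred.max.split.size 134217728;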
> >
> > - If you have many small files, Pig will combine them into a single
> > split and launch a single mapper. In this case, you might want to
> > change pig.maxCombinedSplitSize:
> > http://pig.apache.org/docs/r0.10.0/perf.html#combine-files
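> >
> > A sketch (the value is illustrative):
> >
> >     -- upper bound on the total size of small files combined into one
> >     -- split; lowering it spreads the input over more mappers
> >     SET pig.maxCombinedSplitSize 268435456;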
> >
> > Thanks,
> > Cheolsoo
> >
> > On Tue, Feb 5, 2013 at 8:06 AM, Jerome Pierson
> > <[EMAIL PROTECTED]> wrote:
> >
> > > Thanks a lot. It works fine.
> > >
> > > But one more point: I have only one mapper running with this Pig
> > > job, although my cluster has 4 slaves.
> > > How could it be different?
> > >
> > > Regards,
> > > Jérôme
> > >
> > >
> > > On 31/01/2013 20:45, Cheolsoo Park wrote:
> > >
> > >> Hi Jerome,
> > >>
> > >> Try this:
> > >>
> > >> XmlTag = FOREACH xmlToTuple GENERATE FLATTEN ($0);
> > >> XmlTag2 = FOREACH XmlTag {
> > >>      tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
> > >>      GENERATE *, COUNT(tag_with_amenity) AS count;
> > >> };
> > >> XmlTag3 = FOREACH (FILTER XmlTag2 BY count > 0) GENERATE
> > >> node_attr_id, node_attr_lon, node_attr_lat, tag;
> > >>
> > >> Thanks,
> > >> Cheolsoo
> > >>
> > >>
> > >> On Thu, Jan 31, 2013 at 9:19 AM, Jerome Pierson
> > >> <[EMAIL PROTECTED]> wrote:
> > >>
> > >>> Hi There,
> > >>>
> > >>> I am a beginner, I achieved something, but I guess I could have
> > >>> done better. Let me explain.
> > >>> (Pig 0.10)
> > >>>
> > >>> My data's DESCRIBE output is:
> > >>>
> > >>>   xmlToTuple: {(node_attr_id: int,node_attr_lon:
> > >>> chararray,node_attr_lat: chararray,tag: {(tag_attr_k:
> > >>> chararray,tag_attr_v: chararray)})}
> > >>>
> > >>>
> > >>> and its DUMP output looks like this:
> > >>>
> > >>> ((100312088,45.2745669,-12.7776222,{(created_by,JOSM)}))
> > >>> ((100948454,45.2620946,-12.7849171,))
> > >>> ((100948519,45.2356985,-12.7707014,{(created_by,JOSM)}))
> > >>> ((704398904,45.2416667,-13.0058333,{(lat,-13.00583333),(lon,45.24166667)}))
> > >>> ((1230941976,45.0743117,-12.6888807,{(place,village)}))
> > >>> ((1230941977,45.0832807,-12.6810328,{(name,Mtsahara)}))
> > >>> ((1976927219,45.2272263,-12.7794359,))
> > >>> ((1751057677,45.2216163,-12.7825896,{(amenity,fast_food),(name,Brochetterie)}))
> > >>> ((1751057678,45.2216953,-12.7829678,{(amenity,fast_food),(name,Brochetterie)}))
> > >>> ((100948360,45.2338541,-12.7762230,{(amenity,ferry_terminal)}))
> > >>> ((362795028,45.2086809,-12.8062991,{(amenity,fuel),(operator,Total)}))
> > >>>
> > >>>
> > >>> I want to extract the records which have a certain value for the
> > >>> tag_attr_k field. For example, give me the records where
> > >>> tag_attr_k = 'amenity'. That should be:
> > >>>
> > >>> (100948360,-12.7762230,45.2338541,{(amenity,ferry_terminal)})
Jerome Person 2013-02-06, 10:00
Cheolsoo Park 2013-02-06, 16:41
Jerome Person 2013-02-06, 16:55
psic 2013-02-05, 22:57