Pig >> mail # user >> Some optimization advices


Re: Some optimization advices
Is this a gzip file? For more than one mapper to be spawned, you have to make
sure the compression codec you use is splittable (gzip is not).

-Prashant
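One way to act on this advice, sketched as a Pig script that rewrites the gzip input once using a splittable codec (bzip2). The paths are hypothetical; the compression property names are standard Hadoop/Pig settings.

```pig
-- Sketch: recompress a non-splittable gzip input as bzip2, which Hadoop can split.
SET output.compression.enabled true;
SET output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec';

raw = LOAD '/path/to/input.xml.gz' USING TextLoader();  -- hypothetical path; gzip is decompressed transparently on load
STORE raw INTO '/path/to/input-bz2';  -- later jobs reading this copy can fan out to many mappers
```

This is a one-time cost: subsequent jobs on the bzip2 copy get one mapper per split instead of one mapper for the whole 50 GB file.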

On Tue, Feb 5, 2013 at 2:57 PM, Jerome Person <[EMAIL PROTECTED]> wrote:

> As it is a single 50 GB file, I believe this job needs more than one
> mapper.
>
> I do not find any mapred.max.split.size parameter in the job
> configuration xml file (only mapred.min.split.size = 0).
>
> Is there any keyword to activate parallelism in the Pig script?
>
> Jérôme.
>
> On Tue, 5 Feb 2013 14:13:32 -0800,
> Cheolsoo Park <[EMAIL PROTECTED]> wrote:
>
> > >> But one more point, I have only one mapper running with this pig
> > >> job as my cluster has 4 slaves. How could it be different ?
> >
> > Are you asking why only a single mapper runs even though there are 3
> > more slaves available? 4 slaves doesn't mean that you will always
> > have 4 mappers/reducers. Hadoop launches a mapper per file split.
> >
> > How many input files do you have?
> >
> > - If you have just one small file, Pig will launch a single mapper.
> > You can increase parallelism by splitting that file into smaller
> > splits:
> >
> > http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop
> >
> > - If you have many small files, Pig will combine them into a single
> > split and launch a single mapper. In this case, you might want to change
> > pig.maxCombinedSplitSize:
> > http://pig.apache.org/docs/r0.10.0/perf.html#combine-files
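The two split-size properties above can be set directly in a Pig script. A minimal sketch; the 128 MB value and the input path are illustrative assumptions, not from the thread:

```pig
-- Sketch: force more mappers by capping the split size (128 MB here is illustrative).
SET mapred.max.split.size 134217728;        -- cap each split at 128 MB so a large file fans out
SET pig.maxCombinedSplitSize 134217728;     -- stop Pig from combining small files past 128 MB

raw = LOAD '/path/to/input' USING TextLoader();  -- hypothetical path
```

Note that capping the split size only helps if the input is splittable in the first place (plain text, bzip2, or a container format, not a single gzip file).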
> >
> > Thanks,
> > Cheolsoo
> >
> > On Tue, Feb 5, 2013 at 8:06 AM, Jerome Pierson
> > <[EMAIL PROTECTED]> wrote:
> >
> > > Thanks a lot. It works fine.
> > >
> > > But one more point, I have only one mapper running with this pig
> > > job as my cluster has 4 slaves.
> > > How could it be different ?
> > >
> > > Regards,
> > > Jérôme
> > >
> > >
> > > On 31/01/2013 20:45, Cheolsoo Park wrote:
> > >
> > >> Hi Jerome,
> > >>
> > >> Try this:
> > >>
> > >> XmlTag = FOREACH xmlToTuple GENERATE FLATTEN ($0);  -- unnest the outer tuple
> > >> XmlTag2 = FOREACH XmlTag {
> > >>      -- within each record, keep only the tag entries whose key is 'amenity'
> > >>      tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
> > >>      GENERATE *, COUNT(tag_with_amenity) AS count;
> > >> };
> > >> -- keep the records where at least one 'amenity' tag was found
> > >> XmlTag3 = FOREACH (FILTER XmlTag2 BY count > 0) GENERATE
> > >> node_attr_id, node_attr_lon, node_attr_lat, tag;
> > >>
> > >> Thanks,
> > >> Cheolsoo
> > >>
> > >>
> > >> On Thu, Jan 31, 2013 at 9:19 AM, Jerome Pierson
> > >> <[EMAIL PROTECTED]> wrote:
> > >>
> > >>> Hi There,
> > >>> I am a beginner, I achieved something, but I guess I could have
> > >>> done better. Let me explain.
> > >>> (Pig 0.10)
> > >>>
> > >>> My data is DESCRIBEd as:
> > >>>
> > >>>   xmlToTuple: {(node_attr_id: int,node_attr_lon:
> > >>> chararray,node_attr_lat: chararray,tag: {(tag_attr_k:
> > >>> chararray,tag_attr_v: chararray)})}
> > >>>
> > >>>
> > >>> and DUMPs like this:
> > >>>
> > >>> ((100312088,45.2745669,-12.7776222,{(created_by,JOSM)}))
> > >>> ((100948454,45.2620946,-12.7849171,))
> > >>> ((100948519,45.2356985,-12.7707014,{(created_by,JOSM)}))
> > >>> ((704398904,45.2416667,-13.0058333,{(lat,-13.00583333),(lon,45.24166667)}))
> > >>> ((1230941976,45.0743117,-12.6888807,{(place,village)}))
> > >>> ((1230941977,45.0832807,-12.6810328,{(name,Mtsahara)}))
> > >>> ((1976927219,45.2272263,-12.7794359,))
> > >>> ((1751057677,45.2216163,-12.7825896,{(amenity,fast_food),(name,Brochetterie)}))
> > >>> ((1751057678,45.2216953,-12.7829678,{(amenity,fast_food),(name,Brochetterie)}))
> > >>> ((100948360,45.2338541,-12.7762230,{(amenity,ferry_terminal)}))
> > >>> ((362795028,45.2086809,-12.8062991,{(amenity,fuel),(operator,Total)}))
> > >>>
> > >>>
> > >>> I want to extract the records which have a certain value for the
> > >>> tag_attr_k field. For example, give me the records where there is a
> > >>> tag_attr_k = amenity. That should be:
> > >>>
> > >>> (100948360,-12.7762230,45.2338541,{(amenity,ferry_terminal)})