Re: Some optimization advice
Hi Jerome,

It's not Pig but Hadoop that splits input files. Pig Load/Store UDFs are
associated with InputFormat, OutputFormat, and RecordReader classes, and
Hadoop uses them to decide how to create splits. Here is more explanation:
http://www.quora.com/How-does-Hadoop-handle-split-input-records
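
For illustration, a rough Pig Latin sketch (the jar, loader class, and
path are hypothetical): the class named in LOAD ... USING supplies the
InputFormat, and Hadoop calls getSplits() on that InputFormat to decide
how many mappers to launch.

    REGISTER myudfs.jar;
    -- The loader's getInputFormat() hands Hadoop an InputFormat; if that
    -- InputFormat reports the file as non-splittable, the whole file
    -- becomes a single split and therefore a single mapper.
    raw = LOAD '/data/nodes.xml' USING myudfs.XmlLoader('node');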

Thanks,
Cheolsoo
On Wed, Feb 6, 2013 at 2:00 AM, Jerome Person <[EMAIL PROTECTED]> wrote:

> It is not a gzip file. It is an XML file which is loaded with a UDF.
> When does Pig split the input file?
> I guess my loader is wrong?
>
> Jérôme.
>
>
> On Tue, 5 Feb 2013 15:10:14 -0800,
> Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>
> > Is this a gzip file? You have to make sure the compression scheme you
> > use is splittable for more mappers to be spawned.
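> >
> > For example, a quick sketch (paths are placeholders; assumes a Hadoop
> > version whose bzip2 codec supports splitting, which gzip never does):
> >
> >     raw_gz  = LOAD '/data/input.xml.gz';   -- gzip: one split, one mapper
> >     raw_bz2 = LOAD '/data/input.xml.bz2';  -- bzip2: can yield many splits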
> >
> > -Prashant
> >
> > On Tue, Feb 5, 2013 at 2:57 PM, Jerome Person
> > <[EMAIL PROTECTED]> wrote:
> >
> > > As it is a single 50 GB file, I believe this job needs more than one
> > > mapper.
> > >
> > > I cannot find any mapred.max.split.size parameter in the job
> > > configuration XML file (only mapred.min.split.size = 0).
> > >
> > > Is there any "keyword" to activate parallelism in the Pig
> > > script?
> > >
> > > Jérôme.
> > >
> > > On Tue, 5 Feb 2013 14:13:32 -0800,
> > > Cheolsoo Park <[EMAIL PROTECTED]> wrote:
> > >
> > > > >> But one more point, I have only one mapper running with this
> > > > >> Pig job as my cluster has 4 slaves. How could it be different?
> > > >
> > > > Are you asking why only a single mapper runs even though there
> > > > are 3 more slaves available? Having 4 slaves doesn't mean that
> > > > you will always have 4 mappers/reducers. Hadoop launches one
> > > > mapper per file split.
> > > >
> > > > How many input files do you have?
> > > >
> > > > - If you have just one small file, Pig will launch a single
> > > > mapper. You can increase parallelism by splitting that file into
> > > > smaller splits (see the sketch after this list):
> > > > http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop
> > > >
> > > > - If you have many small files, Pig will combine them into a
> > > > single split and launch a single mapper. In this case, you might
> > > > want to change pig.maxCombinedSplitSize:
> > > > http://pig.apache.org/docs/r0.10.0/perf.html#combine-files
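> > > >
> > > > A minimal sketch of both knobs in a Pig script (the values are
> > > > illustrative, and lowering mapred.max.split.size only helps if the
> > > > loader's InputFormat is splittable):
> > > >
> > > >     -- one large splittable file: cap the split size so Hadoop
> > > >     -- creates more splits and therefore more mappers
> > > >     set mapred.max.split.size 134217728;     -- 128 MB
> > > >     -- many small files: limit how much Pig combines per split
> > > >     set pig.maxCombinedSplitSize 134217728;  -- 128 MB
> > > >     raw = LOAD '/data/input' USING PigStorage();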
> > > >
> > > > Thanks,
> > > > Cheolsoo
> > > >
> > > > On Tue, Feb 5, 2013 at 8:06 AM, Jerome Pierson
> > > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Thanks a lot. It works fine.
> > > > >
> > > > > But one more point, I have only one mapper running with this Pig
> > > > > job as my cluster has 4 slaves.
> > > > > How could it be different?
> > > > >
> > > > > Regards,
> > > > > Jérôme
> > > > >
> > > > >
> > > > > On 31/01/2013 20:45, Cheolsoo Park wrote:
> > > > >
> > > > >> Hi Jerome,
> > > > >>
> > > > >> Try this:
> > > > >>
> > > > >> XmlTag = FOREACH xmlToTuple GENERATE FLATTEN ($0);
> > > > >> XmlTag2 = FOREACH XmlTag {
> > > > >>      tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
> > > > >>      GENERATE *, COUNT(tag_with_amenity) AS count;
> > > > >> };
> > > > >> XmlTag3 = FOREACH (FILTER XmlTag2 BY count > 0) GENERATE
> > > > >> node_attr_id, node_attr_lon, node_attr_lat, tag;
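> > > > >>
> > > > >> In other words, the nested FILTER counts, per record, the tags
> > > > >> whose key is 'amenity', and the final FILTER keeps only the
> > > > >> records where that count is positive.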
> > > > >>
> > > > >> Thanks,
> > > > >> Cheolsoo
> > > > >>
> > > > >>
> > > > >> On Thu, Jan 31, 2013 at 9:19 AM, Jerome Pierson
> > > > >> <[EMAIL PROTECTED]> wrote:
> > > > >>
> > > > >>> Hi There,
> > > > >>>
> > > > >>> I am a beginner; I achieved something, but I guess I could
> > > > >>> have done better. Let me explain.
> > > > >>> (Pig 0.10)
> > > > >>>
> > > > >>> DESCRIBE shows my data as:
> > > > >>>
> > > > >>>   xmlToTuple: {(node_attr_id: int,node_attr_lon:
> > > > >>> chararray,node_attr_lat: chararray,tag: {(tag_attr_k:
> > > > >>> chararray,tag_attr_v: chararray)})}
> > > > >>>
> > > > >>>
> > > > >>> and DUMP shows it like this:
> > > > >>>
> > > > >>> ((100312088,45.2745669,-12.7776222,{(created_by,JOSM)}))
> > > > >>> ((100948454,45.2620946,-12.7849171,))