Pig >> mail # user >> pig generated 2 map-only jobs ?


Yang 2012-06-12, 06:21
Alan Gates 2012-06-12, 21:14
Yang 2012-06-17, 04:06
Alan Gates 2012-06-17, 06:51
Yang 2012-06-17, 16:39
Re: pig generated 2 map-only jobs ?
I feel this should need only one map-only job. Can you run explain on the script (explain -script xxxx) and post the output?
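
A sketch of what that could look like from the Grunt shell, assuming the script is saved as myscript.pig (a placeholder name). The quoted script references $input_suspects, $output and $min_split_size, so they need values before explain can parse it; the values below are made up for illustration.

```
-- Placeholder script name and parameter values; substitute your own.
grunt> explain -script myscript.pig -param input_suspects=/tmp/suspects -param output=/tmp/verdict -param min_split_size=134217728
```

The Map Reduce Plan section of the output lists each generated job, so it should show where the pipeline gets split in two.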

On Sun, Jun 17, 2012 at 9:39 AM, Yang <[EMAIL PROTECTED]> wrote:

> Thanks, Alan, here it is
>
>
>
>
> SET mapred.max.jobs.per.node 1;
> SET mapred.max.maps.per.node  8;
> SET mapred.tasktracker.map.tasks.maximum   8;
> SET mapred.map.tasks 48;
> SET mapred.min.split.size  $min_split_size;
> SET pig.noSplitCombination true;
> SET mapred.map.tasks.speculative.execution false;
> SET mapred.reduce.tasks.speculative.execution false;
>
>
>
>
>
> REGISTER ./myjar.jar;
>
> DEFINE search_index  com.mycompany.SearchUdf();
> DEFINE verify_model  com.mycompany.VerifyDataUsingModelUdf();
> DEFINE verify_model2  com.mycompany.VerifyDataUsingModelUdf();
>
>
> suspects = LOAD '$input_suspects' USING PigStorage('\t') AS (
> --__START_SCHEMA__
> .....
> --__END_SCHEMA__
> );
>
>
>
>
> similars = FOREACH suspects GENERATE
>            *,
>            FLATTEN (
>            search_index(
>                    name,
>                    address,
>                    city,
>                    state,
>                    zip,
>                    phone
>            )) ;
>
>
> similars = FOREACH similars GENERATE
>    *,
>
>    top_10_similars::state AS candidate_state,
>    top_10_similars::zip AS candidate_zip,
>    top_10_similars::phone AS candidate_phone,
>    top_10_similars::profNames AS candidate_profNames,
>    top_10_similars::categories AS candidate_categories,
>    top_10_similars::cgId AS candidate_cgId,
>    top_10_similars::canonName AS candidate_canonName,
>    top_10_similars::canonAddress AS candidate_canonAddress,
>    top_10_similars::privateId AS candidate_id
> ;
>
> similars = FILTER similars BY NOT (legacy_ids IS NOT NULL AND
> candidate_cgId IS NOT NULL AND legacy_ids != candidate_cgId
>                    OR
>                    legacy_ids IS NULL AND candidate_cgId IS NULL
>                    )
> ;
>
>
> bad = FILTER similars BY ( categories is NULL OR categories == '' OR
> categories == '6019') ;
> good = FILTER similars BY NOT ( categories is NULL OR categories == '' OR
> categories == '6019') ;
>
> verdict1 = FOREACH good GENERATE
>    *,
>
>    verify_model( name,
>    address,
>    city,
>    .....
>    )
>
> ;
>
> verdict2 = FOREACH bad GENERATE
>    *,
>
>    verify_model2(
>    name,
>    address,
>    city,
>    )
> ;
>
>
>
> verdict = UNION verdict1, verdict2;
> STORE verdict INTO '$output';
>
>
> On Sat, Jun 16, 2012 at 11:51 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
>
> > Apache mailing lists strip all attachments.  You'll have to inline the
> > script in your message or post it somewhere and send a link.
> >
> > Alan.
> >
> > On Jun 16, 2012, at 9:06 PM, Yang wrote:
> >
> > > Thanks Alan.
> > >
> > >
> > > I attached the trimmed version of my script.
> > >
> > >
> > > Basically, the similars var generates a bag and explodes it; after that, each output record is filtered through a UDF.
> > >
> > > I suspect that the 2 maps are due to the explosion, but it should be possible to put the above sequence into a single map.
> > >
> > >
> > > Yang
> > >
> > > On Tue, Jun 12, 2012 at 2:14 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
> > > There are cases where it would do this, such as unioning two inputs. Can you send your script to the list?
> > >
> > > Alan.
> > >
> > > On Jun 11, 2012, at 11:21 PM, Yang wrote:
> > >
> > > > > This is what happened with my Pig script.
> > > > > Why would it generate 2 map-only jobs?
> > > > > Wouldn't the optimization process chain both mappers together and keep only 1 mapper stage?
> > > >
> > > >
> > > > thanks
> > > > Yang
> > >
> > >
> >
> >
>
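
One way to narrow this down might be a reduced script with the same shape as the one quoted above (one load, two complementary filters, a FOREACH on each branch, then a UNION) that uses only built-in functions. The sketch below is an assumption-laden stand-in, not the original script: the input path 'suspects.tsv', the output path 'verdict_out', the two-column schema and the field name checked are all hypothetical, and UPPER and LOWER merely take the place of verify_model and verify_model2.

```
-- Hypothetical reduced case: same shape as the quoted script, built-ins only.
suspects = LOAD 'suspects.tsv' USING PigStorage('\t')
           AS (name:chararray, categories:chararray);

-- Same complementary split as in the quoted script.
bad  = FILTER suspects BY (categories IS NULL OR categories == '' OR categories == '6019');
good = FILTER suspects BY NOT (categories IS NULL OR categories == '' OR categories == '6019');

-- UPPER and LOWER are stand-ins for verify_model and verify_model2.
verdict1 = FOREACH good GENERATE *, UPPER(name) AS checked;
verdict2 = FOREACH bad  GENERATE *, LOWER(name) AS checked;

verdict = UNION verdict1, verdict2;
STORE verdict INTO 'verdict_out';
```

Comparing the explain output for this reduced script with the output for the full script should indicate whether the second job comes from the filter/union shape or from something specific to the UDFs and the FLATTEN step.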