Re: pig generated 2 map-only jobs ?
I feel it should be only one map job. Can you run explain on the script? (explain -script xxxx)
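
A minimal sketch of that explain run from the Grunt shell. It assumes the script is saved as myscript.pig (a hypothetical file name) and that its $input_suspects, $output, and $min_split_size parameters are passed with -param; all paths and values below are placeholders, not taken from the thread. The output should contain a Map Reduce Plan section, and counting the MapReduce nodes listed there shows how many jobs Pig intends to launch.

```
-- Grunt shell; the file name and every parameter value are placeholders
explain -param input_suspects=/path/to/suspects -param output=/path/to/output -param min_split_size=134217728 -script myscript.pig;

-- same plan, written to a file instead of the console
explain -out explain.out -param input_suspects=/path/to/suspects -param output=/path/to/output -param min_split_size=134217728 -script myscript.pig;
```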

On Sun, Jun 17, 2012 at 9:39 AM, Yang <[EMAIL PROTECTED]> wrote:

> Thanks, Alan, here it is
>
>
>
>
> SET mapred.max.jobs.per.node 1;
> SET mapred.max.maps.per.node  8;
> SET mapred.tasktracker.map.tasks.maximum   8;
> SET mapred.map.tasks 48;
> SET mapred.min.split.size  $min_split_size;
> SET pig.noSplitCombination true;
> SET mapred.map.tasks.speculative.execution false;
> SET mapred.reduce.tasks.speculative.execution false;
>
>
>
>
>
> REGISTER ./myjar.jar;
>
> DEFINE search_index  com.mycompany.SearchUdf();
> DEFINE verify_model  com.mycompany.VerifyDataUsingModelUdf();
> DEFINE verify_model2  com.mycompany.VerifyDataUsingModelUdf();
>
>
> suspects = LOAD '$input_suspects' USING PigStorage('\t') AS (
> --__START_SCHEMA__
> .....
> --__END_SCHEMA__
> );
>
>
>
>
> similars = FOREACH suspects GENERATE
>            *,
>            FLATTEN (
>            search_index(
>                    name,
>                    address,
>                    city,
>                    state,
>                    zip,
>                    phone
>            )) ;
>
>
> similars = FOREACH similars GENERATE
>    *,
>
>    top_10_similars::state AS candidate_state,
>    top_10_similars::zip AS candidate_zip,
>    top_10_similars::phone AS candidate_phone,
>    top_10_similars::profNames AS candidate_profNames,
>    top_10_similars::categories AS candidate_categories,
>    top_10_similars::cgId AS candidate_cgId,
>    top_10_similars::canonName AS candidate_canonName,
>    top_10_similars::canonAddress AS candidate_canonAddress,
>    top_10_similars::privateId AS candidate_id
> ;
>
> similars = FILTER similars BY NOT (legacy_ids IS NOT NULL AND
> candidate_cgId IS NOT NULL AND legacy_ids != candidate_cgId
>                    OR
>                    legacy_ids IS NULL AND candidate_cgId IS NULL
>                    )
> ;
>
>
> bad = FILTER similars BY ( categories is NULL OR categories == '' OR
> categories == '6019') ;
> good = FILTER similars BY NOT ( categories is NULL OR categories == '' OR
> categories == '6019') ;
>
> verdict1 = FOREACH good GENERATE
>    *,
>
>    verify_model( name,
>    address,
>    city,
>    .....
>    )
>
> ;
>
> verdict2 = FOREACH bad GENERATE
>    *,
>
>    verify_model2(
>    name,
>    address,
>    city,
>    )
> ;
>
>
>
> verdict = UNION verdict1, verdict2;
> STORE verdict INTO '$output';
>
>
> On Sat, Jun 16, 2012 at 11:51 PM, Alan Gates <[EMAIL PROTECTED]>
> wrote:
>
> > Apache mailing lists strip all attachments.  You'll have to inline the
> > script in your message or post it somewhere and send a link.
> >
> > Alan.
> >
> > On Jun 16, 2012, at 9:06 PM, Yang wrote:
> >
> > > Thanks Alan.
> > >
> > >
> > > I attached the trimmed version of my script.
> > >
> > >
> > > Basically, the similars var generates a bag and explodes it; after that,
> > > each output record is filtered through a UDF.
> > >
> > > I suspect that the 2 maps are due to the explosion, but it should be
> > > possible to put the above sequence into a single map.
> > >
> > >
> > > Yang
> > >
> > > On Tue, Jun 12, 2012 at 2:14 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
> > > There are cases where it would do this, such as unioning two inputs.
> > > Can you send your script to the list?
> > >
> > > Alan.
> > >
> > > On Jun 11, 2012, at 11:21 PM, Yang wrote:
> > >
> > > > > This is what happened with my Pig script.
> > > > > Why would it generate 2 map-only jobs?
> > > > > Wouldn't the optimization process chain both mappers together and keep only
> > > > > 1 mapper stage?
> > > >
> > > >
> > > > thanks
> > > > Yang
> > >
> > >
> >
> >
>