Pig, mail # user - pig generated 2 map-only jobs ?


Re: pig generated 2 map-only jobs ?
Yang 2012-06-17, 16:39
Thanks, Alan. Here it is:

SET mapred.max.jobs.per.node 1;
SET mapred.max.maps.per.node 8;
SET mapred.tasktracker.map.tasks.maximum 8;
SET mapred.map.tasks 48;
SET mapred.min.split.size $min_split_size;
SET pig.noSplitCombination true;
SET mapred.map.tasks.speculative.execution false;
SET mapred.reduce.tasks.speculative.execution false;

REGISTER ./myjar.jar;

DEFINE search_index com.mycompany.SearchUdf();
DEFINE verify_model com.mycompany.VerifyDataUsingModelUdf();
DEFINE verify_model2 com.mycompany.VerifyDataUsingModelUdf();

suspects = LOAD '$input_suspects' USING PigStorage('\t') AS (
--__START_SCHEMA__
.....
--__END_SCHEMA__
);
similars = FOREACH suspects GENERATE
    *,
    FLATTEN(
        search_index(
            name,
            address,
            city,
            state,
            zip,
            phone
        )
    );

similars = FOREACH similars GENERATE
    *,

    top_10_similars::state AS candidate_state,
    top_10_similars::zip AS candidate_zip,
    top_10_similars::phone AS candidate_phone,
    top_10_similars::profNames AS candidate_profNames,
    top_10_similars::categories AS candidate_categories,
    top_10_similars::cgId AS candidate_cgId,
    top_10_similars::canonName AS candidate_canonName,
    top_10_similars::canonAddress AS candidate_canonAddress,
    top_10_similars::privateId AS candidate_id
;

similars = FILTER similars BY NOT (
    (legacy_ids IS NOT NULL AND candidate_cgId IS NOT NULL AND legacy_ids != candidate_cgId)
    OR
    (legacy_ids IS NULL AND candidate_cgId IS NULL)
);

bad = FILTER similars BY (categories IS NULL OR categories == '' OR categories == '6019');
good = FILTER similars BY NOT (categories IS NULL OR categories == '' OR categories == '6019');

verdict1 = FOREACH good GENERATE
    *,

    verify_model( name,
    address,
    city,
    .....
    )

;

verdict2 = FOREACH bad GENERATE
    *,

    verify_model2(
    name,
    address,
    city,
    )
;

verdict = UNION verdict1, verdict2;
STORE verdict INTO '$output';

On Sat, Jun 16, 2012 at 11:51 PM, Alan Gates <[EMAIL PROTECTED]> wrote:

> Apache mailing lists strip all attachments.  You'll have to inline the
> script in your message or post it somewhere and send a link.
>
> Alan.
>
> On Jun 16, 2012, at 9:06 PM, Yang wrote:
>
> > Thanks Alan.
> >
> >
> > I attached the trimmed version of my script.
> >
> >
> > Basically, the similars variable generates a bag and explodes it; after that,
> > each output record is filtered through a UDF.
> >
> > I suspect that the 2 maps are due to the explosion, but it should be
> > possible to put the above sequence into a single map.
> >
> >
> > Yang
> >
> > On Tue, Jun 12, 2012 at 2:14 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
> > There are cases where it would do this, such as unioning two inputs.
> > Can you send your script to the list?
> >
> > Alan.
> >
> > On Jun 11, 2012, at 11:21 PM, Yang wrote:
> >
> > > this is what happened with my pig script.
> > > why would it generate 2 map-only jobs?
> > > wouldn't the optimization process chain together both mappers and keep
> > > only 1 mapper stage?
> > >
> > >
> > > thanks
> > > Yang
> >
> >
>
>
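
One way to check where the second map-only job comes from is to ask Pig for its plan instead of running the script. The sketch below is only an illustration: it is a hypothetical, trimmed-down stand-in for the script in this thread (the input path, the two-field schema, and the absence of the UDFs are all assumptions), keeping only the shape Alan mentions, two branches of one input joined by a UNION. EXPLAIN prints the logical, physical, and MapReduce plans for an alias, and the MapReduce plan shows how many jobs Pig intends to launch and which operators land in each.

```pig
-- Hypothetical stand-in for the script above: one input, two filtered
-- branches, and a UNION. The path and schema are placeholders.
suspects = LOAD 'suspects.tsv' USING PigStorage('\t')
           AS (name:chararray, categories:chararray);

bad  = FILTER suspects BY (categories IS NULL OR categories == '' OR categories == '6019');
good = FILTER suspects BY NOT (categories IS NULL OR categories == '' OR categories == '6019');

verdict = UNION good, bad;

-- Print the plans for the final alias rather than executing the job.
EXPLAIN verdict;
```

Running EXPLAIN on the final alias of the real script (verdict) should show whether the extra job is attributed to the split and UNION or to the FLATTEN step; which of the two it is cannot be told from the script text alone.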