-
pig generated 2 map-only jobs ?
Yang 2012-06-12, 06:21
This is what happened with my Pig script: why would it generate 2 map-only jobs? Wouldn't the optimizer chain both mappers together and keep only 1 map stage? Thanks, Yang
-
Re: pig generated 2 map-only jobs ?
Alan Gates 2012-06-12, 21:14
There are cases where it would do this, such as unioning two inputs. Can you send your script to the list?
Alan.
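For illustration, a script of the shape Alan mentions (two inputs combined with UNION) might look like the sketch below. The paths, aliases, and schema are hypothetical and do not come from Yang's script. How many jobs Pig plans for it can depend on the Pig version and optimizer settings, so treat it as something to run explain against rather than as a guaranteed reproduction.

```pig
-- Hypothetical paths, aliases, and schema, for illustration only.
a = LOAD 'input_a' USING PigStorage('\t') AS (name:chararray, city:chararray);
b = LOAD 'input_b' USING PigStorage('\t') AS (name:chararray, city:chararray);

-- UNION merges the two relations; there is no grouping or join here,
-- so nothing in this script requires a reduce phase.
both = UNION a, b;

STORE both INTO 'output_union' USING PigStorage('\t');
```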
-
Re: pig generated 2 map-only jobs ?
Yang 2012-06-17, 04:06
Thanks, Alan. I attached a trimmed version of my script. Basically, the similars variable generates a bag and explodes it; after that, each output record is filtered through a UDF.
I suspect that the 2 maps are due to the explosion, but it should be possible to put the above sequence into a single map. Yang
-
Re: pig generated 2 map-only jobs ?
Alan Gates 2012-06-17, 06:51
Apache mailing lists strip all attachments. You'll have to inline the script in your message or post it somewhere and send a link.
Alan.
-
Re: pig generated 2 map-only jobs ?
Yang 2012-06-17, 16:39
Thanks, Alan, here it is:

SET mapred.max.jobs.per.node 1;
SET mapred.max.maps.per.node 8;
SET mapred.tasktracker.map.tasks.maximum 8;
SET mapred.map.tasks 48;
SET mapred.min.split.size $min_split_size;
SET pig.noSplitCombination true;
SET mapred.map.tasks.speculative.execution false;
SET mapred.reduce.tasks.speculative.execution false;

REGISTER ./myjar.jar;

DEFINE search_index com.mycompany.SearchUdf();
DEFINE verify_model com.mycompany.VerifyDataUsingModelUdf();
DEFINE verify_model2 com.mycompany.VerifyDataUsingModelUdf();

suspects = LOAD '$input_suspects' USING PigStorage('\t') AS (
    --__START_SCHEMA__
    .....
    --__END_SCHEMA__
);

similars = FOREACH suspects GENERATE
    *,
    FLATTEN (
        search_index(
            name,
            address,
            city,
            state,
            zip,
            phone
        )) ;

similars = FOREACH similars GENERATE
    *,
    top_10_similars::state AS candidate_state,
    top_10_similars::zip AS candidate_zip,
    top_10_similars::phone AS candidate_phone,
    top_10_similars::profNames AS candidate_profNames,
    top_10_similars::categories AS candidate_categories,
    top_10_similars::cgId AS candidate_cgId,
    top_10_similars::canonName AS candidate_canonName,
    top_10_similars::canonAddress AS candidate_canonAddress,
    top_10_similars::privateId AS candidate_id
;

similars = FILTER similars BY NOT (legacy_ids IS NOT NULL AND candidate_cgId IS NOT NULL AND legacy_ids != candidate_cgId
    OR
    legacy_ids IS NULL AND candidate_cgId IS NULL
    )
;

bad = FILTER similars BY ( categories is NULL OR categories == '' OR categories == '6019') ;
good = FILTER similars BY NOT ( categories is NULL OR categories == '' OR categories == '6019') ;

verdict1 = FOREACH good GENERATE
    *,
    verify_model( name,
        address,
        city,
        .....
    )
;

verdict2 = FOREACH bad GENERATE
    *,
    verify_model2(
        name,
        address,
        city,
    )
;

verdict = UNION verdict1, verdict2;
STORE verdict INTO '$output';
-
Re: pig generated 2 map-only jobs ?
Daniel Dai 2012-06-18, 02:00
I feel it should be only one map. Can you run explain on the script? (explain -script xxxx)
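As a sketch of what Daniel is asking for: explain can be run from the Grunt shell against a whole script file. The file name myscript.pig and the parameter values below are hypothetical placeholders; the three -param entries are included only because the posted script references $input_suspects, $output, and $min_split_size.

```pig
-- Run from the Grunt shell. The script name and all parameter values
-- are hypothetical placeholders; substitute your own.
explain -param input_suspects=/tmp/suspects.tsv -param output=/tmp/verdict -param min_split_size=134217728 -script myscript.pig
```

The output includes the logical, physical, and MapReduce plans; the MapReduce plan is the part that shows how many jobs Pig intends to launch and which operators land in each map stage.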