Reducers slowing down? (UNCLASSIFIED)
Classification: UNCLASSIFIED
Caveats: NONE

Hello, I'm using Pig 0.6.0 to run the following script on a 27-datanode
cluster running Red Hat Enterprise Linux 5.4:

-- Holds the Pig UDF wrapper around the SecondString SoftTFIDF function
REGISTER /home/CandidateIdentification.jar;

-- SecondString itself
REGISTER /home/secondstring-20060615.jar;

-- |People| ~ 62,500,000 from the English GigaWord 4th Edition
People = LOAD '/data/UniquePeoplePerStory' USING PigStorage(',') AS
(file:chararray, name:chararray);

-- |Actors| ~ 8,000 from the Stanford Movie Database
Actors = LOAD '/data/Actors' USING PigStorage(',') AS (actor:chararray);

-- |ToCompare| ~ 500,000,000,000
ToCompare = CROSS Actors, People PARALLEL 30;

-- Score 'em and store 'em
Results = FOREACH ToCompare GENERATE $0, $1, $2,
ARL.CandidateIdentificationUDF.Similarity($2, $0);

STORE Results INTO '/data/ScoredPeople' USING PigStorage(',');

The reducers produced the first 100,000,000,000 output records in about 25
hours. After 75 hours, however, the job has produced only 140,000,000,000
in total (rather than the 300,000,000,000 I was extrapolating), and it seems
to be producing them at a slower and slower rate. What is going on? Did I
screw something up?
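
For reference, the figures quoted above can be checked with a few lines of arithmetic. This is only a sketch in Python (not part of the Pig job); the variable names are made up for illustration, and the remaining-time estimate assumes the recent rate holds steady, which is optimistic given that the rate appears to still be falling.

```python
# Figures quoted in the message above.
actors = 8_000                    # |Actors|
people = 62_500_000               # |People|
expected_total = actors * people  # records the CROSS should produce

first_records, first_hours = 100_000_000_000, 25
total_records, total_hours = 140_000_000_000, 75

# Throughput in records per hour for the first 25 hours and the 50 hours after.
initial_rate = first_records / first_hours
later_rate = (total_records - first_records) / (total_hours - first_hours)

# What a constant initial rate would have produced by hour 75.
extrapolated = initial_rate * total_hours

# Assumption: the later rate stays constant from here on.
hours_remaining = (expected_total - total_records) / later_rate

print(f"expected total:   {expected_total:,}")
print(f"initial rate:     {initial_rate:,.0f} records/hour")
print(f"later rate:       {later_rate:,.0f} records/hour")
print(f"extrapolated:     {extrapolated:,.0f} records by hour {total_hours}")
print(f"hours remaining:  {hours_remaining:,.0f} (if the later rate holds)")
```

By these numbers the rate over the last 50 hours is 800,000,000 records per hour, one fifth of the initial 4,000,000,000, and the remaining 360,000,000,000 records would need about 450 more hours even if the slowdown stopped now.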

