Pig >> mail # user >> Parallelism for small input data


Dipesh Kumar Singh 2013-01-13, 12:47
Dmitriy Ryaboy 2013-01-13, 22:54
Re: Parallelism for small input data
Well, if you set the split size to 1, you should get a per-line split.
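
As a rough sketch of how that could look at the top of the script: the split size is given in bytes, so instead of 1 this uses about 8 KB (1.2 MB / 150 slots), which should give roughly 150 map tasks. The property names are assumptions about the setup. `mapred.max.split.size` is the Hadoop 1.x name, and `pig.maxCombinedSplitSize` is lowered too so that Pig does not combine the small splits back into one. Please check both against your Hadoop and Pig versions.

```
-- Assumption: Hadoop 1.x property name; newer releases use
-- mapreduce.input.fileinputformat.split.maxsize instead.
set mapred.max.split.size 8192;

-- Assumption: Pig's split combination is enabled (the default), so cap the
-- combined split size as well; otherwise the small splits may be merged again.
set pig.maxCombinedSplitSize 8192;
```

The rest of the script (register, define, load, foreach, store) would stay as it is below these two lines.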
2013/1/13 Dipesh Kumar Singh <[EMAIL PROTECTED]>

> Hello users,
>
> I have an input file (1.2 MB) that contains a list of words/phrases, one
> per line. I read each line and pass the phrase to a UDF that checks and
> corrects it.
> The UDF (it simply extends EvalFunc) reads a 6 MB dictionary file for
> each input phrase.
>
> Since the input dataset is very small, Pig launches only one mapper (out
> of 150 slots) to process the input, so no parallelism is gained.
>
> I would like some input/suggestions on how this kind of scenario can be
> implemented efficiently in Pig.
>
> =====code snip===>
> register 'Dudfs.jar';
> define CorrectPhrases CorrectPhrases('/user/home/big.txt');
> input_term = load '/user/home/input.txt' using PigStorage('\n') as (phrase:chararray);
> checked_term = foreach input_term generate phrase, CorrectPhrases(phrase) as correctedTerms;
> store checked_term into '/user/home/corrected_phrases' using PigStorage(',');
>
> ==================================>
> Forgive me if I am heading in the wrong direction; feel free to correct me
> and suggest other approaches.
>
> Thanks in advance!
>
>
> Regards,
> Dipesh
> --
> Dipesh Kr. Singh
>

--
Best regards,
 Vitalii Tymchyshyn
Dipesh Kumar Singh 2013-01-15, 18:22