Parallelism for small input data
Dipesh Kumar Singh 2013-01-13, 12:47
Hello users,
I have an input file (1.2 MB) that contains a list of words/phrases, one per line. I read each phrase and pass it to a UDF that checks/corrects it. The UDF (a simple class extending EvalFunc) reads a 6 MB dictionary file for each input phrase.
Since the input dataset is very small, Pig launches only one mapper (out of 150 slots) to process the input, so no parallelism is gained.
I would like some suggestions on how this kind of scenario can be implemented efficiently in Pig.
=====code snip===
register 'Dudfs.jar';
define CorrectPhrases CorrectPhrases('/user/home/big.txt');
input_term = load '/user/home/input.txt' using PigStorage('\n') as (phrase:chararray);
checked_term = foreach input_term generate phrase, CorrectPhrases(phrase) as correctedTerms;
store checked_term into '/user/home/corrected_phrases' using PigStorage(',');
==================================
Forgive me if I am heading in the wrong direction; feel free to correct me and suggest other approaches.
Thanks in advance!
Regards,
Dipesh
-- Dipesh Kr. Singh
Re: Parallelism for small input data
Dmitriy Ryaboy 2013-01-13, 22:54
"The udf (simple extends eval func) refers and reads a dictionary file of 6 MB for each input phrase."
Any reason to keep re-reading the dictionary instead of just reading it once?
D
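A minimal sketch of the "read it once" idea, in plain Java with no Pig dependency so it can be run on its own. The class name `CachedDictionary`, the `loadCount` counter, and the one-entry-per-line dictionary format are assumptions made for illustration; the original UDF's code and the layout of big.txt are not shown in this thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Supplier;

/**
 * Hypothetical helper: loads a dictionary on first use and keeps it in
 * memory, so later lookups do not touch the file again.
 * Assumes one dictionary entry per line (an assumption, not from the thread).
 */
final class CachedDictionary {
    private final Supplier<Reader> source; // opens the dictionary when needed
    private Set<String> words;             // null until the first lookup
    private int loadCount = 0;             // for demonstration only

    CachedDictionary(Supplier<Reader> source) {
        this.source = source;
    }

    /** Returns the cached entries, reading the source only on the first call. */
    private Set<String> words() {
        if (words == null) {
            Set<String> loaded = new HashSet<>();
            try (BufferedReader in = new BufferedReader(source.get())) {
                String line;
                while ((line = in.readLine()) != null) {
                    String entry = line.trim().toLowerCase(Locale.ROOT);
                    if (!entry.isEmpty()) {
                        loaded.add(entry);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            words = loaded;
            loadCount++;
        }
        return words;
    }

    /** Case-insensitive membership check; null phrases are never found. */
    boolean contains(String phrase) {
        return phrase != null
                && words().contains(phrase.trim().toLowerCase(Locale.ROOT));
    }

    /** Number of times the source was actually read. */
    int loadCount() {
        return loadCount;
    }
}
```

In a UDF, an object like this would presumably be held in an instance field and consulted from the per-tuple method, so each mapper pays the 6 MB read once rather than once per phrase.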
Re: Parallelism for small input data
Vitalii Tymchyshyn 2013-01-14, 10:22
Well, if you set the split size to 1, you should get a per-line split.
-- Best regards, Vitalii Tymchyshyn
Re: Parallelism for small input data
Dipesh Kumar Singh 2013-01-15, 18:22
Thanks, Dmitriy and Vitalii!
I am able to control the number of mappers by setting the split size. And yes, there is no reason to re-read the dictionary, except that I was porting existing code. I will re-implement the UDF to read the dictionary once and check the performance.
Regards, Dipesh
-- Dipesh Kr. Singh