Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
Pig >> mail # user >> Parallelism for small input data


Dipesh Kumar Singh 2013-01-13, 12:47
Dmitriy Ryaboy 2013-01-13, 22:54
Re: Parallelism for small input data
Well, if you set the split size to 1, you should get one split per line.
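
A rough sketch of what that could look like in the script, placed before the load statement. This assumes the Hadoop 1.x property name mapred.max.split.size and Pig's pig.splitCombination setting; the 8192 value is only an illustration (about 1.2 MB spread over 150 slots), not something from the original post, so check both names and the value against your own versions:

=====code snip===>
-- Assumption: stop Pig from merging small splits back into a single map task.
set pig.splitCombination false;
-- Assumption: cap each split at about 8 KB (hypothetical value), which should
-- give on the order of 150 map tasks for a 1.2 MB input. Use 1 for per-line splits.
set mapred.max.split.size 8192;
==================================>

Note that every map task would then load the 6 MB dictionary itself, so very small splits may cost more in startup than they gain in parallelism.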
2013/1/13 Dipesh Kumar Singh <[EMAIL PROTECTED]>

> Hello users,
>
> I have an input file (1.2 MB) that contains a list of words/phrases, one
> per line. I read each phrase and pass it to a UDF that checks/corrects
> that phrase.
> The UDF (it simply extends EvalFunc) reads a 6 MB dictionary file for
> each input phrase.
>
> Since the input dataset is very small, Pig launches only one mapper (out
> of 150 slots) to process the input, so no parallelism is gained here.
>
> I would like some input/suggestions on how these kinds of scenarios
> are efficiently implemented in Pig.
>
> =====code snip===>
> register 'Dudfs.jar';
> define CorrectPhrases CorrectPhrases('/user/home/big.txt');
> input_term = load '/user/home/input.txt' using PigStorage('\n') as (phrase:chararray);
> checked_term = foreach input_term generate phrase, CorrectPhrases(phrase) as correctedTerms;
> store checked_term into '/user/home/corrected_phrases' using PigStorage(',');
>
> ==================================>
> Forgive me if I am heading in the wrong direction; feel free to correct me
> and suggest your own approach.
>
> Thanks in advance!
>
>
> Regards,
> Dipesh
> --
> Dipesh Kr. Singh
>

--
Best regards,
 Vitalii Tymchyshyn
Dipesh Kumar Singh 2013-01-15, 18:22