Parallelism for small input data
Hello users,

I have an input file (1.2 MB) that contains a list of words/phrases, one per
line. I read each phrase and pass it to a UDF that checks/corrects it. The UDF
(a simple EvalFunc subclass) reads a 6 MB dictionary file for each input phrase.

Since the input dataset is very small, Pig launches only one mapper (out of
150 available map slots) to process it, so no parallelism is gained here.
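For reference, a minimal sketch of the split-size knobs that appear relevant
(property names are assumptions based on the Pig/Hadoop documentation of this
era, and the 100 KB value is purely illustrative, not a recommendation):

===== sketch: forcing more map tasks =====
-- Pig combines small input splits into one by default; turn that off.
set pig.splitCombination false;
-- Cap the split size so the 1.2 MB file is carved into ~12 splits,
-- each handled by its own mapper.
set mapred.max.split.size 100000;
==========================================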

I would like some input/suggestions on how this kind of scenario is
efficiently implemented in Pig.

===== code snip =====
register 'Dudfs.jar';

-- UDF is initialized with the 6 MB dictionary file
define CorrectPhrases CorrectPhrases('/user/home/big.txt');

-- one phrase per line
input_term = load '/user/home/input.txt' using PigStorage('\n') as (phrase:chararray);

checked_term = foreach input_term generate phrase, CorrectPhrases(phrase) as correctedTerms;

store checked_term into '/user/home/corrected_phrases' using PigStorage(',');
=====================
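For completeness, a sketch of one possible workaround (untested, and it changes
the semantics slightly, since duplicate phrases collapse into one row): force a
shuffle so the UDF work is spread across multiple reducers instead of running
in a single mapper.

===== sketch: reduce-side parallelism =====
input_term = load '/user/home/input.txt' using PigStorage('\n') as (phrase:chararray);
-- Group by the phrase itself; PARALLEL sets the reducer count (20 is illustrative).
grouped = group input_term by phrase parallel 20;
-- This foreach runs in the reduce phase, so CorrectPhrases is evaluated across 20 reducers.
checked_term = foreach grouped generate group as phrase, CorrectPhrases(group) as correctedTerms;
store checked_term into '/user/home/corrected_phrases' using PigStorage(',');
===========================================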
Forgive me if I am heading in the wrong direction; feel free to correct me and
suggest alternatives.

Thanks in advance!
Regards,
Dipesh
--
Dipesh Kr. Singh