Re: Leveraging Pig effectively
Prashant Kommireddi 2012-04-07, 10:22
53,000 records is a small dataset for a comparison between Hadoop and a database.
Hadoop is really advantageous when the datasets are huge (GBs, TBs, PBs). I
am guessing the 53,000 records amount to a few MBs, and in most cases an
indexed database will perform better on joins over such small
datasets. With Hadoop/Pig, there is always a cost associated with spawning
Map and Reduce tasks and reading the datasets. Your script should not
die with a larger dataset, though the database you are using might choke
(again, if your dataset is really huge).
Having said that, and looking at your process, it seems you could find
the matching/non-matching records with an outer join. Steps 1 and 3
could be solved with an outer join, and steps 4 and 5 with an inner join.
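To make the join idea concrete, here is a minimal sketch in plain Python (not Pig) of what the two passes compute. The field names (id, name, city), the sample records, and the helper functions are all made up for illustration. In Pig Latin the first pass would roughly correspond to a JOIN ... FULL OUTER followed by filters on which side is null, but treat this as a sketch of the logic only, not as your actual script.

```python
from collections import defaultdict

def index_by(records, key_fields):
    """Group records (dicts) by the tuple of values in key_fields."""
    index = defaultdict(list)
    for rec in records:
        index[tuple(rec[f] for f in key_fields)].append(rec)
    return index

def full_outer_split(left, right, key_fields):
    """Emulate a full outer join and split the result into three parts:
    matched pairs, left-only records, and right-only records."""
    left_idx = index_by(left, key_fields)
    right_idx = index_by(right, key_fields)
    matched, left_only, right_only = [], [], []
    for key, left_recs in left_idx.items():
        if key in right_idx:
            matched.extend((l, r) for l in left_recs for r in right_idx[key])
        else:
            left_only.extend(left_recs)
    for key, right_recs in right_idx.items():
        if key not in left_idx:
            right_only.extend(right_recs)
    return matched, left_only, right_only

# Hypothetical sample data standing in for the two dumps.
dump1 = [
    {"id": "1", "name": "a", "city": "x"},
    {"id": "2", "name": "b", "city": "y"},
    {"id": "3", "name": "c", "city": "z"},
]
dump2 = [
    {"id": "1", "name": "a", "city": "x"},
    {"id": "2", "name": "b", "city": "q"},
    {"id": "4", "name": "d", "city": "w"},
]

# Steps 1 and 3: match on all fields and keep the leftovers from each side.
matched_all, rest1, rest2 = full_outer_split(dump1, dump2, ["id", "name", "city"])

# Steps 4 and 5: match the leftovers again with one field ("city") excluded.
# Only the matched part is needed here, which is what an inner join returns.
matched_relaxed, _, _ = full_outer_split(rest1, rest2, ["id", "name"])
```

The point of the sketch is that one outer join yields both the matches and the remainders in a single pass, so the separate cogroup-and-split step for the leftovers is not needed.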
On Sat, Apr 7, 2012 at 2:38 AM, Sarath <[EMAIL PROTECTED]> wrote:
> Dear All,
> I have 2 data dumps (comma separated), each with around 53,000 records
> (just sample data; it could be 10 times more than this in real time).
> I need to write a script to -
> 1. find matching records from these 2 dumps based on a set of matching
> 2. store matching records from each dump into database
> 3. find the remaining records from each dump
> 4. find matching records by excluding one of the matching fields
> 5. again store matching records from each dump into database
> For step 1 I used "cogroup"
> For step 3 I split "cogroup" with nulls for dumps 2 & 1 respectively to
> get the remaining records for dumps 1 & 2
> For step 2 & 4 I used DBStorage UDF to store the records into DB. With
> this approach I get 4 store commands (2 commands for each dump at steps 2 &
> Before storing to DB I'm using another UDF to generate a running sequence
> number which will be stored as key for each record being stored.
> ====> The script for this entire process is creating 6 map-reduce jobs and
> taking about 10mins to complete on a cluster of 3 machines (1 master and 2
> The same requirement, when done using a stored procedure, completes in 5
> mins. Now I'm worried that my script could die in real-time environments.
> Requesting to suggest -
> -> What am I missing?
> -> What more can I do to improve the performance so that it is comparable
> to the stored procedure?
> -> What changes and/or additions are needed so that the script scales
> to any amount of data?
> Thanks in advance,