Pig, mail # user - Pig efficiency
Re: Pig efficiency
Thomas Foricher 2013-04-30, 21:25
That's what I thought.

Thank you Gerrit!

Regards.

Thomas

On 01/05/2013 06:04, Gerrit Jansen van Vuuren wrote:
> Hadoop is made for data starting at gigabytes.
> If you've got less than 1-5 GB of data you can always do it faster on a
> single machine; the use case for Hadoop is when you have 40 GB or 200 GB,
> N TB, etc.
>
> Remember that sending a job for distributed processing has a certain
> amount of overhead that you do not have when processing locally; this is
> why, for 250 MB, you will always see faster processing times locally.
>
>
> Regards,
>   Gerrit
>
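Gerrit's overhead point can be sketched with a toy cost model in Python. All constants below (60 s job-startup overhead, 0.05 GB/s per-node scan rate, 10 nodes) are made-up illustrative assumptions, not measurements from any real cluster:

```python
# Toy model: local processing pays no startup cost; a cluster pays a fixed
# job-submission overhead but scans in parallel. All numbers are illustrative.

def local_time(gb, rate_gb_s=0.05):
    # single machine: pure scan time, no startup cost
    return gb / rate_gb_s

def cluster_time(gb, nodes=10, rate_gb_s=0.05, overhead_s=60):
    # fixed submission/startup overhead plus a parallel scan across nodes
    return overhead_s + gb / (nodes * rate_gb_s)

for gb in (0.25, 5, 50, 500):
    print(f"{gb:>6} GB  local {local_time(gb):8.1f}s  cluster {cluster_time(gb):8.1f}s")
# With these constants the crossover is around 3-4 GB: below that the
# single machine wins, above it the cluster wins.
```

Under this model a 250 MB job is dominated by the fixed overhead, which is consistent with the rule of thumb in the message above.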
> On Tue, Apr 30, 2013 at 4:09 PM, Thomas Foricher
> <[EMAIL PROTECTED]> wrote:
>
>> Hello everybody,
>>
>> I'm very new to Pig and Hadoop, and I'm trying to do some efficiency
>> tests between different technologies. For that I'm using a movie
>> database found here:
>>
>> http://www.grouplens.org/taxonomy/term/14
>>
>> It's a database of 10 million entries, weighing about 250 MB. I know it
>> is certainly not enough to see the real power of MapReduce, but I
>> thought it could give an idea.
>>
>> So the data are formatted as below:
>>
>> movies.dat
>>
>>      movieid::title::genres
>>
>> ratings.dat
>>
>>      userid::movieid::rating::timestamp
>>
>> 1. MySQL
>>
>> My first test in MySQL was to join the two tables on movieid, group
>> them by "genres", then compute the average rating and take the top 20:
>>
>> SELECT genres, AVG(a.rating)
>> FROM ratings a
>> JOIN movies b ON a.movieid = b.movieid
>> GROUP BY genres
>> ORDER BY AVG(a.rating) DESC
>> LIMIT 20;
>>
>> It took about two minutes to get the results.
>>
>> 2. Python/Pandas
>>
>> I've done the same with Python and the pandas library; it took about 4
>> seconds to dump the results.
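The pandas code itself isn't included in the thread. As a stand-in, here is a plain-Python (stdlib-only, deliberately not pandas) sketch of the same join → group-by → average → top-20 pipeline, using a tiny made-up inline sample in the MovieLens `::` format:

```python
# Sketch of the join/group/average/top-20 analysis on a tiny inline sample.
# The sample rows are illustrative only, not real MovieLens data.
from collections import defaultdict
import heapq

movies_dat = ["1::Toy Story (1995)::Animation", "2::Heat (1995)::Crime"]
ratings_dat = ["10::1::4.0::978300760", "11::1::5.0::978300761",
               "12::2::3.0::978300762"]

# join key: movieid -> genres
genre_of = {}
for line in movies_dat:
    movieid, title, genres = line.split("::")
    genre_of[movieid] = genres

# group by genres, accumulating [sum of ratings, count]
sums = defaultdict(lambda: [0.0, 0])
for line in ratings_dat:
    userid, movieid, rating, ts = line.split("::")
    acc = sums[genre_of[movieid]]
    acc[0] += float(rating)
    acc[1] += 1

# average per genre, then top 20 by average
avgs = {g: s / n for g, (s, n) in sums.items()}
top20 = heapq.nlargest(20, avgs.items(), key=lambda kv: kv[1])
print(top20)  # [('Animation', 4.5), ('Crime', 3.0)]
```

This is the in-memory single-pass computation that both pandas and the Pig script below express, which is why the fixed MapReduce overhead dominates at this data size.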
>>
>> 3. Hadoop/Pig
>>
>> I ran an Ubuntu 12.04 64-bit server release on VMware Player, and
>> installed Hadoop as a pseudo-distributed single-node cluster, plus Pig.
>>
>> Here is my complete code to get the same results:
>>
>> I had to clean the files first, since Pig's default loader (PigStorage)
>> only accepts a single-character separator:
>>
>> movies = LOAD 'movie/movies.dat' AS (line:chararray);
>> moviesclean = FOREACH movies GENERATE FLATTEN(STRSPLIT(line, '::'));
>> STORE moviesclean INTO 'movies';
>>
>> ratings = LOAD 'movie/ratings.dat' AS (line:chararray);
>> ratingsclean = FOREACH ratings GENERATE FLATTEN(STRSPLIT(line, '::'));
>> STORE ratingsclean INTO 'ratings';
>>
>>
>> movies = LOAD 'movies' as (movieid:int, title:chararray, genres:chararray);
>> ratings = LOAD 'ratings' as (userid:int, movieid:int, rating:float,
>> timestamp:chararray);
>>
>> X = JOIN ratings BY movieid, movies BY movieid;
>> Y = GROUP X BY genres;
>> Z = FOREACH Y GENERATE group, AVG(X.rating);
>> STORE Z INTO 'results';
>>
>> -- It took about 10 minutes to store the Z file
>>
>> Z = LOAD 'results' AS (genres:chararray, avgrating:float);
>> Zsorted = ORDER Z BY avgrating DESC;
>> results = LIMIT Zsorted 20;
>> dump results;
>>
>> -- And here about 4 minutes
>>
>> So in total it takes about 15 minutes to do the same analysis. Do you
>> know if this is normal?
>> Is it because Hadoop's power is only revealed from terabytes of data?
>>
>> Thanks
>>
>> Regards.
>>
>> Thomas
>>
>>
>>
>>