Pig >> mail # user >> How to optimize my request


35niavlys 2013-08-19, 14:49
Re: How to optimize my request
I have a couple of ideas that may help. I'm not familiar with your data, but
these techniques are worth trying.

First, this probably won't affect performance, but rather than having three
FILTER statements at the top of your script, you can use the SPLIT operator
to split your dataset into three datasets in one statement.
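As a sketch (reusing the relation and field names from your script, and assuming the bands you intended are [0, 1000), [1000, 2000), and [2000, inf) milliseconds), the three FILTERs become one SPLIT:

```pig
-- Sketch only: one SPLIT replacing the three FILTER statements.
-- Relation and field names come from the original script; the band
-- boundaries are my assumption about what was intended.
A = LOAD 'data' USING MyUDFLoader('data.xml');

SPLIT A INTO
    filter_response_time_less_than_1_s
        IF response_time < 1000.0,
    filter_response_time_between_1_s_and_2_s
        IF response_time >= 1000.0 AND response_time < 2000.0,
    filter_response_time_between_greater_than_2_s
        IF response_time >= 2000.0;
```

SPLIT reads A once and routes each record to every branch whose condition it satisfies, which is tidier than three separate FILTER statements, though Pig's multi-query optimizer will often execute the three FILTERs in a single pass over the data anyway.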

I'm not sure what purpose the COGROUP is serving, but this seems to be the
source of the bottleneck. One optimization technique you can try is to
GROUP your data first and then use nested FILTER statements to get your
counts.

For example, you have the following:

A = LOAD 'data' USING MyUDFLoader('data.xml');
filter_response_time_less_than_1_s = FILTER A BY (response_time < 1000.0);
filter_response_time_between_1_s_and_2_s = FILTER A BY (response_time > 1000.0 AND response_time < 1999.0);
filter_response_time_between_greater_than_2_s = FILTER A BY (response_time >= 2000.0);
star__zne_asfo_access_log = FOREACH ( COGROUP A BY (date_day,url,date_minute,ret_code,serveur),
        filter_response_time_between_greater_than_2_s BY (date_day,url,date_minute,ret_code,serveur),
        filter_response_time_less_than_1_s BY (date_day,url,date_minute,ret_code,serveur),
        filter_response_time_between_1_s_and_2_s BY (date_day,url,date_minute,ret_code,serveur) )
{
        GENERATE
                FLATTEN(group) AS (date_day,zne_asfo_url,date_minute,zne_http_code,zne_asfo_server),
                (long)SUM((bag{tuple(long)})A.response_time) AS response_time,
                COUNT(filter_response_time_less_than_1_s) AS response_time_less_than_1_s,
                COUNT(filter_response_time_between_1_s_and_2_s) AS response_time_between_1_s_and_2_s,
                COUNT(filter_response_time_between_greater_than_2_s) AS response_time_between_greater_than_2_s,
                COUNT(A) AS nb_hit;
};

This can possibly be changed to:

A = LOAD 'data' USING MyUDFLoader('data.xml');
star__zne_asfo_access_log = FOREACH (GROUP A BY (date_day,url,date_minute,ret_code,serveur)) {
        filter_response_time_less_than_1_s = FILTER A BY (response_time < 1000.0);
        -- note: using < 2000.0 here so the bands don't leave a gap at [1999, 2000)
        filter_response_time_between_1_s_and_2_s = FILTER A BY (response_time >= 1000.0 AND response_time < 2000.0);
        filter_response_time_between_greater_than_2_s = FILTER A BY (response_time >= 2000.0);
        GENERATE
                FLATTEN(group) AS (date_day,zne_asfo_url,date_minute,zne_http_code,zne_asfo_server),
                (long)SUM((bag{tuple(long)})A.response_time) AS response_time,
                COUNT(filter_response_time_less_than_1_s) AS response_time_less_than_1_s,
                COUNT(filter_response_time_between_1_s_and_2_s) AS response_time_between_1_s_and_2_s,
                COUNT(filter_response_time_between_greater_than_2_s) AS response_time_between_greater_than_2_s,
                COUNT(A) AS nb_hit;
};

I think in the COGROUP, your data has been duplicated three times (plus the
original), so you're joining four times the original size.
On Mon, Aug 19, 2013 at 10:49 AM, 35niavlys <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I want to execute a Pig command in an embedded Java program. For the moment,
> I am trying Pig in local mode. My data file size is around 15MB, but the
> execution of this command is very long, so I think my script needs
> optimizations...
>
> My script:
>
> A = LOAD 'data' USING MyUDFLoader('data.xml');
> filter_response_time_less_than_1_s = FILTER A BY (response_time < 1000.0);
> filter_response_time_between_1_s_and_2_s = FILTER A BY (response_time > 1000.0 AND response_time < 1999.0);
> filter_response_time_between_greater_than_2_s = FILTER A BY (response_time >= 2000.0);
> star__zne_asfo_access_log = FOREACH ( COGROUP A BY (date_day,url,date_minute,ret_code,serveur),
>         filter_response_time_between_greater_than_2_s BY (date_day,url,date_minute,ret_code,serveur),
>         filter_response_time_less_than_1_s BY (date_day,url,date_minute,ret_code,serveur),
>         filter_response_time_between_1_s_and_2_s BY (date_day,url,date_minute,ret_code,serveur) )
> {
>         GENERATE