Pig, mail # user - Reducing pig operations in script


Re: Reducing pig operations in script
Ashutosh Chauhan 2011-11-02, 17:56
Hi Cameron,

Your script looks alright. Each of your steps processes the data in a different
way. Instead of cramming them together into a single statement (possibly via
some custom UDF), it makes sense to keep them as a series of steps, as you
have done, for better readability and debuggability. Are you worried about
performance? You need not be. As long as your operations don't introduce an
unnecessary map-reduce boundary (which your script doesn't), you are good.
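
If you want to check this yourself, EXPLAIN prints the logical, physical and
map-reduce plans Pig builds for an alias. Here is a minimal sketch. The LOAD
path and schema are assumptions for illustration (your mail does not show how
logs is loaded), and hosts, byHost and counts are hypothetical aliases:

```pig
-- Hypothetical input: path and schema are assumed, not taken from the original script.
logs = LOAD 'input/logs' AS (host:chararray, body:chararray);

-- FILTER and FOREACH are row-at-a-time operators, so they can run in the map phase.
metricLogLine = FILTER logs BY body MATCHES '.*gr.perf.metrics.Category.*';
hosts = FOREACH metricLogLine GENERATE host;

-- The map-reduce plan for this alias should show a map stage with no reduce stage.
EXPLAIN hosts;

-- For contrast: GROUP needs a shuffle, so it does introduce a map-reduce boundary.
byHost = GROUP hosts BY host;
counts = FOREACH byHost GENERATE group, COUNT(hosts);

-- The map-reduce plan for this alias should now include a reduce stage.
EXPLAIN counts;
```

Running EXPLAIN on the last alias of your own script (eventCategories) should
look like the first case, since the script only uses FILTER and FOREACH.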

Hope it helps,
Ashutosh

On Wed, Nov 2, 2011 at 10:17, Cameron Gandevia <[EMAIL PROTECTED]> wrote:

> Hey
>
> I am trying to extract performance metrics from some of my logs using Pig
> and have come up with the following. I feel like I might be performing one
> too many steps and was wondering if there is a way to reduce the number of
> FILTER/FOREACH operations I need to run. Still trying to learn the proper
> syntax.
>
> uniqLogs = FOREACH logs GENERATE host AS host:CHARARRAY, body AS body:CHARARRAY;
> metricLogLine = FILTER uniqLogs BY (body MATCHES '.*gr.perf.metrics.Category.*');
> metricLogData = FOREACH metricLogLine GENERATE host,
>     REGEX_EXTRACT_ALL(body, '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)') AS regex;
> fltrdMetricLogData = FILTER metricLogData BY regex IS NOT NULL;
> eventCategories = FOREACH fltrdMetricLogData GENERATE host,
>     FLATTEN(regex) AS (category:CHARARRAY, event:CHARARRAY);
>
> Thanks
>