
Cameron Gandevia 2011-11-02, 17:17
Re: Reducing pig operations in script
Hi Cameron,

Your script looks alright. Each of your steps processes the data in a
different way. Instead of cramming them together into a single statement
(possibly via some custom UDF), it makes sense to keep them as a series of
steps, as you have done, for better readability and debuggability. Are you
worried about performance? You need not be. As long as your operations don't
introduce an unnecessary map-reduce boundary (which your script doesn't), you
are good.
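
To illustrate the point, here is a minimal sketch (relation and field names are made up, not from Cameron's script) of where a map-reduce boundary does and does not appear:

```pig
-- Chained FILTER/FOREACH steps like these are pipelined into a single
-- map phase; Pig does not launch a new map-reduce job for each operator.
logs    = LOAD 'logs' AS (host:chararray, body:chararray);
matched = FILTER logs BY body MATCHES '.*ERROR.*';
trimmed = FOREACH matched GENERATE host;

-- A GROUP (likewise JOIN, ORDER, or DISTINCT) is what introduces a
-- map-reduce boundary: the aggregation below runs in a reduce phase.
byHost  = GROUP trimmed BY host;
counts  = FOREACH byHost GENERATE group, COUNT(trimmed);
```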

Hope it helps,

On Wed, Nov 2, 2011 at 10:17, Cameron Gandevia <[EMAIL PROTECTED]> wrote:

> Hey
> I am trying to extract performance metrics from some of my logs using Pig
> and have come up with the following. I feel like I might be performing one
> too many steps and was wondering if there is a way to reduce the number of
> FILTER/FOREACH operations I need to run. Still trying to learn the proper
> syntax.
> uniqLogs = FOREACH logs GENERATE host as host:CHARARRAY, body as
> body:CHARARRAY;
> metricLogLine = FILTER uniqLogs BY (body MATCHES
> '.*gr.perf.metrics.Category.*');
> metricLogData = FOREACH metricLogLine GENERATE host, REGEX_EXTRACT_ALL(body,
> '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)')
> AS regex;
> fltrdMetricLogData = FILTER metricLogData BY regex is not null;
> eventCategories = FOREACH fltrdMetricLogData GENERATE host, FLATTEN(regex)
> AS (category:CHARARRAY, event:CHARARRAY);
> Thanks
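
As an aside on reducing the operator count: since rows whose body does not match the extraction pattern already yield a null regex and get dropped by the null filter, the preceding MATCHES filter could in principle be folded away. A hedged sketch, reusing the field names from the quoted script (performance should be equivalent either way, as all of these operators run in the same map phase):

```pig
-- Extract directly; non-matching rows simply produce regex = null.
metricLogData = FOREACH logs GENERATE host, REGEX_EXTRACT_ALL(body,
    '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)') AS regex;

-- Inline FILTER inside FOREACH drops the nulls and flattens in one step.
eventCategories = FOREACH (FILTER metricLogData BY regex IS NOT NULL)
    GENERATE host, FLATTEN(regex) AS (category:chararray, event:chararray);
```

Note that keeping the cheap MATCHES pre-filter can still be worthwhile if most rows are non-metrics, since it avoids running the more expensive capturing regex on every row.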