Pig >> mail # user >> Reducing pig operations in script


Cameron Gandevia 2011-11-02, 17:17
Re: Reducing pig operations in script
Hi Cameron,

Your script looks alright. Each of your steps processes the data in a
different way. Instead of cramming them together into a single statement
(possibly via some custom UDF), it makes sense to keep them as a series of
steps, as you have done, for better readability and debuggability. Are you
worried about performance? You need not be. As long as your operations
don't introduce an unnecessary map-reduce boundary (which your script
doesn't), you are good.
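
One way to check this yourself is Pig's EXPLAIN operator, which prints the
logical, physical, and map-reduce plans for an alias. The sketch below is
only an illustration: the LOAD statement (path, delimiter, and schema) is
an assumption, since the quoted script does not show how `logs` is loaded,
and the casting FOREACH is folded into that assumed schema. The remaining
steps are the ones from the quoted script.

```pig
-- Hypothetical input: the original post does not show how `logs` is loaded,
-- so the path, delimiter, and schema here are assumptions.
logs = LOAD 'logs.tsv' USING PigStorage('\t')
       AS (host:chararray, body:chararray);

-- Steps from the quoted script, with the same logic.
metricLogLine = FILTER logs BY body MATCHES '.*gr.perf.metrics.Category.*';
metricLogData = FOREACH metricLogLine GENERATE host,
    REGEX_EXTRACT_ALL(body,
        '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)')
    AS regex;
fltrdMetricLogData = FILTER metricLogData BY regex IS NOT NULL;
eventCategories = FOREACH fltrdMetricLogData GENERATE host,
    FLATTEN(regex) AS (category:chararray, event:chararray);

-- Print the logical, physical, and map-reduce plans for the final alias.
EXPLAIN eventCategories;
```

Since the script uses only FILTER and FOREACH, the map-reduce plan should
show a single job with no reduce phase. Operators such as GROUP, JOIN,
ORDER BY, or DISTINCT are the ones that would add a reduce step.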

Hope it helps,
Ashutosh

On Wed, Nov 2, 2011 at 10:17, Cameron Gandevia <[EMAIL PROTECTED]> wrote:

> Hey
>
> I am trying to extract performance metrics from some of my logs using Pig
> and have come up with the following. I feel like I might be performing one
> too many steps and was wondering if there is a way to reduce the number of
> FILTER/FOREACH operations I need to run. Still trying to learn the proper
> syntax.
>
> uniqLogs = FOREACH logs GENERATE host as host:CHARARRAY, body as
> body:CHARARRAY;
> metricLogLine = FILTER uniqLogs BY (body MATCHES
> '.*gr.perf.metrics.Category.*');
> metricLogData = FOREACH metricLogLine GENERATE host,
> REGEX_EXTRACT_ALL(body,
>
> '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)')
> AS regex;
> fltrdMetricLogData = FILTER metricLogData BY regex is not null;
> eventCategories = FOREACH fltrdMetricLogData GENERATE host, FLATTEN(regex)
> AS (category:CHARARRAY, event:CHARARRAY);
>
> Thanks
>
Dmitriy Ryaboy 2011-11-02, 20:06
Cameron Gandevia 2011-11-02, 20:14
Cameron Gandevia 2011-11-02, 20:45
Dmitriy Ryaboy 2011-11-03, 00:17