Reducing pig operations in script
Cameron Gandevia 2011-11-02, 17:17
Hey
I am trying to extract performance metrics from some of my logs using Pig and have come up with the following. I feel like I might be performing one too many steps and was wondering if there is a way to reduce the number of FILTER/FOREACH operations I need to run. Still trying to learn the proper syntax.
uniqLogs = FOREACH logs GENERATE host as host:CHARARRAY, body as body:CHARARRAY;
metricLogLine = FILTER uniqLogs BY (body MATCHES '.*gr.perf.metrics.Category.*');
metricLogData = FOREACH metricLogLine GENERATE host, REGEX_EXTRACT_ALL(body, '.*gr.perf.metrics.Category\\s*\\-\\s*([A-Za-z\\.\\_]+)\\s+([A-Za-z\\_\\.]+)') AS regex;
fltrdMetricLogData = FILTER metricLogData BY regex is not null;
eventCategories = FOREACH fltrdMetricLogData GENERATE host, FLATTEN(regex) AS (category:CHARARRAY, event:CHARARRAY);
Thanks
Re: Reducing pig operations in script
Ashutosh Chauhan 2011-11-02, 17:56
Hi Cameron,
Your script looks fine. Each of your steps processes the data in a different way. Rather than cramming them together into a single statement (possibly via a custom UDF), it makes sense to keep them as a series of steps, as you have done, for better readability and debuggability. Are you worried about performance? You need not be. As long as your operations don't introduce an unnecessary map-reduce boundary (which your script doesn't), you are good.
Hope it helps,
Ashutosh
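To illustrate the "map-reduce boundary" mentioned above, here is a minimal sketch of a step that would add one. The file name and the aliases nonEmpty, grouped, and eventCounts are assumptions for illustration only and are not part of the original script. The input is assumed to be the stored output of eventCategories.

```pig
-- Hypothetical input: the output of the original script, stored as tab-separated text.
eventCategories = LOAD 'event_categories.tsv'
    AS (host:chararray, category:chararray, event:chararray);

-- FILTER and FOREACH are non-blocking: they run record by record inside the map phase.
nonEmpty = FILTER eventCategories BY category IS NOT NULL;

-- GROUP is a blocking operator: it needs a shuffle, so a reduce phase starts here.
grouped = GROUP nonEmpty BY (category, event);
eventCounts = FOREACH grouped GENERATE
    FLATTEN(group) AS (category, event),
    COUNT(nonEmpty) AS occurrences;

-- EXPLAIN prints the logical, physical, and map-reduce plans for the alias.
EXPLAIN eventCounts;
```

The GROUP here is a necessary boundary because counting requires it. The original script contains only FILTER and FOREACH steps, so it should stay within a single map phase.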
Re: Reducing pig operations in script
Dmitriy Ryaboy 2011-11-02, 20:06
Just to be explicit:
This:
x = FILTER something by num1 > 10 AND num2 < 12;
is equivalent to this:
x = FILTER something by num1 > 10;
x = FILTER x by num2 < 12;
All non-blocking operators are evaluated in a streaming fashion, so you don't need to worry about combining them into a single operator.
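One way to check this on your own installation is to compare the plans Pig generates for both forms. The sketch below assumes a hypothetical tab-separated input file with two integer columns. The aliases combined, step1, and chained are illustrative names, not taken from the thread.

```pig
-- Hypothetical input with two integer columns.
something = LOAD 'numbers.tsv' AS (num1:int, num2:int);

-- One combined predicate.
combined = FILTER something BY num1 > 10 AND num2 < 12;

-- The same predicate written as two chained filters.
step1 = FILTER something BY num1 > 10;
chained = FILTER step1 BY num2 < 12;

-- Print the logical, physical, and map-reduce plans for each form.
EXPLAIN combined;
EXPLAIN chained;
```

Both plans should describe a single map-only job. The exact plan text depends on the Pig version and on which optimizer rules are enabled.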
Re: Reducing pig operations in script
Cameron Gandevia 2011-11-02, 20:14
Cool thanks
-- Thanks
Cameron Gandevia
Re: Reducing pig operations in script
Cameron Gandevia 2011-11-02, 20:45
In the Pig documentation there is a section titled "Reduce your operator pipeline" which talks about combining FOREACH statements as an optimization. It also says you should do the same for FILTER statements. Is this incorrect?
-- Thanks
Cameron Gandevia
Re: Reducing pig operations in script
Dmitriy Ryaboy 2011-11-03, 00:17
Let's just say it's overly optimistic w.r.t. what actually takes time in a pig job.
D