Pig >> mail # user >> Iterating over data set


Xuri Nagarin 2013-07-29, 01:34
Jacob Perkins 2013-07-29, 13:05
Xuri Nagarin 2013-07-30, 06:30
Re: Iterating over data set
Xuri,

I don't think you can use functions in the load statement like that. To do something like that you'd need to write your own LoadFunc. As far as I can tell at a glance, and I haven't used Pig 0.11 much, the new DateTime functions are eval functions. That means they only operate on tuples during execution (map-reduce or whatever emulates map-reduce in local mode) and _after_ the input location has been resolved.
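To make the distinction concrete, here is a minimal sketch (relation names, paths, and field names are made up) of where eval functions can and cannot appear:

```pig
-- INVALID: the path in a LOAD statement is a plain quoted string; the
-- parser never evaluates eval functions inside it
-- A = LOAD '/logs/ToDate(CurrentTime(), ...)' USING PigStorage();

-- VALID: eval functions operate on fields of tuples after loading
A = LOAD '/logs/input' USING PigStorage() AS (ts:chararray, msg:chararray);
B = FOREACH A GENERATE ToDate(ts, 'yyyy-MM-dd HH:mm:ss') AS dt, msg;
C = FILTER B BY dt > SubtractDuration(CurrentTime(), 'P3D');
```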

--jacob
@thedatachef
On Jul 30, 2013, at 1:30 AM, Xuri Nagarin wrote:

> Thanks Jacob.
>
> I threw in a little bash shell hack to make it simpler. Before I run the
> pig script, I run a bash script that stores a timestamp for every 10
> minutes in a day:
>
> *Shell*
>
> # 10 days ago because I get data 10 days late :-)
> dt1=$(date -u +%Y-%m-%dT00:00:00.000Z -d "10 days ago")
>
> # one timestamp every 10 minutes for the whole day
> for ((i=0; i<1440; i+=10)); do
>     date -u "+%Y-%m-%dT%H:%M:%S.000Z" -d "$dt1 +$i mins"
> done
>
> Gives me:
> .
> .
> 2013-07-20T22:00:00.000Z
> 2013-07-20T22:10:00.000Z
> 2013-07-20T22:20:00.000Z
> 2013-07-20T22:30:00.000Z
> .
> .
>
> If the output above is written to a file named after the date, I can load
> it in my pig script as:
>
> %declare filepath `date -u +%Y-%m-%d -d "10 days ago"`;
> A1 = LOAD '$filepath.ts' USING PigStorage() AS (dt:datetime);
>
> Now, I can iterate over it:
>
> B = FOREACH A1 {
>     -- pseudocode: restrict the log relation to the 20-minute window
>     -- starting at dt
>     C = FILTER A BY timestamp > dt AND
>         timestamp < AddDuration(ToDate(dt), 'PT20M');
>     .
>     do something()
> }
>
> What I want to do is not use the bash command and instead use Pig's
> datetime functions. Unfortunately, I am stuck in syntactical hell.
>
> A = LOAD
> '/path/to/logs/ToDate(SubtractDuration(CurrentTime(),'P3D'),'yyyy-MM-dd')'
> USING PigStorage();
>
> yields:
> "2013-07-29 23:28:05,565 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1200: <line 1, column 66> mismatched input 'P3D' expecting SEMI_COLON"
>
> I have tried various combinations of enclosing the date calculation
> functions in single quotes, brackets etc but can't seem to get anything to
> work :(
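Since the `%declare` backtick command is run by Pig's preprocessor before the script is even parsed, while eval functions like ToDate only run on tuples at execution time, the path has to be computed outside Pig. A sketch of a workable pattern, consistent with the `%declare` approach the poster already uses (path names assumed):

```pig
-- the backtick command runs in the shell at preprocess time, so by the
-- time LOAD is parsed, $day is just a plain string
%declare day `date -u +%Y-%m-%d -d "3 days ago"`;
A = LOAD '/path/to/logs/$day' USING PigStorage();

-- eval functions (ToDate, SubtractDuration, ...) can then be applied to
-- fields in FOREACH/FILTER, but never inside the LOAD path itself
```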
>
>
>
>
> On Mon, Jul 29, 2013 at 6:05 AM, Jacob Perkins <[EMAIL PROTECTED]> wrote:
>
>> Hi Xuri,
>>
>> This illustrates the use case for a UDF I've had to implement in one form
>> or another called 'FilterBag'. It's essentially just Pig's builtin "FILTER"
>> but would work like so (using your pseudocode):
>>
>>
>> A = load 'input' as (timestamp, worker, output);
>>
>> --
>> -- Assuming you want to restrict each calculation to a day. ToDay is most
>> -- likely going to be Piggybank's ISOToDay truncation udf
>> --
>> with_day = foreach A generate timestamp, ToDay(timestamp) as day, worker,
>>            output;
>>
>> --
>> -- First you'll have to get all output for a given worker on a given day
>> -- into a single bag
>> --
>> worker_output = foreach (group with_day by (worker, day)) {
>>     -- this relation (worker_output) will have one tuple per unique
>>     -- worker, day, and timestamp
>>     timestamps = distinct with_day.timestamp;
>>     generate
>>       flatten(group)               as (worker, day),
>>       flatten(timestamps)          as t1,
>>       -- a bag that contains all of this worker's output and their
>>       -- timestamps for this day
>>       with_day.(timestamp, output) as outputs;
>> };
>>
>> --
>> -- Next, filter each "outputs" bag to contain only outputs that occurred
>> -- within a 10 minute (or whatever time unit of interest) window from the
>> -- timestamp, looking forward (whether you look forward, back, or both is
>> -- up to you)
>> --
>> windowed = foreach worker_output {
>>     -- FilterBag(bag, field_num, comparison_string, to_compare)
>>     -- bag: bag to filter
>>     -- field_num: 0 indexed field num of the tuples in the bag to use
>>     --            for comparison to "to_compare"
>>     -- comparison_string
Amit 2013-07-30, 13:18
Xuri Nagarin 2013-07-30, 20:54