Pig >> mail # user >> Iterating over data set


Xuri Nagarin 2013-07-29, 01:34
Jacob Perkins 2013-07-29, 13:05
Xuri Nagarin 2013-07-30, 06:30
Re: Iterating over data set
Xuri,

I don't think you can use functions in the load statement like that. To do something like that you'd need to write your own LoadFunc. As far as I can tell at a glance, and I haven't used Pig 0.11 much, the new DateTime functions are eval functions. That means they only operate on tuples during execution (map-reduce or whatever emulates map-reduce in local mode) and _after_ the input location has been resolved.

--jacob
@thedatachef
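A lighter-weight alternative to writing a custom LoadFunc is Pig's parameter substitution: compute the dated path in the shell and pass it in with `-param`, so the LOAD statement never needs an eval function at all. A minimal sketch, where the `$INPUT` parameter name and `script.pig` are made-up names for illustration:

```shell
# Build the dated input path in the shell, outside Pig.
INPUT="/path/to/logs/$(date -u +%Y-%m-%d -d '3 days ago')"
echo "$INPUT"

# The Pig script would then load it via the substituted parameter:
#   A = LOAD '$INPUT' USING PigStorage();
# invoked as:
#   pig -param INPUT="$INPUT" script.pig
```

`-param_file` works the same way when there are several such values to pass.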
On Jul 30, 2013, at 1:30 AM, Xuri Nagarin wrote:

> Thanks Jacob.
>
> I threw in a little bash shell hack to make it simpler. Before I run the
> pig script, I run a bash script that stores a day's worth of timestamps
> at ten-minute intervals:
>
> *Shell*
>
> # 10 days ago because I get data 10 days late :-)
> dt1=`date -u +%Y-%m-%dT00:00:00.000Z -d "10 days ago"`
>
>
> for ((i=0; i<1440; i+=10)); do
>     date -u "+%Y-%m-%dT%H:%M:%S.000Z" -d "$dt1 +$i mins"
> done
>
> Gives me:
> .
> .
> 2013-07-20T22:00:00.000Z
> 2013-07-20T22:10:00.000Z
> 2013-07-20T22:20:00.000Z
> 2013-07-20T22:30:00.000Z
> .
> .
>
> If the file above is generated with that date as its filename, then I call
> it in my pig script as:
>
> %declare filepath `date -u +%Y-%m-%d -d "10 days ago"`;
> A1 = LOAD '$filepath.ts' USING PigStorage() AS (dt:datetime);
>
> Now, I can iterate over it:
>
> B = FOREACH A1 {
>
>    C = FILTER A BY timestamp > dt AND
>        timestamp < AddDuration(dt, 'PT20M') ;
>    .
> do something()
> }
>
> What I want to do is not use the bash command and instead use Pig's
> datetime functions. Unfortunately, I am stuck in syntactical hell.
>
> A = LOAD
> '/path/to/logs/ToDate(SubtractDuration(CurrentTime(),'P3D'),'yyyy-MM-dd')'
> USING PigStorage();
>
> yields:
> "2013-07-29 23:28:05,565 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1200: <line 1, column 66> mismatched input 'P3D' expecting SEMI_COLON"
>
> I have tried various combinations of enclosing the date calculation
> functions in single quotes, brackets etc but can't seem to get anything to
> work :(
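The inline call fails because Pig's preprocessor never evaluates eval functions inside a quoted path; only parameter substitution runs before parsing. The `%declare` line already in the script shows the working pattern: put the date arithmetic in the backquoted shell command instead. A sketch of the shell equivalent of `ToDate(SubtractDuration(CurrentTime(),'P3D'),'yyyy-MM-dd')`, assuming GNU `date`:

```shell
# Current UTC date minus three days, formatted yyyy-MM-dd --
# the same value the failing Pig expression is after.
date -u +%Y-%m-%d -d "3 days ago"

# In the Pig script this would live in a %declare, e.g.:
#   %declare dt `date -u +%Y-%m-%d -d "3 days ago"`;
#   A = LOAD '/path/to/logs/$dt' USING PigStorage();
```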
>
>
>
>
> On Mon, Jul 29, 2013 at 6:05 AM, Jacob Perkins <[EMAIL PROTECTED]> wrote:
>
>> Hi Xuri,
>>
>> This illustrates the use case for a UDF I've had to implement in one form
>> or another called 'FilterBag'. It's essentially just Pig's builtin "FILTER"
>> but would work like so (using your pseudocode):
>>
>>
>> A = load 'input' as (timestamp, worker, output);
>>
>> --
>> -- Assuming you want to restrict each calculation to a day. ToDay is
>> -- most likely going to be Piggybank's ISOToDay truncation udf
>> --
>> with_day = foreach A generate timestamp, ToDay(timestamp) as day,
>>            worker, output;
>>
>> --
>> -- First you'll have to get all output for a given worker on a given day
>> into single bag
>> --
>> worker_output = foreach (group with_day by (worker, day)) {
>>     -- this relation (worker_output) will have one tuple per
>>     -- unique worker, day, and timestamp
>>     timestamps = distinct with_day.timestamp;
>>     generate
>>       flatten(group)               as (worker, day),
>>       flatten(timestamps)          as t1,
>>       -- a bag that contains all of this worker's output and
>>       -- their timestamps for this day
>>       with_day.(timestamp, output) as outputs;
>> };
>>
>> --
>> -- Next, filter each "outputs" bag to contain only outputs that occurred
>> -- within a 10 minute (or whatever time unit of interest) window from the
>> -- timestamp, looking forward (whether you look forward, back, or both is
>> -- up to you)
>> --
>> windowed = foreach worker_output {
>>     -- FilterBag(bag, field_num, comparison_string, to_compare)
>>     -- bag: bag to filter
>>     -- field_num: 0 indexed field num of the tuples in the bag
>>     --            to use for comparison to "to_compare"
>>     -- comparison_string
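The quoted message is cut off by the archive at this point, but the comparison FilterBag applies to each "outputs" bag is just a window test. Below is a stand-in in awk, with made-up epoch-second data and a 600-second forward window; FilterBag itself is Jacob's custom UDF and is not shown in the thread:

```shell
# Keep tuples whose first field falls in [t1, t1 + win),
# i.e. a forward-looking window starting at reference time t1.
printf '%s\n' '990 a' '1000 b' '1300 c' '1700 d' |
    awk -v t1=1000 -v win=600 '$1 >= t1 && $1 < t1 + win'
# prints:
#   1000 b
#   1300 c
```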
Amit 2013-07-30, 13:18
Xuri Nagarin 2013-07-30, 20:54