Re: multiple file storage with pig
Hi,

If you cannot find a UDF in piggybank or elsewhere that meets your
requirements, you can write your own UDF (for filtering, evaluation,
storage, etc.) or extend an existing one.

For example, to store output in multiple files you can use MultiStorage:
http://pig.apache.org/docs/r0.11.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
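
A minimal sketch of how that could look with the script quoted below. The
piggybank jar path and the output directory are assumptions, not taken from
your setup, and splitting on sensor_name is only one possible choice of key.
MultiStorage takes the parent output path and the index of the field to
split on, and writes one subdirectory per distinct value of that field:

```pig
-- Assumed jar location; adjust to wherever piggybank.jar lives on your cluster.
REGISTER '/path_to_jars/piggybank.jar';

-- 'cleaned' is the relation from the quoted script. Only scalar fields are
-- projected here so each output record is plain tab-delimited text.
flat = FOREACH cleaned GENERATE timestamp, sensor_name, sig_generator, sig_id;

-- Split on field index 1 (sensor_name).
-- '/data/output/by_sensor' is a hypothetical output directory.
STORE flat INTO '/data/output/by_sensor'
    USING org.apache.pig.piggybank.storage.MultiStorage('/data/output/by_sensor', '1');
```

To split on a different field, change the index ('0' would be timestamp in
this projection). If you need to split on a combination of fields, one
option is to build a single key field first and split on that.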

Cheers
2013/7/30 Pablo Nebrera <[EMAIL PROTECTED]>

> Hello
>
> I have this pig script:
>
> register '/path_to_jars/elephant-bird-pig-3.0.7.jar';
> register '/path_to_jars/json-simple-1.1.1.jar';
> register '/path_to_jars/redBorder-pig.jar';
>
> data = load '/data/events/2013/07/29/16h03/part-00001.gz' using
> com.twitter.elephantbird.pig.load.JsonLoader() as (json: map[]);
> cleaned = foreach data generate json#'timestamp'/3600*3600 as timestamp,
> (chararray) json#'sensor_name' as sensor_name, (int) json#'sig_generator'
> as sig_generator, (int) json#'sig_id' as sig_id, json as data;
> grouped = GROUP cleaned BY (timestamp, sensor_name, sig_generator, sig_id);
>
>
>
>
>
> The input file is a JSON file, something like:
>
> {"timestamp":1374820560, "sensor_id":2, "sensor_name":"sensor-produccion",
> "sig_generator":1, "sig_id":402, "rev":11, "priority":3,
> "classification":"Misc activity", "msg":"Snort Alert [1:402:11]",
> "payload":"XXXXXXXXX", "proto":"icmp", "proto_id":1, "src":3232287141,
> "src_str":"192.168.201.165", "src_name":"192.168.201.165", "src_net":"
> 0.0.0.0/0", "src_net_name":"0.0.0.0/0", "dst_name":"192.168.201.254",
> "dst_str":"192.168.201.254", "dst_net":"0.0.0.0/0", "dst_net_name":"
> 0.0.0.0/0", "src_country":"N/A", "dst_country":"N/A",
> "src_country_code":"N/A", "dst_country_code":"N/A", "srcport":0,
> "dst":3232287230, "dstport":0, "ethsrc":"0:25:90:56:91:2d",
> "ethdst":"6c:62:6d:42:46:c3", "ethlength":594, "vlan":201,
> "vlan_name":"interna", "vlan_priority":0, "vlan_drop":0, "ttl":64,
> "tos":192, "id":53186, "dgmlen":576, "iplen":65544, "icmptype":3,
> "icmpcode":3, "icmpid":0, "icmpseq":0}
> {"timestamp":1374820618, "sensor_id":2, "sensor_id_snort":0,
> "sensor_name":"sensor-produccion", "sig_generator":1, "sig_id":402,
> "rev":11, "priority":3, "classification":"Misc activity", "msg":"Snort
> Alert [1:402:11]", "payload":"XXXXX2", "proto":"icmp", "proto_id":1,
> "src":3232261121, "src_str":"192.168.100.1", "src_name":"192.168.100.1",
> "src_net":"0.0.0.0/0", "src_net_name":"0.0.0.0/0",
> "dst_name":"192.168.100.125", "dst_str":"192.168.100.125", "dst_net":"
> 0.0.0.0/0", "dst_net_name":"0.0.0.0/0", "src_country":"N/A",
> "dst_country":"N/A", "src_country_code":"N/A", "dst_country_code":"N/A",
> "srcport":0, "dst":3232261245, "dstport":0, "ethsrc":"6c:62:6d:42:46:c3",
> "ethdst":"0:1e:c9:ef:85:fd", "ethlength":105, "vlan":100,
> "vlan_name":"100", "vlan_priority":0, "vlan_drop":0, "ttl":64, "tos":192,
> "id":30974, "dgmlen":87, "iplen":89088, "icmptype":3, "icmpcode":3,
> "icmpid":0, "icmpseq":0}
>
>
> The output of describe for the grouped relation is:
>
> grunt> describe grouped
> 2013-07-30 10:11:58,834 [main] WARN  org.apache.pig.PigServer - Encountered
> Warning IMPLICIT_CAST_TO_INT 1 time(s).
> grouped: {group: (timestamp: int,sensor_name: chararray,sig_generator:
> int,sig_id: int),cleaned: {(timestamp: int,sensor_name:
> chararray,sig_generator: int,sig_id: int,data: map[])}}
>
>
>
> An example of the dump output is:
>
>
> ((1374818400,sensor-produccion,1,402),{(1374818400,sensor-produccion,1,402,[dst_country_code#N/A,rev#11,sig_id#402,proto_id#1,src_net_name#0.0.0.0/0,ethlength#105,payload#45003bd9d5400401117dc0a8647dc0a864141403502745aa4b2e1001000000d726564426f7264657244454c4c00101,dst#3232261245,dstport#0,timestamp#1374820435,sensor_id_snort#0,id#30968,vlan_name#100,tos#192,src_net#0.0.0.0/0,priority#3,src_name#192.168.100.1,dgmlen#87,ethsrc#6c:62:6d:42:46:c3,src#3232261121,icmpcode#3,src_str#192.168.100.1,srcport#0,sensor_id#2,dst_net#0.0.0.0/0,ttl#64,msg#SnortAlert [1:402:11],proto#icmp,vlan_priority#0,dst_country#N/A,dst_name#192.168.100.125,dst_net_name#