Pig >> mail # user >> Some optimization advices


Some optimization advices
Hi There,

I am a beginner. I got something working, but I suspect it could be
done better. Let me explain.
(Pig 0.10)

DESCRIBE on my data gives:

  xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat: chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}

and DUMP gives:

((100312088,45.2745669,-12.7776222,{(created_by,JOSM)}))
((100948454,45.2620946,-12.7849171,))
((100948519,45.2356985,-12.7707014,{(created_by,JOSM)}))
((704398904,45.2416667,-13.0058333,{(lat,-13.00583333),(lon,45.24166667)}))
((1230941976,45.0743117,-12.6888807,{(place,village)}))
((1230941977,45.0832807,-12.6810328,{(name,Mtsahara)}))
((1976927219,45.2272263,-12.7794359,))
((1751057677,45.2216163,-12.7825896,{(amenity,fast_food),(name,Brochetterie)}))
((1751057678,45.2216953,-12.7829678,{(amenity,fast_food),(name,Brochetterie)}))
((100948360,45.2338541,-12.7762230,{(amenity,ferry_terminal)}))
((362795028,45.2086809,-12.8062991,{(amenity,fuel),(operator,Total)}))

I want to extract the records that have a given value in the
tag_attr_k field. For example: give me the records that have a tag
with tag_attr_k = amenity. That should be:

(100948360,-12.7762230,45.2338541,{(amenity,ferry_terminal)})
(362795028,-12.8062991,45.2086809,{(operator,Total),(amenity,fuel)})
(1751057677,-12.7825896,45.2216163,{(amenity,fast_food),(name,Brochetterie)})
(1751057678,-12.7829678,45.2216953,{(amenity,fast_food),(name,Brochetterie)})

So the output shape is (node_attr_id, node_attr_lat, node_attr_lon,
{(tag_attr_k, tag_attr_v)...(tag_attr_k, tag_attr_v)})

I ended up with this script.
...
XmlTag = FOREACH xmlToTuple GENERATE FLATTEN($0); -- strip the enclosing top-level tuple
-- flatten the bag of tags: one row per (id, lon, lat, key, value)
XmlTag2 = FOREACH XmlTag GENERATE $0 AS id, $1 AS lon, $2 AS lat, FLATTEN(tag) AS (key, value);
-- keep only the rows that carry an amenity tag
XmlTag3 = FILTER XmlTag2 BY key == 'amenity';
-- join back to recover all tags of the records that have an amenity tag
XmlTag4 = JOIN XmlTag3 BY id, XmlTag2 BY id;
-- drop the redundant join columns; $8 and $9 are XmlTag2::key and XmlTag2::value
XmlTag7 = FOREACH XmlTag4 GENERATE $0 AS id, $1 AS lon, $2 AS lat, $8 AS key, $9 AS value;
-- regroup the rows of each record
XmlTag5 = GROUP XmlTag7 BY (id, lat, lon);
-- rebuild records as id, lat, lon, {(key,value)...(key,value)}
XmlTag8 = FOREACH XmlTag5 {
    tag = FOREACH XmlTag7 GENERATE key, value;
    GENERATE group.id, group.lat, group.lon, tag;
};

The schemas of these relations are:

xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat: chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
XmlTag: {null::node_attr_id: int,null::node_attr_lon: chararray,null::node_attr_lat: chararray,null::tag: {(tag_attr_k: chararray,tag_attr_v: chararray)}}
XmlTag2: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
XmlTag3: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
XmlTag4: {XmlTag3::id: int,XmlTag3::lon: chararray,XmlTag3::lat: chararray,XmlTag3::key: chararray,XmlTag3::value: chararray,XmlTag2::id: int,XmlTag2::lon: chararray,XmlTag2::lat: chararray,XmlTag2::key: chararray,XmlTag2::value: chararray}
XmlTag7: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
XmlTag5: {group: (id: int,lat: chararray,lon: chararray),XmlTag7: {(id: int,lon: chararray,lat: chararray,key: chararray,value: chararray)}}
XmlTag8: {id: int,lat: chararray,lon: chararray,tag: {(key: chararray,value: chararray)}}
I guess this is not very straightforward and could be optimized
considerably. Could you please give me some hints?
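
For reference, one direction that might avoid the JOIN and the re-GROUP is a nested FOREACH that filters the tag bag in place. This is an untested sketch, not a verified solution: it assumes XmlTag is the flattened relation from the script above, and the names XmlTagCounted, XmlTagAmenity, XmlTagResult, amenity_tags and n_amenity are made up for illustration.

```pig
-- Sketch only: assumes XmlTag is the flattened relation from the script above.
XmlTagCounted = FOREACH XmlTag {
    -- nested FILTER: the amenity tags of this one record (hypothetical alias)
    amenity_tags = FILTER tag BY tag_attr_k == 'amenity';
    -- keep the full tag bag and count the amenity tags next to it
    GENERATE $0 AS id, $2 AS lat, $1 AS lon, tag, COUNT(amenity_tags) AS n_amenity;
};
-- keep only the records that have at least one amenity tag
XmlTagAmenity = FILTER XmlTagCounted BY n_amenity > 0;
-- drop the helper count: id, lat, lon, {(key,value)...(key,value)}
XmlTagResult = FOREACH XmlTagAmenity GENERATE id, lat, lon, tag;
```

The idea is that each record keeps its original tag bag, so nothing has to be flattened and rebuilt. How records with a missing tag bag behave here would need to be checked on the real data.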

Regards,
Jérôme
Cheolsoo Park 2013-01-31, 19:45
Jonathan Coveney 2013-01-31, 23:27
Jerome Pierson 2013-02-05, 16:06
Cheolsoo Park 2013-02-05, 22:13
Jerome Person 2013-02-05, 22:57
Prashant Kommireddi 2013-02-05, 23:10
Jerome Person 2013-02-06, 10:00
Cheolsoo Park 2013-02-06, 16:41
Jerome Person 2013-02-06, 16:55
psic 2013-02-05, 22:57