Pig >> mail # user >> Some optimization advices


Re: Some optimization advices
Hi Jerome,

Try this:

XmlTag = FOREACH xmlToTuple GENERATE FLATTEN ($0);
XmlTag2 = FOREACH XmlTag {
    tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
    GENERATE *, COUNT(tag_with_amenity) AS count;
};
XmlTag3 = FOREACH (FILTER XmlTag2 BY count > 0)
    GENERATE node_attr_id, node_attr_lon, node_attr_lat, tag;
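
For anyone reading along who finds the nested FOREACH hard to picture, here is a rough Python model of the same per-record logic. It is only a sketch, not Pig code: the function name keep_records_with_tag_key and the in-memory list sample are made up for illustration, and a missing tag bag is assumed to behave like an empty one.

```python
# Rough Python model of the nested FOREACH above; this is not Pig code.
# Records mirror the DUMP output: (node_attr_id, node_attr_lon, node_attr_lat, tag),
# where tag is a list of (tag_attr_k, tag_attr_v) pairs, or None for a missing bag.
# Assumption: a missing bag is treated like an empty bag.


def keep_records_with_tag_key(records, wanted_key):
    """Keep whole records whose tag bag has at least one pair with wanted_key."""
    kept = []
    for node_id, lon, lat, tags in records:
        # Inner FILTER: restrict this record's bag to the wanted key.
        matching = [pair for pair in (tags or []) if pair[0] == wanted_key]
        # COUNT(...) > 0: keep the record, with its full bag, if anything matched.
        if len(matching) > 0:
            kept.append((node_id, lon, lat, tags))
    return kept


# A few rows taken from the DUMP output in the question.
sample = [
    (100312088, "45.2745669", "-12.7776222", [("created_by", "JOSM")]),
    (100948454, "45.2620946", "-12.7849171", None),
    (1751057677, "45.2216163", "-12.7825896",
     [("amenity", "fast_food"), ("name", "Brochetterie")]),
    (100948360, "45.2338541", "-12.7762230", [("amenity", "ferry_terminal")]),
]

result = keep_records_with_tag_key(sample, "amenity")
print([record[0] for record in result])  # prints [1751057677, 100948360]
```

The point of the model is that each record is examined once and kept whole, so there is no need to flatten the bag, join back on the id, and regroup.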

Thanks,
Cheolsoo
On Thu, Jan 31, 2013 at 9:19 AM, Jerome Pierson
<[EMAIL PROTECTED]> wrote:

> Hi There,
>
> I am a beginner. I got something working, but I suspect it could be done
> better. Let me explain.
> (Pig 0.10)
>
> DESCRIBE shows my data as:
>
>  xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat:
> chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
>
>
> and DUMP prints it like this:
>
> ((100312088,45.2745669,-12.7776222,{(created_by,JOSM)}))
> ((100948454,45.2620946,-12.7849171,))
> ((100948519,45.2356985,-12.7707014,{(created_by,JOSM)}))
> ((704398904,45.2416667,-13.0058333,{(lat,-13.00583333),(lon,45.24166667)}))
> ((1230941976,45.0743117,-12.6888807,{(place,village)}))
> ((1230941977,45.0832807,-12.6810328,{(name,Mtsahara)}))
> ((1976927219,45.2272263,-12.7794359,))
> ((1751057677,45.2216163,-12.7825896,{(amenity,fast_food),(name,Brochetterie)}))
> ((1751057678,45.2216953,-12.7829678,{(amenity,fast_food),(name,Brochetterie)}))
> ((100948360,45.2338541,-12.7762230,{(amenity,ferry_terminal)}))
> ((362795028,45.2086809,-12.8062991,{(amenity,fuel),(operator,Total)}))
>
> I want to extract the records that have a certain value in the tag_attr_k
> field. For example: give me the records where tag_attr_k is 'amenity'.
> That should be:
>
> (100948360,-12.7762230,45.2338541,{(amenity,ferry_terminal)})
> (362795028,-12.8062991,45.2086809,{(operator,Total),(amenity,fuel)})
> (1751057677,-12.7825896,45.2216163,{(amenity,fast_food),(name,Brochetterie)})
> (1751057678,-12.7829678,45.2216953,{(amenity,fast_food),(name,Brochetterie)})
>
> So: (node_attr_id, node_attr_lat, node_attr_lon,
> {(tag_attr_k,tag_attr_v)...(tag_attr_k,tag_attr_v)})
>
> I ended up with this script.
>
>
> ...
> -- remove the enclosing top level
> XmlTag = FOREACH xmlToTuple GENERATE FLATTEN($0);
> -- flatten the bag of tags
> XmlTag2 = FOREACH XmlTag GENERATE $0 AS id, $1 AS lon, $2 AS lat, FLATTEN(tag) AS (key, value);
> -- get all the records with amenity tags
> XmlTag3 = FILTER XmlTag2 BY key == 'amenity';
> -- re-build records: re-attach all tags to the ids that have an amenity tag
> XmlTag4 = JOIN XmlTag3 BY id, XmlTag2 BY id;
> -- re-build records: remove redundant fields
> XmlTag7 = FOREACH XmlTag4 GENERATE $0 AS id, $1 AS lon, $2 AS lat, $8 AS key, $9 AS value;
> -- re-build records: group redundant records
> XmlTag5 = GROUP XmlTag7 BY (id, lat, lon);
> -- re-build records: id, lat, lon, {(key,value)...(key,value)}
> XmlTag8 = FOREACH XmlTag5 {
>     tag = FOREACH XmlTag7 GENERATE key, value;
>     GENERATE group.id, group.lat, group.lon, tag;
> };
>
> The schemas of these relations are:
>
> xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat:
> chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
> XmlTag: {null::node_attr_id: int,null::node_attr_lon:
> chararray,null::node_attr_lat: chararray,null::tag: {(tag_attr_k:
> chararray,tag_attr_v: chararray)}}
> XmlTag2: {id: int,lon: chararray,lat: chararray,key: chararray,value:
> chararray}
> XmlTag3: {id: int,lon: chararray,lat: chararray,key: chararray,value:
> chararray}
> XmlTag4: {XmlTag3::id: int,XmlTag3::lon: chararray,XmlTag3::lat:
> chararray,XmlTag3::key: chararray,XmlTag3::value: chararray,XmlTag2::id:
> int,XmlTag2::lon: chararray,XmlTag2::lat: chararray,XmlTag2::key:
> chararray,XmlTag2::value: chararray}
> XmlTag7: {id: int,lon: chararray,lat: chararray,key: chararray,value:
> chararray}
> XmlTag5: {group: (id: int,lat: chararray,lon: chararray),XmlTag7: {(id:
> int,lon: chararray,lat: chararray,key: chararray,value: chararray)}}
> XmlTag8: {id: int,lat: chararray,lon: chararray,tag: {(key: