Re: Some optimization advices
Even better, push the filter
tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
as early in the script as possible.
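One possible reading of that advice, as a rough sketch only: apply the nested filter right after the FLATTEN and keep just the fields needed downstream. It assumes the xmlToTuple schema shown in the quoted message; the aliases XmlTagCounted, XmlTagAmenity, Result and the field amenity_count are made-up names for illustration, and the script has not been run against the real data.

```pig
-- Assumes xmlToTuple: {(node_attr_id, node_attr_lon, node_attr_lat, tag: {(tag_attr_k, tag_attr_v)})}
XmlTag = FOREACH xmlToTuple GENERATE FLATTEN($0);

-- Filter the inner bag as the first step after flattening (hypothetical alias names).
XmlTagCounted = FOREACH XmlTag {
    tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
    GENERATE node_attr_id, node_attr_lon, node_attr_lat, tag,
             COUNT(tag_with_amenity) AS amenity_count;
};

-- Drop records without an amenity tag before any further processing.
XmlTagAmenity = FILTER XmlTagCounted BY amenity_count > 0;

-- Keep all original tags, in the (id, lat, lon, tags) order asked for.
Result = FOREACH XmlTagAmenity GENERATE node_attr_id, node_attr_lat, node_attr_lon, tag;
```

This avoids the JOIN and GROUP steps of the original script, since each record keeps its full bag of tags throughout.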
2013/1/31 Cheolsoo Park <[EMAIL PROTECTED]>

> Hi Jerome,
>
> Try this:
>
> XmlTag = FOREACH xmlToTuple GENERATE FLATTEN ($0);
> XmlTag2 = FOREACH XmlTag {
>     tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
>     GENERATE *, COUNT(tag_with_amenity) AS count;
> };
> XmlTag3 = FOREACH (FILTER XmlTag2 BY count > 0) GENERATE node_attr_id,
> node_attr_lon, node_attr_lat, tag;
>
> Thanks,
> Cheolsoo
>
>
> On Thu, Jan 31, 2013 at 9:19 AM, Jerome Pierson
> <[EMAIL PROTECTED]> wrote:
>
> > Hi There,
> >
> > I am a beginner. I got something working, but I suspect it could be done
> > better. Let me explain.
> > (Pig 0.10)
> >
> > DESCRIBE shows my data as:
> >
> >  xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat:
> > chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
> >
> >
> > and DUMP shows it like this:
> >
> > ((100312088,45.2745669,-12.7776222,{(created_by,JOSM)}))
> > ((100948454,45.2620946,-12.7849171,))
> > ((100948519,45.2356985,-12.7707014,{(created_by,JOSM)}))
> > ((704398904,45.2416667,-13.0058333,{(lat,-13.00583333),(lon,45.24166667)}))
> > ((1230941976,45.0743117,-12.6888807,{(place,village)}))
> > ((1230941977,45.0832807,-12.6810328,{(name,Mtsahara)}))
> > ((1976927219,45.2272263,-12.7794359,))
> > ((1751057677,45.2216163,-12.7825896,{(amenity,fast_food),(name,Brochetterie)}))
> > ((1751057678,45.2216953,-12.7829678,{(amenity,fast_food),(name,Brochetterie)}))
> > ((100948360,45.2338541,-12.7762230,{(amenity,ferry_terminal)}))
> > ((362795028,45.2086809,-12.8062991,{(amenity,fuel),(operator,Total)}))
> >
> > I want to extract the records that have a certain value in the tag_attr_k
> > field. For example, give me the records that have a tag whose tag_attr_k
> > is 'amenity'. That should be:
> >
> > (100948360,-12.7762230,45.2338541,{(amenity,ferry_terminal)})
> > (362795028,-12.8062991,45.2086809,{(operator,Total),(amenity,fuel)})
> > (1751057677,-12.7825896,45.2216163,{(amenity,fast_food),(name,Brochetterie)})
> > (1751057678,-12.7829678,45.2216953,{(amenity,fast_food),(name,Brochetterie)})
> >
> > So: (node_attr_id, node_attr_lat, node_attr_lon, {(tag_attr_k,
> > tag_attr_v)...(tag_attr_k, tag_attr_v)})
> >
> > I ended up with this script.
> >
> >
> > ...
> > -- remove the enclosing top-level bag
> > XmlTag = foreach xmlToTuple GENERATE FLATTEN ($0);
> > -- flatten the bag of tags
> > XmlTag2 = foreach XmlTag GENERATE $0 as id, $1 as lon, $2 as lat, FLATTEN
> > (tag) as (key, value);
> > -- get all the records with amenity tags
> > XmlTag3 = FILTER XmlTag2 BY key == 'amenity';
> > -- re-build records with all tags, for nodes that contain an amenity tag
> > XmlTag4 = JOIN XmlTag3 BY id, XmlTag2 BY id;
> > -- re-build records: remove redundant fields
> > XmlTag7 = foreach XmlTag4 GENERATE $0 as id, $1 as lon, $2 as lat, $8 as
> > key, $9 as value;
> > -- re-build records: group redundant records
> > XmlTag5 = GROUP XmlTag7 BY (id,lat,lon);
> > -- re-build records: id, lat, lon, {(key,value)...(key,value)}
> > XmlTag8 = foreach XmlTag5 {
> >     tag = foreach XmlTag7 GENERATE key, value;
> >     GENERATE group.id, group.lat, group.lon, tag;
> > };
> >
> > The aliases have these schemas:
> >
> > xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat:
> > chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
> > XmlTag: {null::node_attr_id: int,null::node_attr_lon:
> > chararray,null::node_attr_lat: chararray,null::tag: {(tag_attr_k:
> > chararray,tag_attr_v: chararray)}}
> > XmlTag2: {id: int,lon: chararray,lat: chararray,key: chararray,value:
> > chararray}
> > XmlTag3: {id: int,lon: chararray,lat: chararray,key: chararray,value:
> > chararray}
> > XmlTag4: {XmlTag3::id: int,XmlTag3::lon: chararray,XmlTag3::lat:
> > chararray,XmlTag3::key: chararray,XmlTag3::value: chararray,XmlTag2::id:
> > int,XmlTag2::lon: chararray,XmlTag2::lat: chararray,XmlTag2::key: