Pig, mail # user - Some optimization advices


Re: Some optimization advices
Jerome Pierson 2013-02-05, 16:06
Thanks a lot. It works fine.

One more point: only one mapper runs for this Pig job, even though
my cluster has 4 slaves.
How can I get more mappers to run?
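
A possible direction, offered as a sketch rather than a confirmed fix: the number of map tasks follows the number of input splits, so a single small or non-splittable input yields one mapper no matter how many slaves the cluster has. Assuming the loader produces splittable input and the job runs on Hadoop 1.x, lowering the split-size limits at the top of the script might produce more splits. The 16 MB value is purely illustrative, and whether these settings help depends on the loader, which the thread does not show.

```pig
-- Assumption: the input is splittable and larger than the sizes set below.
-- Pig merges small splits by default; this caps the combined split size (bytes).
SET pig.maxCombinedSplitSize 16777216;
-- Hadoop 1.x property that caps the size of each input split (bytes).
SET mapred.max.split.size 16777216;
```

If the input is one file smaller than a single split, or the loader cannot split it, one mapper is the expected outcome and these settings will not change it.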

Regards,
Jérôme
On 31/01/2013 20:45, Cheolsoo Park wrote:
> Hi Jerome,
>
> Try this:
>
> XmlTag = FOREACH xmlToTuple GENERATE FLATTEN ($0);
> XmlTag2 = FOREACH XmlTag {
>      tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
>      GENERATE *, COUNT(tag_with_amenity) AS count;
> };
> XmlTag3 = FOREACH (FILTER XmlTag2 BY count > 0) GENERATE node_attr_id,
> node_attr_lon, node_attr_lat, tag;
>
> Thanks,
> Cheolsoo
>
>
> On Thu, Jan 31, 2013 at 9:19 AM, Jerome Pierson
> <[EMAIL PROTECTED]> wrote:
>
>> Hi There,
>>
>> I am a beginner. I got something working, but I suspect it could be done
>> better. Let me explain.
>> (Pig 0.10)
>>
>> DESCRIBE shows my data as:
>>
>>   xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat:
>> chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
>>
>>
>> and DUMP shows this:
>>
>> ((100312088,45.2745669,-12.7776222,{(created_by,JOSM)}))
>> ((100948454,45.2620946,-12.7849171,))
>> ((100948519,45.2356985,-12.7707014,{(created_by,JOSM)}))
>> ((704398904,45.2416667,-13.0058333,{(lat,-13.00583333),(lon,45.24166667)}))
>> ((1230941976,45.0743117,-12.6888807,{(place,village)}))
>> ((1230941977,45.0832807,-12.6810328,{(name,Mtsahara)}))
>> ((1976927219,45.2272263,-12.7794359,))
>> ((1751057677,45.2216163,-12.7825896,{(amenity,fast_food),(name,Brochetterie)}))
>> ((1751057678,45.2216953,-12.7829678,{(amenity,fast_food),(name,Brochetterie)}))
>> ((100948360,45.2338541,-12.7762230,{(amenity,ferry_terminal)}))
>> ((362795028,45.2086809,-12.8062991,{(amenity,fuel),(operator,Total)}))
>>
>> I want to extract the records that have a certain value in the tag_attr_k
>> field. For example: give me the records where tag_attr_k is amenity. That should be:
>>
>> (100948360,-12.7762230,45.2338541,{(amenity,ferry_terminal)})
>> (362795028,-12.8062991,45.2086809,{(operator,Total),(amenity,fuel)})
>> (1751057677,-12.7825896,45.2216163,{(amenity,fast_food),(name,Brochetterie)})
>> (1751057678,-12.7829678,45.2216953,{(amenity,fast_food),(name,Brochetterie)})
>>
>> So (node_attr_id, node_attr_lat, node_attr_lon, {(tag_attr_k,
>> tag_attr_v)...(tag_attr_k, tag_attr_v)})
>>
>> I ended up with this script.
>>
>>
>> ...
>> XmlTag = foreach xmlToTuple GENERATE FLATTEN ($0); -- remove the enclosing top-level bag
>> XmlTag2 = foreach XmlTag GENERATE $0 as id, $1 as lon, $2 as lat, FLATTEN (tag) as (key, value); -- flatten the bag of tags
>> XmlTag3 = FILTER XmlTag2 BY key == 'amenity'; -- get all the records with amenity tags
>> XmlTag4 = JOIN XmlTag3 BY id, XmlTag2 BY id; -- re-build records with all tags of nodes that have an amenity tag
>> XmlTag7 = foreach XmlTag4 GENERATE $0 as id, $1 as lon, $2 as lat, $8 as key, $9 as value; -- re-build records: remove redundant fields
>> XmlTag5 = GROUP XmlTag7 BY (id,lat,lon); -- re-build records: group redundant records
>> XmlTag8 = foreach XmlTag5 { -- re-build records: id, lat, lon, {(key,value)...(key,value)}
>>     tag = foreach XmlTag7 GENERATE key, value;
>>     GENERATE group.id, group.lat, group.lon, tag;
>> };
>>
>> These are the schemas of the relations:
>>
>> xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat:
>> chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
>> XmlTag: {null::node_attr_id: int,null::node_attr_lon:
>> chararray,null::node_attr_lat: chararray,null::tag: {(tag_attr_k:
>> chararray,tag_attr_v: chararray)}}
>> XmlTag2: {id: int,lon: chararray,lat: chararray,key: chararray,value:
>> chararray}
>> XmlTag3: {id: int,lon: chararray,lat: chararray,key: chararray,value:
>> chararray}
>> XmlTag4: {XmlTag3::id: int,XmlTag3::lon: chararray,XmlTag3::lat:
>> chararray,XmlTag3::key: chararray,XmlTag3::value: chararray,XmlTag2::id:
>> int,XmlTag2::lon: chararray,XmlTag2::lat: chararray,XmlTag2::key: