|
Jerome Pierson
2013-01-31, 17:19
Cheolsoo Park
2013-01-31, 19:45
Jonathan Coveney
2013-01-31, 23:27
Jerome Pierson
2013-02-05, 16:06
Cheolsoo Park
2013-02-05, 22:13
Jerome Person
2013-02-05, 22:57
Prashant Kommireddi
2013-02-05, 23:10
Jerome Person
2013-02-06, 10:00
Cheolsoo Park
2013-02-06, 16:41
Jerome Person
2013-02-06, 16:55
psic
2013-02-05, 22:57
|
-
Some optimization advices
Jerome Pierson 2013-01-31, 17:19
Hi there,

I am a beginner. I got something working, but I guess I could have done better. Let me explain (Pig 0.10).

DESCRIBE shows my data as:

xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat: chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}

and DUMP prints it like this:

((100312088,45.2745669,-12.7776222,{(created_by,JOSM)}))
((100948454,45.2620946,-12.7849171,))
((100948519,45.2356985,-12.7707014,{(created_by,JOSM)}))
((704398904,45.2416667,-13.0058333,{(lat,-13.00583333),(lon,45.24166667)}))
((1230941976,45.0743117,-12.6888807,{(place,village)}))
((1230941977,45.0832807,-12.6810328,{(name,Mtsahara)}))
((1976927219,45.2272263,-12.7794359,))
((1751057677,45.2216163,-12.7825896,{(amenity,fast_food),(name,Brochetterie)}))
((1751057678,45.2216953,-12.7829678,{(amenity,fast_food),(name,Brochetterie)}))
((100948360,45.2338541,-12.7762230,{(amenity,ferry_terminal)}))
((362795028,45.2086809,-12.8062991,{(amenity,fuel),(operator,Total)}))

I want to extract the records that have a certain value in the tag_attr_k field. For example, which records have tag_attr_k = amenity? The result should be:

(100948360,-12.7762230,45.2338541,{(amenity,ferry_terminal)})
(362795028,-12.8062991,45.2086809,{(operator,Total),(amenity,fuel)})
(1751057677,-12.7825896,45.2216163,{(amenity,fast_food),(name,Brochetterie)})
(1751057678,-12.7829678,45.2216953,{(amenity,fast_food),(name,Brochetterie)})

That is: (node_attr_id, node_attr_lat, node_attr_lon, {(tag_attr_k, tag_attr_v)...(tag_attr_k, tag_attr_v)})

I ended up with this script:

...
XmlTag = foreach xmlToTuple GENERATE FLATTEN ($0); -- remove the enclosing top-level bag
XmlTag2 = foreach XmlTag GENERATE $0 as id, $1 as lon, $2 as lat, FLATTEN (tag) as (key, value); -- flatten the bag of tags
XmlTag3 = FILTER XmlTag2 BY key == 'amenity'; -- get all the records with amenity tags
XmlTag4 = JOIN XmlTag3 BY id, XmlTag2 BY id; -- re-build records: all tags of the records containing an amenity tag
XmlTag7 = foreach XmlTag4 GENERATE $0 as id, $1 as lon, $2 as lat, $8 as key, $9 as value; -- re-build records: remove redundant fields
XmlTag5 = GROUP XmlTag7 BY (id,lat,lon); -- re-build records: group redundant records
XmlTag8 = foreach XmlTag5 { -- re-build records: id,lat,lon,{(key,value)...(key,value)}
    tag = foreach XmlTag7 GENERATE key, value;
    GENERATE group.id, group.lat, group.lon, tag;
};

The schemas of these relations are:

xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat: chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
XmlTag: {null::node_attr_id: int,null::node_attr_lon: chararray,null::node_attr_lat: chararray,null::tag: {(tag_attr_k: chararray,tag_attr_v: chararray)}}
XmlTag2: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
XmlTag3: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
XmlTag4: {XmlTag3::id: int,XmlTag3::lon: chararray,XmlTag3::lat: chararray,XmlTag3::key: chararray,XmlTag3::value: chararray,XmlTag2::id: int,XmlTag2::lon: chararray,XmlTag2::lat: chararray,XmlTag2::key: chararray,XmlTag2::value: chararray}
XmlTag7: {id: int,lon: chararray,lat: chararray,key: chararray,value: chararray}
XmlTag5: {group: (id: int,lat: chararray,lon: chararray),XmlTag7: {(id: int,lon: chararray,lat: chararray,key: chararray,value: chararray)}}
XmlTag8: {id: int,lat: chararray,lon: chararray,tag: {(key: chararray,value: chararray)}}

I guess this is not very straightforward and can be optimized a lot. Could you give me some hints?

Regards,
Jérôme
-
Re: Some optimization advices
Cheolsoo Park 2013-01-31, 19:45
Hi Jerome,

Try this:

XmlTag = FOREACH xmlToTuple GENERATE FLATTEN ($0);
XmlTag2 = FOREACH XmlTag {
    tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
    GENERATE *, COUNT(tag_with_amenity) AS count;
};
XmlTag3 = FOREACH (FILTER XmlTag2 BY count > 0) GENERATE node_attr_id, node_attr_lon, node_attr_lat, tag;

Thanks,
Cheolsoo
-
Re: Some optimization advices
Jonathan Coveney 2013-01-31, 23:27
Even better, push the

tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');

as high as possible.
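One possible reading of this advice, as a minimal sketch: the nested FILTER sits directly after the flatten, only the needed columns are carried forward, and the output uses the (id, lat, lon, tags) order asked for in the first message. The relation xmlToTuple and the node_attr_* / tag_attr_* field names come from the thread; the aliases Nodes, Counted, Result and n_amenity are made up for illustration, and the script has not been run against the original data.

```pig
-- Sketch only, not run against the original data.
-- xmlToTuple and the node_attr_* / tag_attr_* names come from the thread;
-- Nodes, Counted, Result and n_amenity are illustrative aliases.
Nodes = FOREACH xmlToTuple GENERATE FLATTEN($0);

-- Filter the tag bag right after the flatten and keep only the needed columns.
Counted = FOREACH Nodes {
    amenity_tags = FILTER tag BY tag_attr_k == 'amenity';
    GENERATE node_attr_id, node_attr_lat, node_attr_lon, tag,
             COUNT(amenity_tags) AS n_amenity;
};

-- Drop nodes without an amenity tag, then remove the helper column.
Result = FOREACH (FILTER Counted BY n_amenity > 0)
         GENERATE node_attr_id, node_attr_lat, node_attr_lon, tag;
```

Compared with the original seven-step script, this avoids the JOIN and the GROUP, so the whole selection can stay in the map phase.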
-
Re: Some optimization advices
Jerome Pierson 2013-02-05, 16:06
Thanks a lot. It works fine.

One more point: only one mapper runs for this Pig job, although my cluster has 4 slaves. How can I change that?

Regards,
Jérôme
-
Re: Some optimization advices
Cheolsoo Park 2013-02-05, 22:13
> But one more point, I have only one mapper running with this pig job as
> my cluster has 4 slaves. How could it be different ?

Are you asking why only a single mapper runs even though there are 3 more slaves available? Having 4 slaves doesn't mean that you will always have 4 mappers/reducers. Hadoop launches one mapper per file split.

How many input files do you have?

- If you have just one small file, Pig will launch a single mapper. You can increase parallelism by splitting that file into smaller splits:
  http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop

- If you have many small files, Pig will combine them into a single split and launch a single mapper. In this case, you might want to change pig.maxCombinedSplitSize:
  http://pig.apache.org/docs/r0.10.0/perf.html#combine-files

Thanks,
Cheolsoo
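As a rough sketch, the two settings discussed above could be placed at the top of the Pig script. The 128 MB values are illustrative assumptions, not tuned recommendations, and a smaller split size only helps when the loader's InputFormat can actually split the input.

```pig
-- Sketch only: the byte values (128 MB) are illustrative assumptions.
-- Cap the size of each input split so that one large, splittable file
-- is read by several mappers instead of one.
SET mapred.max.split.size 134217728;

-- For inputs made of many small files: limit how much data Pig combines
-- into a single split (and therefore into a single mapper).
SET pig.maxCombinedSplitSize 134217728;

-- Note: the PARALLEL clause only sets the number of reducers for
-- operators such as GROUP or JOIN; it does not change the number of mappers.
```

The number of mappers is still decided by the number of input splits, so these properties change nothing if the input ends up as a single unsplittable split.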
-
Re: Some optimization advices
Jerome Person 2013-02-05, 22:57
As it is a single 50 GB file, I believe this job needs more than one mapper.

I cannot find any mapred.max.split.size parameter in the job configuration XML file (only mapred.min.split.size = 0).

Is there any keyword to enable parallelism in the Pig script?

Jérôme
-
Re: Some optimization advices
Prashant Kommireddi 2013-02-05, 23:10
Is this a gzip file? You have to make sure the compression scheme you use is splittable for more mappers to be spawned.

-Prashant
-
Re: Some optimization advices
Jerome Person 2013-02-06, 10:00
It is not a gzip file. It is an XML file that is loaded with a UDF.

When does Pig split the input file? I guess my loader is wrong?

Jérôme
-
Re: Some optimization advices Cheolsoo Park 2013-02-06, 16:41
Hi Jerome,
It's not Pig but Hadoop that splits input files. Pig Load/Store UDFs are associated with InputFormat, OutputFormat and RecordReader classes, and Hadoop uses them to decide how to create splits. Here is a more detailed explanation:
http://www.quora.com/How-does-Hadoop-handle-split-input-records

Thanks,
Cheolsoo
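To illustrate the point that Hadoop launches one mapper per split, here is a minimal Java sketch that models, in simplified form, how a file-based InputFormat could carve a single file into splits. The class and method names are hypothetical, the 64 MB split size is an assumed block size, and the 1.1 slop factor follows what Hadoop's FileInputFormat does as far as I know. It is not the loader discussed in this thread.

```java
// Simplified model of split creation for one input file.
// Names are illustrative only; this is not Hadoop's actual API.
class SplitCountSketch {
    // Assumption: the last split may run up to 10% over the target size.
    private static final double SPLIT_SLOP = 1.1;

    static long countSplits(long fileLength, long splitSize, boolean splittable) {
        if (fileLength <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("lengths must be positive");
        }
        if (!splittable) {
            // A non-splittable input becomes one split, hence one mapper.
            return 1;
        }
        long count = 0;
        long remaining = fileLength;
        // Cut full-size splits while more than 1.1 split sizes remain.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            count++;
            remaining -= splitSize;
        }
        // Whatever is left becomes the final split.
        if (remaining != 0) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long fileLength = 50L * 1024 * 1024 * 1024; // the 50 GB file from the thread
        long splitSize = 64L * 1024 * 1024;         // assumed 64 MB block size
        System.out.println(countSplits(fileLength, splitSize, true));  // prints 800
        System.out.println(countSplits(fileLength, splitSize, false)); // prints 1
    }
}
```

Under these assumptions a splittable 50 GB file yields 800 splits, so up to 800 map tasks, while the same file read through an InputFormat that reports it as non-splittable yields a single split and therefore a single mapper, whatever the number of slaves.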
-
Re: Some optimization advices Jerome Person 2013-02-06, 16:55
Thanks. I will have a look at my InputFormat.
If my InputFormat makes only one split, there will be only one mapper.

Regards,
Jérôme.
-
Re: Some optimization advices psic 2013-02-05, 22:57
As it is a single 50 GB file, I believe this job needs more than one mapper.

I cannot find any mapred.max.split.size parameter in the job configuration XML file (only mapred.min.split.size = 0).

Is there any keyword to enable parallelism in the Pig script?

Jérôme.
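On the mapred.max.split.size question: as far as I know, Hadoop's newer-API FileInputFormat derives the split size as max(minSize, min(maxSize, blockSize)) and treats a missing maximum as unbounded, which would explain why the parameter does not have to appear in the job configuration. The Java sketch below only restates that rule. The class name, the 64 MB block size and the example values are assumptions, and the rule matters only if the loader's InputFormat builds on FileInputFormat and reports the file as splittable.

```java
// Sketch of the split-size rule; the class name is hypothetical.
class SplitSizeSketch {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // The block size is clamped between the configured min and max.
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blockSize = 64 * mb; // assumed HDFS block size

        // min = 0 and no max, as in the job configuration from the thread
        System.out.println(computeSplitSize(blockSize, 0, Long.MAX_VALUE) / mb); // prints 64
        // Lowering the max to 16 MB gives smaller splits, so more mappers
        System.out.println(computeSplitSize(blockSize, 0, 16 * mb) / mb);        // prints 16
        // Raising the min to 256 MB gives larger splits, so fewer mappers
        System.out.println(computeSplitSize(blockSize, 256 * mb, Long.MAX_VALUE) / mb); // prints 256
    }
}
```

As for a keyword in the script itself, Pig's PARALLEL clause sets the number of reducers only; the number of mappers follows from the number of input splits.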