Pig >> mail # user >> Some optimization advices


Jerome Pierson 2013-01-31, 17:19
Cheolsoo Park 2013-01-31, 19:45
Jonathan Coveney 2013-01-31, 23:27
Jerome Pierson 2013-02-05, 16:06
Re: Some optimization advices
>> But one more point, I have only one mapper running with this pig job as
>> my cluster has 4 slaves. How could it be different ?

Are you asking why only a single mapper runs even though there are 3 more
slaves available? Having 4 slaves doesn't mean that you will always get 4
mappers/reducers. Hadoop launches one mapper per input split, so the number
of mappers depends on how the input is split, not on the number of slaves.

How many input files do you have?

- If you have just one small file, Pig will launch a single mapper. You can
increase parallelism by lowering the split size so that the file is divided
into more splits:
http://stackoverflow.com/questions/9678180/change-file-split-size-in-hadoop

- If you have many small files, Pig will combine them into a single split
and launch a single mapper. In this case, you might want to lower
pig.maxCombinedSplitSize:
http://pig.apache.org/docs/r0.10.0/perf.html#combine-files
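
A minimal sketch of both settings, placed at the top of the Pig script. It
assumes Hadoop 1.x property names and an input format that can be split;
the 16 MB values are illustrative only, not tuned recommendations. The two
are shown together because Pig may recombine small splits up to
pig.maxCombinedSplitSize:

```pig
-- Assumption: Hadoop 1.x property name. Caps the size of each input split
-- (in bytes) so that one file can be divided into several splits.
SET mapred.max.split.size 16777216;

-- Upper bound (in bytes) on how much data Pig combines into a single split.
-- Lowering it means fewer small files or splits are packed into each mapper.
SET pig.maxCombinedSplitSize 16777216;
```

If the loader you use cannot split its input (this depends on the loader,
which I have not seen), these settings will not add mappers for a single file.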

Thanks,
Cheolsoo

On Tue, Feb 5, 2013 at 8:06 AM, Jerome Pierson
<[EMAIL PROTECTED]> wrote:

> Thanks a lot. It works fine.
>
> But one more point, I have only one mapper running with this pig job as my
> cluster has 4 slaves.
> How could it be different ?
>
> Regards,
> Jérôme
>
>
> On 31/01/2013 20:45, Cheolsoo Park wrote:
>
>> Hi Jerome,
>>
>> Try this:
>>
>> XmlTag = FOREACH xmlToTuple GENERATE FLATTEN ($0);
>> XmlTag2 = FOREACH XmlTag {
>>      tag_with_amenity = FILTER tag BY (tag_attr_k == 'amenity');
>>      GENERATE *, COUNT(tag_with_amenity) AS count;
>> };
>> XmlTag3 = FOREACH (FILTER XmlTag2 BY count > 0) GENERATE node_attr_id,
>> node_attr_lon, node_attr_lat, tag;
>>
>> Thanks,
>> Cheolsoo
>>
>>
>> On Thu, Jan 31, 2013 at 9:19 AM, Jerome Pierson
>> <[EMAIL PROTECTED]> wrote:
>>
>>> Hi there,
>>>
>>> I am a beginner. I got something working, but I suspect it could be done
>>> better. Let me explain.
>>> (Pig 0.10)
>>>
>>> DESCRIBE shows my data as:
>>>
>>> xmlToTuple: {(node_attr_id: int,node_attr_lon: chararray,node_attr_lat: chararray,tag: {(tag_attr_k: chararray,tag_attr_v: chararray)})}
>>>
>>>
>>> and DUMP shows it like this:
>>>
>>> ((100312088,45.2745669,-12.7776222,{(created_by,JOSM)}))
>>> ((100948454,45.2620946,-12.7849171,))
>>> ((100948519,45.2356985,-12.7707014,{(created_by,JOSM)}))
>>> ((704398904,45.2416667,-13.0058333,{(lat,-13.00583333),(lon,45.24166667)}))
>>> ((1230941976,45.0743117,-12.6888807,{(place,village)}))
>>> ((1230941977,45.0832807,-12.6810328,{(name,Mtsahara)}))
>>> ((1976927219,45.2272263,-12.7794359,))
>>> ((1751057677,45.2216163,-12.7825896,{(amenity,fast_food),(name,Brochetterie)}))
>>> ((1751057678,45.2216953,-12.7829678,{(amenity,fast_food),(name,Brochetterie)}))
>>> ((100948360,45.2338541,-12.7762230,{(amenity,ferry_terminal)}))
>>> ((362795028,45.2086809,-12.8062991,{(amenity,fuel),(operator,Total)}))
>>>
>>>
>>> I want to extract the records that have a certain value in the
>>> tag_attr_k field. For example, give me the records that have a tag
>>> whose tag_attr_k is amenity. That should be:
>>>
>>> (100948360,-12.7762230,45.2338541,{(amenity,ferry_terminal)})
>>> (362795028,-12.8062991,45.2086809,{(operator,Total),(amenity,fuel)})
>>> (1751057677,-12.7825896,45.2216163,{(amenity,fast_food),(name,Brochetterie)})
>>> (1751057678,-12.7829678,45.2216953,{(amenity,fast_food),(name,Brochetterie)})
>>>
>>> So the format is (node_attr_id, node_attr_lat, node_attr_lon,
>>> {(tag_attr_k,tag_attr_v)...(tag_attr_k,tag_attr_v)})
>>>
>>>
>>> I ended up with this script.
>>>
>>>
>>> ...
>>> XmlTag = foreach xmlToTuple GENERATE FLATTEN ($0); -- removed the top-level enclosing bag
>>> XmlTag2 = foreach XmlTag GENERATE $0 as id, $1 as lon, $2 as lat, FLATTEN (tag) as (key, value); -- flatten the bag of tags
>>> XmlTag3 = FILTER XmlTag2 BY key == 'amenity'; -- get all the records with amenity tags
>>> XmlTag4 = JOIN XmlTag3 BY id, XmlTag2 BY id; -- rebuild records with all tags for the ids that have an amenity tag
>>> XmlTag7 = foreach XmlTag4 GENERATE $0 as id,$1 as lon, $2 as lat,$8 as
Jerome Person 2013-02-05, 22:57
Prashant Kommireddi 2013-02-05, 23:10
Jerome Person 2013-02-06, 10:00
Cheolsoo Park 2013-02-06, 16:41
Jerome Person 2013-02-06, 16:55
psic 2013-02-05, 22:57