Aniket Mokashi
2013-02-20, 21:18
Johnny Zhang
2013-02-22, 02:30
Aniket Mokashi
2013-02-22, 03:02
Prashant Kommireddi
2013-02-22, 03:07
Aniket Mokashi
2013-02-22, 08:42
Jonathan Coveney
2013-02-22, 09:17
-
Replicated join: is there a setting to make this better?
Aniket Mokashi 2013-02-20, 21:18
Hi devs,
I was looking into the size/record limits for fragment replicated join (map join) in Pig. To test this, I loaded a map (aka fragment) of longs into an alias and joined it with another alias that had a few other columns. With a map file of 50 MB, I saw GC overheads on the mappers. I took a heap dump of a mapper to see what was causing the GC overheads and found that the memory footprint of the fragment itself was high.

[image: Inline image 1]

Note that the hashmap was only able to load about 1.8 million records:

[image: Inline image 2]

The reason was that every map record has an overhead of about 1.5 KB. Most of it is part of the retained heap, but it needs to be garbage collected.

[image: Inline image 3]

So, it turns out:

Size of heap required by the map join above = 1.5 KB * number of records + size of input (uncompressed DataByteArray), assuming the key is a long.

So, to run a replicated join, you need to satisfy the following criterion:

*1.5 KB * number of records + size of input (uncompressed) < estimated free memory in the mapper (total heap - io.sort.mb - some minor constant, etc.)*

Is that the right conclusion? Is there a setting/way to make this better?

Thanks,
Aniket
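The criterion above can be turned into a quick back-of-the-envelope check. The following is only a sketch of that arithmetic, not anything Pig itself provides: the function names are made up for illustration, 1 KB is assumed to mean 1024 bytes, and the 1.5 KB per-record overhead is the figure observed in this one heap dump, so it may differ for other schemas or Pig versions.

```python
KB = 1024          # assumption: the post's "KB" is treated as 1024 bytes
MB = 1024 * KB

# Per-record overhead reported in the post for a long key (observed, not guaranteed).
PER_RECORD_OVERHEAD = 1.5 * KB


def required_heap_bytes(num_records, input_bytes, per_record=PER_RECORD_OVERHEAD):
    """Heap estimate from the post: overhead * records + uncompressed input size."""
    return per_record * num_records + input_bytes


def fits_in_mapper(num_records, input_bytes, total_heap_bytes,
                   io_sort_bytes, other_bytes=0, per_record=PER_RECORD_OVERHEAD):
    """Check the post's criterion: required heap < total heap - io.sort.mb - constant."""
    free_bytes = total_heap_bytes - io_sort_bytes - other_bytes
    return required_heap_bytes(num_records, input_bytes, per_record) < free_bytes


if __name__ == "__main__":
    # Numbers from the post (1.8 million records, 50 MB input); the 1 GB heap
    # and 100 MB io.sort.mb are hypothetical values chosen for illustration.
    need = required_heap_bytes(1_800_000, 50 * MB)
    print(f"estimated heap needed: {need / MB:.0f} MB")
    print("fits:", fits_in_mapper(1_800_000, 50 * MB, 1024 * MB, 100 * MB))
```

Passing a smaller `per_record` value shows how much a more compact in-memory tuple representation would raise the record count that fits in a given heap.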
-
Re: Replicated join: is there a setting to make this better?
Johnny Zhang 2013-02-22, 02:30
Hi, Aniket:
your image is blank :) not sure if this only happens to me though.
Johnny
On Thu, Feb 21, 2013 at 6:08 PM, Aniket Mokashi <[EMAIL PROTECTED]> wrote:
> I think the email was filtered out. Resending.
-
Re: Replicated join: is there a setting to make this better?
Aniket Mokashi 2013-02-22, 03:02
Thanks Johnny. I am not sure how to post these images on mailing lists! :(
-
Re: Replicated join: is there a setting to make this better?
Prashant Kommireddi 2013-02-22, 03:07
Mailing lists don't support attachments. Is JIRA a place we can discuss
this? Based on the outcome, we could classify it as either an improvement/bug or "Not a Problem".
-Prashant
-
Re: Replicated join: is there a setting to make this better?
Aniket Mokashi 2013-02-22, 08:42
Interesting, I found this in 0.11 documentation:
"Fragment replicate joins are experimental; we don't have a strong sense of how small the small relation must be to fit into memory. In our tests with a simple query that involves just a JOIN, a relation of up to 100 M can be used if the process overall gets 1 GB of memory. Please share your observations and experience with us."

Let me open a JIRA to share some of the experience I have with this, or do we already have one?
~Aniket
-
Re: Replicated join: is there a setting to make this better?
Jonathan Coveney 2013-02-22, 09:17
One quick way to vastly improve the memory efficiency is to utilize the
SchemaTuple addition. https://issues.apache.org/jira/browse/PIG-2359
This should cut memory use in half, at least.