-
Generating many small PNGs to Amazon S3 with MapReduce
tim robertson 2009-04-14, 09:35

Hi all,

I am currently processing a lot of raw CSV data and producing a summary text file which I load into mysql. On top of this I have a PHP application to generate tiles for google mapping (sample tile: http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800). Here is a (dev server) example of the final map client: http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800 - the dynamic grids as you zoom are all pre-calculated.

I am considering (for better throughput as maps generate huge request volumes) pregenerating all my tiles (PNG) and storing them in S3 with cloudfront. There will be billions of PNGs produced each at 1-3KB each.

Could someone please recommend the best place to generate the PNGs and when to push them to S3 in a MR system? If I did the PNG generation and upload to S3 in the reduce the same task on multiple machines will compete with each other right? Should I generate the PNGs to a local directory and then on Task success push the lot up? I am assuming billions of 1-3KB files on HDFS is not a good idea.

I will use EC2 for the MR for the time being, but this will be moved to a local cluster still pushing to S3...

Cheers,

Tim
-
Re: Generating many small PNGs to Amazon S3 with MapReduce
Brian Bockelman 2009-04-14, 12:37

Hey Tim,

Why don't you put the PNGs in a SequenceFile in the output of your reduce task? You could then have a post-processing step that unpacks the PNG and places it onto S3. (If my numbers are correct, you're looking at around 3TB of data; is this right? With that much, you might want another separate Map task to unpack all the files in parallel ... really depends on the throughput you get to Amazon)

Brian
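
A minimal sketch of the pack-then-unpack idea Brian describes, assuming a simple length-prefixed record stream as a stand-in for a SequenceFile (this is not the real SequenceFile format) and a hypothetical `upload` callable in place of an actual S3 client. The function names and the `.png` key suffix are illustrative only.

```python
import io
import struct

HEADER = ">II"  # big-endian key length, value length


def write_records(stream, records):
    """Append (key, png_bytes) pairs as length-prefixed records."""
    for key, value in records:
        key_bytes = key.encode("utf-8")
        stream.write(struct.pack(HEADER, len(key_bytes), len(value)))
        stream.write(key_bytes)
        stream.write(value)


def read_records(stream):
    """Yield (key, png_bytes) pairs in the order they were written."""
    header_size = struct.calcsize(HEADER)
    while True:
        header = stream.read(header_size)
        if not header:
            return  # clean end of stream
        if len(header) < header_size:
            raise ValueError("truncated record header")
        key_len, value_len = struct.unpack(HEADER, header)
        key = stream.read(key_len).decode("utf-8")
        value = stream.read(value_len)
        if len(value) < value_len:
            raise ValueError("truncated record body")
        yield key, value


def unpack_and_upload(stream, upload):
    """Post-processing step: pass every packed tile to an upload callable."""
    count = 0
    for key, value in read_records(stream):
        upload(key, value)
        count += 1
    return count
```

In a real job the reducer would write the container and a separate map-only job would play the role of `unpack_and_upload`, one split per task.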
-
Re: Generating many small PNGs to Amazon S3 with MapReduce
tim robertson 2009-04-14, 12:44

Thanks Brian,

This is pretty much what I was looking for.

Your calculations are correct but based on the assumption that at all zoom levels we will need all tiles generated. Given the sparsity of data, it actually results in only a few 100GBs. I'll run a second MR job with the map pushing to S3 then to make use of parallel loading.

Cheers,

Tim
-
Re: Generating many small PNGs to Amazon S3 with MapReduce
tim robertson 2009-04-14, 14:10

Sorry Brian, can I just ask please...

I have the PNGs in the Sequence file for my sample set. If I use a second MR job and push to S3 in the map, surely I run into the scenario where multiple tasks are running on the same section of the sequence file and thus pushing the same data to S3. Am I missing something obvious (e.g. can I disable this behavior)?

Cheers

Tim
-
Re: Generating many small PNGs to Amazon S3 with MapReduce
Kevin Peterson 2009-04-16, 00:21

On Tue, Apr 14, 2009 at 2:35 AM, tim robertson <[EMAIL PROTECTED]> wrote:

> I am considering (for better throughput as maps generate huge request volumes) pregenerating all my tiles (PNG) and storing them in S3 with cloudfront. There will be billions of PNGs produced each at 1-3KB each.

Storing billions of PNGs each at 1-3kb each into S3 will be perfectly fine, there is no need to generate them and then push them at once, if you are storing them each in their own S3 object (which they must be, if you intend to fetch them using cloudfront). Each S3 object is unique, and can be written fully in parallel. If you are writing to the same S3 object twice, ... well, you're doing it wrong.

However, do the math on the costs for S3. We were doing something similar, and found that we were spending a fortune on our put requests at $0.01 per 1000, and next to nothing on storage. I've since moved to a more complicated model where I pack many small items in each object and store an index in simpledb. You'll need to partition your SimpleDBs if you do this.
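
The packing model Kevin mentions could look roughly like the sketch below. It is an assumption about the general shape of such a scheme, not his actual implementation: an in-memory dict stands in for the SimpleDB index, and `pack_tiles`, `read_tile` and `puts_needed` are hypothetical names.

```python
def pack_tiles(tiles):
    """Concatenate tiles into one blob and return (blob, index).

    The index maps each tile key to its (offset, length) inside the blob.
    """
    blob = bytearray()
    index = {}
    for key, data in tiles:
        index[key] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index


def read_tile(blob, index, key):
    """Recover one tile from a packed blob using the index."""
    offset, length = index[key]
    return blob[offset:offset + length]


def puts_needed(tile_count, tiles_per_object):
    """PUT requests required when tiles are packed (ceiling division)."""
    return -(-tile_count // tiles_per_object)
```

Packing 1000 tiles per object cuts the PUT count by a factor of 1000, at the price that individual tiles are no longer directly fetchable as their own objects, which is the trade-off Kevin points out for cloudfront.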
-
Re: Generating many small PNGs to Amazon S3 with MapReduce
tim robertson 2009-04-16, 08:14

Thanks Kevin,

"... well, you're doing it wrong."

This is what I'm afraid of :o)

I know the TaskTracker for the Maps for example can run on the same part of the input file but not so sure on the Reduce. In the reduce, will the same keys be run on multiple machines in competition?
-
Re: Generating many small PNGs to Amazon S3 with MapReduce
tim robertson 2009-04-16, 08:27

Hi Chuck,

Thank you very much for this opportunity. I also think it is a nice case study; it goes beyond the typical wordcount example by generating something that people can actually see and play with immediately afterwards (e.g. maps). It is also showcasing nicely the community effort to collectively bring together information on the world's biodiversity - the GBIF network really is a nice example of a free and open access community who are collectively addressing interoperability globally. Can you please tell me what kind of time frame you would need the case study in?

I have just got my Java PNG generation code down to 130msec on the Mac, so I am pretty much ready to start running on EC2 and do the volume tile generation, so will blog the whole thing on http://biodivertido.blogspot.com at some point soon. I have to travel to the US on Saturday for a week so this will delay it somewhat.

What is not 100% clear to me is when to push to S3: In the Map I will output the TileId-ZoomLevel-SpeciesId as the key, along with the count, and in the Reduce I group the counts into larger tiles, and create the PNG. I could write to Sequencefile here... but I suspect I could just push to the s3 bucket here also - as long as the task tracker does not send the same Keys to multiple reduce tasks - my Hadoop naivety showing here (I wrote an in memory threaded MapReduceLite which does not compete reducers, but have not got into the Hadoop code quite so much yet).

Cheers,

Tim

On Thu, Apr 16, 2009 at 1:49 AM, Chuck Lam <[EMAIL PROTECTED]> wrote:

> Hi Tim,
>
> I'm really interested in your application at gbif.org. I'm in the middle of writing Hadoop in Action ( http://www.manning.com/lam/ ) and think this may make for an interesting hadoop case study, since you're taking advantage of a lot of different pieces (EC2, S3, cloudfront, SequenceFiles, PHP/streaming). Would you be interested in discussing making a 4-5 page case study out of this?
>
> As to your question, I don't know if it's been properly answered, but I don't know why you think that "multiple tasks are running on the same section of the sequence file." Maybe you can elaborate further and I'll see if I can offer any thoughts.
-
Re: Generating many small PNGs to Amazon S3 with MapReduce
tim robertson 2009-04-16, 12:38

> However, do the math on the costs for S3. We were doing something similar, and found that we were spending a fortune on our put requests at $0.01 per 1000, and next to nothing on storage. I've since moved to a more complicated model where I pack many small items in each object and store an index in simpledb. You'll need to partition your SimpleDBs if you do this.

Thanks a lot to Kevin for this - I stupidly overlooked the S3 put cost thinking EC2->S3 transfer was free, without realising there is still a PUT cost...

I will reconsider and look at copying your approach and compare it with a few rendering EC2 instances running off mysql or so.

Thanks again.

Tim
-
Re: Generating many small PNGs to Amazon S3 with MapReduce
Todd Lipcon 2009-04-16, 17:28

On Thu, Apr 16, 2009 at 1:27 AM, tim robertson <[EMAIL PROTECTED]> wrote:

> What is not 100% clear to me is when to push to S3: In the Map I will output the TileId-ZoomLevel-SpeciesId as the key, along with the count, and in the Reduce I group the counts into larger tiles, and create the PNG. I could write to Sequencefile here... but I suspect I could just push to the s3 bucket here also - as long as the task tracker does not send the same Keys to multiple reduce tasks - my Hadoop naivety showing here (I wrote an in memory threaded MapReduceLite which does not compete reducers, but not got into the Hadoop code quite so much yet).

Hi Tim,

If I understand what you mean by "compete reducers", then you're referring to the feature called "speculative execution", in which Hadoop schedules multiple TaskTrackers to perform the same task. When one of the multiply-scheduled tasks finishes, the other one is killed. As you seem to already understand, this might cause issues if your tasks have non-idempotent side effects on the outside world.

The configuration variable you need to look at is mapred.reduce.tasks.speculative.execution. If this is set to false, only one reduce task will be run on each key. If it is true, it's possible that some reduce tasks will be scheduled twice to try to reduce variance in job completion times due to slow machines.

There's an equivalent configuration variable mapred.map.tasks.speculative.execution that controls this behavior for your map tasks.

Hope that helps,
-Todd
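
To illustrate Todd's point about side effects, here is a small sketch under stated assumptions: `FakeBucket` is an in-memory stand-in for an object store (not a real S3 client), the key layout in `object_key` is hypothetical, and tile rendering is assumed to be deterministic. With deterministic object names, a duplicated task attempt overwrites each object with identical bytes, so the stored result is unchanged, but every duplicate write still counts as a PUT.

```python
def object_key(tile_id, zoom, species_id):
    """Deterministic object name: the same reduce key always maps to the same object."""
    return f"{zoom}/{species_id}/{tile_id}.png"


class FakeBucket:
    """In-memory stand-in for an object store; put overwrites any existing object."""

    def __init__(self):
        self.objects = {}
        self.put_count = 0

    def put(self, key, data):
        self.objects[key] = data
        self.put_count += 1


def run_reduce_attempt(bucket, reduce_input):
    """One attempt of a reduce task: upload the rendered PNG for each key."""
    for (tile_id, zoom, species_id), png in reduce_input:
        bucket.put(object_key(tile_id, zoom, species_id), png)
```

Running the same attempt twice leaves `bucket.objects` identical but doubles `put_count`, which is one reason to turn speculative execution off when each PUT is billed.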
-
Re: Generating many small PNGs to Amazon S3 with MapReduce
tim robertson 2009-04-16, 19:48

Thanks Todd and Chuck - sorry, my terminology was wrong... exactly what I was looking for.

I am letting mysql chug through the zoom levels now to get some final numbers on the tiles and cost to S3 PUT. Looks like zoom level 8 is feasible for our current data volume but not a long term option if the input data explodes in volume.

Cheers,

Tim

On Thu, Apr 16, 2009 at 9:05 PM, Chuck Lam <[EMAIL PROTECTED]> wrote:

> ar.. i totally missed the point you had said about "compete reducers". it didn't occur to me that you were talking about hadoop's speculative execution. todd's solution to turn off speculative execution is correct.
>
> i'll respond to the rest of your email later today.
>
> On Thu, Apr 16, 2009 at 5:23 AM, tim robertson <[EMAIL PROTECTED]> wrote:
>>
>> Thanks Chuck,
>>
>> > I'm shooting for finishing the case studies by the end of May, but it'll be nice to have a draft done by mid-May so we can edit it to have a consistent style with the other case studies.
>>
>> I will do what I can!
>>
>> > I read your blog and found a couple posts on spatial joining. It wasn't clear to me from reading the posts whether the work was just experimental or if it led to some application. If it led to an application, then we may incorporate that into the case study too.
>>
>> It led to http://widgets.gbif.org/test/PACountry.html#/area/2571 which shows a statistical summary for our data (latitude longitude) cross-referenced with the polygons on the protected areas of the world. In truth though, we processed it in PostGIS and Hadoop and found that the PostGIS approach, while way slower, was fine for now and we developed the scripts for that quicker. So you can say it was experimental... I do have ambitions to do a basic geospatial join (points in polygons) for PIG, Cloudbase or Hive2.0 but alas have not found time. Also - the blog is always a late Sunday night effort so really is not written well.
>>
>> > BTW, where in the US are you traveling to? I'm in Silicon Valley, so maybe we can meet up if you'll happen to be in the area and can squeeze a little time out.
>>
>> Would have loved to... but in Boston and DC this time. In a few weeks will be in Chicago, but for some reason I have never made it over to your neck of the woods.
>>
>> > I don't know what data you need to produce a single PNG file, so I don't know whether having map output TileId-ZoomLevel-SpeciesId as key is the right factoring. To me it looks like each PNG represents one tile at one zoom level but includes multiple species.
>>
>> We do individual species and higher levels of taxa (up to all data). This is all data, grouped to 1x1 degree cells (think 100x100 km) with counts. Currently preprocessed with mysql, but another hadoop candidate as we grow.
>>
>> http://maps.gbif.org/mapserver/draw.pl?dtype=box&imgonly=1&path=http%3A%2F%2Fdata.gbif.org%2Fmaplayer%2Ftaxon%2F13140803&extent=-180.0+-90.0+180.0+90.0&mode=browse&refresh=Refresh&layer=countryborders
>>
>> > In any case, under Hadoop/MapReduce, all key/value pairs outputted by the mappers are grouped by key before being sent to the reducer, so it's guaranteed that the same key will not go to multiple reducers.
>>
>> That is good to know. I knew Map tasks would get run on multiple machines if it detects a machine is idle, but wasn't sure if Hadoop would put reducers on machines to compete against each other and kill the one that did not finish first.
>>
>> > You may also want to think more about the actual volume and cost of all this. You initially said that you will have "billions of PNGs produced each at 1-3KB" but then later said the data size is only a few 100GB due to sparsity. Either you're not really creating billions of PNGs, or a lot of them are actually less than 1KB. Kevin brought up a good point that S3 charges $0.01 for every 1000 files ("objects") created, so generating 1
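
Chuck's quoted point that a given key never goes to more than one reducer can be sketched as below. This is a simplified model of hash partitioning, not Hadoop's own code: `zlib.crc32` is a stand-in hash, and `partition` and `shuffle` are hypothetical names. The key strings follow the TileId-ZoomLevel-SpeciesId layout Tim describes.

```python
import zlib


def partition(key, num_reducers):
    """Choose a reducer from the key alone, so equal keys always agree."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Group mapper output by reducer, then by key within each reducer."""
    reducers = [dict() for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[partition(key, num_reducers)].setdefault(key, []).append(value)
    return reducers
```

Because the reducer is a pure function of the key, every count for one tile ends up in a single reducer's input. Speculative execution is a separate matter: it reruns a whole task, it does not split a key across tasks.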
-
Re: Generating many small PNGs to Amazon S3 with MapReduce
Stuart Sierra 2009-04-23, 14:08

On Wed, Apr 15, 2009 at 8:21 PM, Kevin Peterson <[EMAIL PROTECTED]> wrote:

> However, do the math on the costs for S3. We were doing something similar, and found that we were spending a fortune on our put requests at $0.01 per 1000, and next to nothing on storage.

I made a similar discovery. The cost of PUT adds up fast. One billion PUTs will cost you $10 million!

-Stuart Sierra
-
Re: Generating many small PNGs to Amazon S3 with MapReduce
Andrew Hitchcock 2009-04-23, 21:02

How do you figure? Puts are one penny per thousand, so I think it'd only cost $10,000. Here's the math I'm using:

1 billion * ($0.01 / 1000) = 10,000

Math courtesy of Google: http://www.google.com/search?q=1+billion+*+(0.01+%2F+1000)

Still expensive, but not unreasonably so.

Andrew
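
Andrew's arithmetic can be checked with a few lines. This sketch assumes only the $0.01 per 1000 PUTs price quoted in the thread; `put_cost_dollars` is a hypothetical helper, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def put_cost_dollars(put_count, price_per_thousand=Fraction(1, 100)):
    """Total PUT cost in dollars at a given price per 1000 requests."""
    return Fraction(put_count) * price_per_thousand / 1000


# One billion PUTs at $0.01 per 1000 requests.
per_thousand_pricing = put_cost_dollars(10**9)

# The same volume if $0.01 were charged per single PUT, i.e. $10 per 1000.
per_request_pricing = put_cost_dollars(10**9, price_per_thousand=Fraction(10))
```

The first figure is $10,000 and the second is $10,000,000, which is the factor-of-1000 gap between the two estimates in the thread.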
-
Re: Generating many small PNGs to Amazon S3 with MapReduce
Stuart Sierra 2009-04-23, 21:45

On Thu, Apr 23, 2009 at 5:02 PM, Andrew Hitchcock <[EMAIL PROTECTED]> wrote:

> 1 billion * ($0.01 / 1000) = 10,000

Oh yeah, I was thinking $0.01 for a single PUT. Silly me.

-S
-
Re: Generating many small PNGs to Amazon S3 with MapReduce
tim robertson 2009-04-24, 03:46

If anyone is interested I did finally get round to processing it all, and due to the sparsity of data we have, for all 23 zoom levels and all species we have information on, the result was 807 million PNGs, which is $8,000 to PUT to S3 - too much for me to pay.

So like most things I will probably go for a compromise and preprocess 10 zoom levels into S3, which will only come in at $457 (only the PUT into S3), and then render the rest on the fly. Only people browsing beyond zoom 10 are then hitting the real time rendering servers so I think this will work out ok performance wise.

Cheers,

Tim