I am researching a Hadoop solution for an existing application that requires a directory structure full of data for processing.
To make the Hadoop solution work, I need to deploy the data directory to each DataNode (DN) when the job is executed. I know this isn't new and is commonly done with a Distributed Cache.
Based on experience, what are the common file sizes deployed in a Distributed Cache? I know smaller is better, but how big is too big? I have read that the larger the deployed cache, the more startup latency there will be. I also assume other factors play into this.
What I know:
- Default local.cache.size = 10 GB
What I would like to know:
- Range of desirable sizes for a Distributed Cache = 10 KB to 1 GB?
- Distributed Cache is normally not used if larger than ____?
Another option: put the data directories on each DN ahead of time and provide their location to the TaskTracker?
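
To put a number on "how big is too big", it may help to measure the data directory first. The following is a minimal sketch in plain Java (no Hadoop dependency) that sums the file sizes under a directory and compares the total with the 10 GB local.cache.size default quoted above. The class and method names are hypothetical. The comparison is only a rough guide, since that limit covers all cached data kept on a node, not a single job.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CacheSizeCheck {
    // Default local.cache.size as quoted in the question: 10 GB, in bytes.
    static final long DEFAULT_LOCAL_CACHE_SIZE = 10L * 1024 * 1024 * 1024;

    // Sums the sizes of all regular files under root, including subdirectories.
    static long directorySize(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    // True when the given size does not exceed the given limit.
    static boolean fitsInCache(long bytes, long limit) {
        return bytes <= limit;
    }

    public static void main(String[] args) throws IOException {
        // Directory to measure; defaults to the current directory (assumption).
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        long size = directorySize(root);
        System.out.printf("%s: %d bytes, under default limit: %b%n",
                root, size, fitsInCache(size, DEFAULT_LOCAL_CACHE_SIZE));
    }
}
```

Running it with the data directory as the only argument prints the total size in bytes and whether it is under the default limit.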