I have a very large file in HDFS with 3000+ blocks.
I want to run a job with varying input sizes, using the same file as the input each time. Normally the number of tasks equals the number of blocks/splits. Suppose a job with 2 tasks needs to process any two randomly chosen blocks of the given input file.
How can I give a random set of HDFS blocks as the input of a job?
Note: my aim is not to process the input file to produce some output. I want to replicate individual blocks based on the load.
Regards,
S.Suresh,
Research Scholar,
Department of Computer Applications,
National Institute of Technology,
Tiruchirappalli - 620015.
+91-9941506562
You can write a custom InputFormat whose #getSplits(...) returns your required InputSplit objects (with randomised offsets + lengths, etc.).
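For example, something along these lines might work (a rough, untested sketch using the new mapreduce API; the class name and the "random.split.count" configuration key are just placeholders I made up, not anything standard in Hadoop):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class RandomSubsetInputFormat extends TextInputFormat {

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        // Let the parent compute the usual one-split-per-block list first.
        List<InputSplit> splits = super.getSplits(job);

        // How many randomly chosen splits to keep (hypothetical config key).
        int wanted = job.getConfiguration().getInt("random.split.count", 2);

        // Shuffle and keep only the first 'wanted' splits.
        Collections.shuffle(splits, new Random());
        return new ArrayList<InputSplit>(
                splits.subList(0, Math.min(wanted, splits.size())));
    }
}

Then in the driver you would set it with job.setInputFormatClass(RandomSubsetInputFormat.class) and configure the count via conf.setInt("random.split.count", 2). If you need splits at arbitrary offsets rather than a subset of the normal block-aligned splits, you could instead construct FileSplit objects yourself with randomised offsets and lengths.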
On Fri, Feb 7, 2014 at 9:50 PM, Suresh S <[EMAIL PROTECTED]> wrote: