Hrishikesh Agashe 2009-11-14, 16:25
Hi,
Default DFS block size is 64 MB. Does this mean that if I put a file smaller than 64 MB on HDFS, it will not be divided any further? I have lots and lots of XML files and I would like to process them directly. Currently I am converting them to sequence files (10 XMLs per sequence file) and then putting them on HDFS. However, creating sequence files is a very time-consuming process. So if I just ensure that all XMLs are smaller than 64 MB (or the value of dfs.block.size), they will not be split and I can safely process them in map/reduce using a SAX parser?
If this is not possible, is there a way to speed up the sequence file creation process?
DISCLAIMER =========This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.
Amogh Vasekar 2009-11-14, 21:07
Replies inline.
On 11/14/09 9:55 PM, "Hrishikesh Agashe" <[EMAIL PROTECTED]> wrote:
Hi,
Default DFS block size is 64 MB. Does this mean that if I put a file smaller than 64 MB on HDFS, it will not be divided any further?
--Yes, the file will be stored as a single block per replica.
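As a rough illustration of the answer above, the block count per replica is just the file size divided by the block size, rounded up. The sketch below is plain arithmetic, not a Hadoop API; the function name `blocks_for_file` and the 64 MB default are assumptions chosen to match the numbers in this thread.

```python
MB = 1024 * 1024


def blocks_for_file(file_size: int, block_size: int = 64 * MB) -> int:
    """Estimate how many blocks (per replica) a file of file_size bytes occupies.

    Hypothetical helper for illustration only; block_size stands in for
    the value of dfs.block.size.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    # Integer ceiling division: any partial block still counts as one block.
    return (file_size + block_size - 1) // block_size


if __name__ == "__main__":
    for size_mb in (10, 64, 65, 200):
        print(f"{size_mb} MB -> {blocks_for_file(size_mb * MB)} block(s)")
```

Any file at or below the block size comes out as one block, which is the case asked about here.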
I have lots and lots of XML files and I would like to process them directly. Currently I am converting them to sequence files (10 XMLs per sequence file) and then putting them on HDFS. However, creating sequence files is a very time-consuming process. So if I just ensure that all XMLs are smaller than 64 MB (or the value of dfs.block.size), they will not be split and I can safely process them in map/reduce using a SAX parser?
--True, but having too many small files is generally not recommended, since they eat into NameNode (NN) resources and add overhead to MapReduce jobs, along with other issues discussed previously in this forum. Cloudera has a pretty detailed blog post on this. Alternatively, you can define the split size used by your MapReduce code with the configuration parameter mapred.min.split.size (it doesn't work with all formats :| ). For XML, there is a format named streamxml or something similar (possibly Hadoop Streaming's StreamXmlRecordReader) that you may want to consider.
Thanks, Amogh
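To make the effect of mapred.min.split.size more concrete, here is a minimal sketch of how a split size could be derived from the block size and the configured minimum and maximum. It follows the max(min, min(max, block)) rule as I understand it from Hadoop's FileInputFormat; the 1.1 slop factor, the function names, and the defaults are assumptions for illustration, and the older mapred API derives its upper bound differently, so treat the numbers as estimates.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # assumption: tolerance allowed for the final split


def compute_split_size(block_size: int, min_size: int = 1,
                       max_size: int = 2**63 - 1) -> int:
    """Split size = max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size: int, block_size: int = 64 * MB,
                 min_size: int = 1, max_size: int = 2**63 - 1) -> int:
    """Estimate the number of input splits for one splittable file."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    split_size = compute_split_size(block_size, min_size, max_size)
    remaining = file_size
    splits = 0
    # Carve off full-size splits while the remainder is clearly larger
    # than one split; a small tail is folded into the last split.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    print(count_splits(10 * MB))                       # file below block size
    print(count_splits(200 * MB))                      # default settings
    print(count_splits(200 * MB, min_size=256 * MB))   # raised minimum
```

Under these assumptions a file smaller than the block size always yields one split, and raising the minimum above the file size keeps a larger file in one split too. Raising the minimum does not merge several small files into a single split, so the small-files overhead mentioned above still applies.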
If this is not possible, is there a way to speed up the sequence file creation process?
Mohit Anchlia 2012-02-26, 01:43
If I want to change the block size, can I use Configuration in a MapReduce job and set it when writing the sequence file, or does it need to be a cluster-wide setting in the .xml files?
Also, is there a way to check the block of a given file?