
md5sum of files on HDFS ?


Scott Golby 2011-03-02, 17:05
stu24mail@... 2011-03-02, 17:47

Re: md5sum of files on HDFS ?
The FileSystem API exposes a getFileChecksum() method too:
http://archive.cloudera.com/cdh/3/hadoop-0.20.2-CDH3B4/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path)
However, this isn't the straight-up MD5 of the file; as the MD5-of-0MD5-of-512CRC32 label below suggests, it is an MD5 over per-block MD5s of 512-byte-chunk CRC32s, so it only matches another HDFS checksum computed with the same chunk and block settings.

[1]monster01::groovy-1.7.8(12813)$ CLASSPATH=$(hadoop classpath) bin/groovysh
groovy:000> import org.apache.hadoop.fs.FileSystem
===> [import org.apache.hadoop.fs.FileSystem]
groovy:000> import org.apache.hadoop.conf.Configuration
===> [import org.apache.hadoop.fs.FileSystem, import org.apache.hadoop.conf.Configuration]
groovy:000> import org.apache.hadoop.fs.Path
===> [import org.apache.hadoop.fs.FileSystem, import org.apache.hadoop.conf.Configuration, import org.apache.hadoop.fs.Path]
groovy:000> fs = FileSystem.get(new Configuration())
===> DFS[DFSClient[clientName=DFSClient_-919136415, ugi=philip]]
groovy:000> fs
===> DFS[DFSClient[clientName=DFSClient_-919136415, ugi=philip]]
groovy:000> fs.getFileChecksum(new Path("/tmp/issue"))
===> MD5-of-0MD5-of-512CRC32:eeec9870219b2381f99ac8ea0c2d0d60
groovy:000>

Whereas:

[0]monster01::~(12845)$hadoop fs -put /etc/issue /tmp/issue
[1]monster01::~(12844)$md5sum /etc/issue
6c9222ee501323045d85545853ebea55  /etc/issue
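
The straight-up MD5 can still be computed client-side by streaming the file back through the FileSystem API, which is essentially what stu suggests below. Here is a minimal sketch in Java; the class name HdfsMd5 and the 64 KB buffer are placeholders of mine, not anything from the thread:

import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: stream an HDFS file through MessageDigest to get the same
// MD5 that md5sum would print for a local copy of the file.
public class HdfsMd5 {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[64 * 1024];
        FSDataInputStream in = fs.open(new Path(args[0]));
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                md5.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b)); // two-digit unsigned hex
        }
        // Same "<hash>  <path>" layout that md5sum prints.
        System.out.println(hex + "  " + args[0]);
    }
}

Compiled against the Hadoop jars, this runs one file at a time, e.g. java -cp "$(hadoop classpath):." HdfsMd5 /tmp/issue, and should print the same digest as md5sum /etc/issue above. For a one-off check with no code at all, hadoop fs -cat /tmp/issue | md5sum reads the file serially through a single node and yields the same hash.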

On Wed, Mar 2, 2011 at 9:47 AM, <[EMAIL PROTECTED]> wrote:

> I don't think there is a built-in command. I would just use the Java or
> Thrift API to read the file & calculate the hash. (Thrift + Python/Ruby/etc.)
>
> Take care,
>  -stu
> -----Original Message-----
> From: Scott Golby <[EMAIL PROTECTED]>
> Date: Wed, 2 Mar 2011 11:05:04
> To: [EMAIL PROTECTED]<[EMAIL PROTECTED]>
> Reply-To: [EMAIL PROTECTED]
> Subject: md5sum of files on HDFS ?
>
> Hi Everyone,
>
> How can I do an md5sum/sha1sum directly against files on HDFS ?
>
>
> A pretty common thing I do when archiving files is make an md5sum list
>
> eg)  md5sum /archive/path/* > md5sum-list.txt
>
> Then later, should I need to check that the files are OK, perhaps before a
> restore or when I copy them somewhere else, I'll do
> md5sum -c md5sum-list.txt
>
>
> I'd be OK doing it one file at a time
>
> java -jar <something> hdfs://some/path/in-hadoop/filename
>
>
> I'm also OK doing it serially through a single node. I've been doing some
> googling and JIRA ticket reading, such as
> https://issues.apache.org/jira/browse/HADOOP-3981, and for my use case
> serial read is not a limitation.
>
> What is a bit of a requirement is something I can run as a standard Linux
> command on local disk and do 1:1 output comparison.
> eg) Check the HDFS md5sum of a file, copyToLocal, re-check the local disk md5sum.
>
> Thanks,
> Scott
Josh Patterson 2011-03-08, 14:36
Will Maier 2011-03-02, 18:00