

Re: copy chunk of hadoop output
Hi JM,

I am not sure how "dangerous" it is, since we're using a pipe here,
and as you note yourself, it only runs until the last bytes have
been fetched and then terminates.

The -cat process terminates because the process we're piping into
exits first, once it reaches its goal of -c <N bytes>. So the "-cat"
program will certainly not fetch the whole file down; it may fetch a
few extra bytes over the wire due to read buffering, but that extra
data is discarded rather than written to the target file.
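
The early-termination behaviour is just ordinary shell pipe semantics, so it can be sketched locally without HDFS at all (a hedged analogue; the file names and the 1,658,314-byte size simply mirror the example below):

```shell
# Local analogue of "hadoop fs -cat foo.jar | head -c 5 > foo.xml":
# once head has its 5 bytes it closes the pipe, the producer (cat) is
# stopped by SIGPIPE on its next write, and only 5 bytes reach the
# target file no matter how large the source is.
tmpdir=$(mktemp -d)
head -c 1658314 /dev/zero > "$tmpdir/foo.jar"   # same size as the HDFS file below
cat "$tmpdir/foo.jar" | head -c 5 > "$tmpdir/foo.xml"
wc -c < "$tmpdir/foo.xml"                       # prints 5 on GNU coreutils
rm -r "$tmpdir"
```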

We can try it out and observe the "clienttrace" logged
at the DN at the end of the -cat's read. Here's an example:

I wrote a ~1.6 MB file called "foo.jar"; see "bytes"
below, it's 1,658,314 bytes (~1.58 MB):

2013-02-20 23:55:19,777 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/127.0.0.1:58785, dest: /127.0.0.1:50010, bytes: 1658314, op:
HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_915204057_1, offset: 0,
srvID: DS-1092147940-192.168.2.1-50010-1349279636946, blockid:
BP-1461691939-192.168.2.1-1349279623549:blk_2568668834545125596_73870,
duration: 192289000

I ran the command "hadoop fs -cat foo.jar | head -c 5 > foo.xml" to
store the first 5 bytes into a local file:

Asserting that after the command we get 5 bytes:
➜  ~ wc -c foo.xml
       5 foo.xml

Asserting that the DN didn't IO-read the whole file: see the read op
below and its "bytes" parameter, it's only about 193 KB, not the whole
~1.58 MB block we wrote earlier:

2013-02-21 00:01:32,437 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/127.0.0.1:50010, dest: /127.0.0.1:58802, bytes: 198144, op:
HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-1698829178_1, offset: 0,
srvID: DS-1092147940-192.168.2.1-50010-1349279636946, blockid:
BP-1461691939-192.168.2.1-1349279623549:blk_2568668834545125596_73870,
duration: 19207000
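
As an aside on why the read was ~193 KB rather than a rounder number: this is only my guess, not something the log proves, but the figure matches HDFS's default 64 KB data packets if the logged "bytes" value also counts the per-512-byte CRC checksums (4 bytes per chunk, hypothetical accounting on my part):

```shell
# Hypothetical accounting (my assumption, not verified against the DN code):
# one 64 KB packet = 65536 data bytes + (65536 / 512) chunks * 4 CRC bytes.
packet=$((65536 + 65536 / 512 * 4))   # 66048 bytes per packet
echo $((3 * packet))                  # three packets -> 198144, as logged
```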

I don't see how this is any more dangerous than doing a
-copyToLocal/-get, which retrieves the whole file anyway.

On Wed, Feb 20, 2013 at 9:25 PM, Jean-Marc Spaggiari
<[EMAIL PROTECTED]> wrote:
> But be careful.
>
> hadoop fs -cat will retrieve the entire file and will only finish
> when it has retrieved the last bytes you are looking for.
>
> If your file is many GB in size, this command will take a long time
> to complete and will put pressure on your network.
>
> JM
>
> 2013/2/19, jamal sasha <[EMAIL PROTECTED]>:
>> Awesome thanks :)
>>
>>
>> On Tue, Feb 19, 2013 at 2:14 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>>
>>> You can instead use 'fs -cat' and the 'head' coreutil, as one example:
>>>
>>> hadoop fs -cat 100-byte-dfs-file | head -c 5 > 5-byte-local-file
>>>
>>> On Wed, Feb 20, 2013 at 3:38 AM, jamal sasha <[EMAIL PROTECTED]>
>>> wrote:
>>> > Hi,
>>> >   I was wondering in the following command:
>>> >
>>> > bin/hadoop dfs -copyToLocal hdfspath localpath
>>> > can we specify to copy not the full file, but, say, x MB of it, to
>>> > the local drive?
>>> >
>>> > Is something like this possible?
>>> > Thanks
>>> > Jamal
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>

--
Harsh J