MapReduce >> mail # user >> copy chunk of hadoop output


Re: copy chunk of hadoop output
Hi Harsh,

My bad.

I read the example quickly and I don't know why I thought you used tail
and not head.

head will work perfectly. But tail will not, since it needs to read the
entire file. My comment was about tail, not head, and is therefore not
applicable to the example you gave:
hadoop fs -cat 100-byte-dfs-file | tail -c 5 > 5-byte-local-file

Will have to download the entire file.

Is there a way to "jump" into a certain position in a file and "cat" from there?

JM
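For illustration, here is what a byte-offset "jump" looks like with plain
coreutils on a local file. This is only a local sketch (the file name is
made up, not from the thread); piping `hadoop fs -cat` into `dd` would still
stream all the leading bytes over the network, and a true server-side seek
would need the HDFS Java client's seek() rather than the shell.

```shell
# Local sketch only: dd can skip to a byte offset and read from there.
printf 'abcdefghij' > ten-byte-local-file          # hypothetical sample file
dd if=ten-byte-local-file bs=1 skip=7 2>/dev/null  # prints: hij
```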

2013/2/20, Harsh J <[EMAIL PROTECTED]>:
> Hi JM,
>
> I am not sure how "dangerous" it is, since we're using a pipe here,
> and as you yourself note, it will only run until the last bytes
> have been received and then terminate.
>
> The -cat process will terminate because the
> process we're piping to will terminate first once it reaches its goal
> of -c <N bytes>; so the "-cat" program will certainly not fetch the
> whole file down, but it may fetch a few extra bytes over the wire
> due to read buffering (the extra data won't be put into the target
> file; it gets discarded).
>
> We can try it out and observe the "clienttrace" logged
> at the DN at the end of the -cat's read. Here's an example:
>
> I wrote a ~1.6 MB file called "foo.jar"; see "bytes"
> below, it's ~1.58 MB:
>
> 2013-02-20 23:55:19,777 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /127.0.0.1:58785, dest: /127.0.0.1:50010, bytes: 1658314, op:
> HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_915204057_1, offset: 0,
> srvID: DS-1092147940-192.168.2.1-50010-1349279636946, blockid:
> BP-1461691939-192.168.2.1-1349279623549:blk_2568668834545125596_73870,
> duration: 192289000
>
> I ran the command "hadoop fs -cat foo.jar | head -c 5 > foo.xml" to
> store first 5 bytes onto a local file:
>
> Asserting that post command we get 5 bytes:
> ➜  ~ wc -c foo.xml
>        5 foo.xml
>
> Asserting that the DN didn't read the whole file: see the read op below
> and its "bytes" parameter, it's only about 193 KB, not the whole block
> of 1.58 MB we wrote earlier:
>
> 2013-02-21 00:01:32,437 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /127.0.0.1:50010, dest: /127.0.0.1:58802, bytes: 198144, op:
> HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-1698829178_1, offset: 0,
> srvID: DS-1092147940-192.168.2.1-50010-1349279636946, blockid:
> BP-1461691939-192.168.2.1-1349279623549:blk_2568668834545125596_73870,
> duration: 19207000
>
> I don't see how this is any more dangerous than doing a
> -copyToLocal/-get, which retrieves the whole file anyway?
>
> On Wed, Feb 20, 2013 at 9:25 PM, Jean-Marc Spaggiari
> <[EMAIL PROTECTED]> wrote:
>> But be careful.
>>
>> hadoop fs -cat will retrieve the entire file and will finish only when
>> it has retrieved the last bytes you are looking for.
>>
>> If your file is many GB big, it will take a lot of time for this
>> command to complete and will put some pressure on your network.
>>
>> JM
>>
>> 2013/2/19, jamal sasha <[EMAIL PROTECTED]>:
>>> Awesome thanks :)
>>>
>>>
>>> On Tue, Feb 19, 2013 at 2:14 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>>>
>>>> You can instead use 'fs -cat' and the 'head' coreutil, as one example:
>>>>
>>>> hadoop fs -cat 100-byte-dfs-file | head -c 5 > 5-byte-local-file
>>>>
>>>> On Wed, Feb 20, 2013 at 3:38 AM, jamal sasha <[EMAIL PROTECTED]>
>>>> wrote:
>>>> > Hi,
>>>> >   I was wondering in the following command:
>>>> >
>>>> > bin/hadoop dfs -copyToLocal hdfspath localpath
>>>> > can we specify to copy not the full file but only, say, x MB of it
>>>> > to the local drive?
>>>> >
>>>> > Is something like this possible
>>>> > Thanks
>>>> > Jamal
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>>
>>>
>
>
>
> --
> Harsh J
>
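The early-exit behaviour Harsh describes can be reproduced locally with
plain coreutils, no HDFS needed: once `head` has its bytes it closes the
pipe, the producer receives SIGPIPE, and the "whole file" is never written.

```shell
# `yes` would write forever, but once head has its 5 bytes the pipe
# closes and the producer is killed by SIGPIPE, so only 5 bytes land:
yes | head -c 5 > five-byte-local-file
wc -c < five-byte-local-file   # byte count: 5
```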