MapReduce, mail # user - copy chunk of hadoop output

Re: copy chunk of hadoop output
Harsh J 2013-02-20, 20:21
No problem JM, I was confused as well.

AFAIK, there's no shell utility that lets you specify a byte offset
to start reading from (similar to dd's skip?), but that can be done
via the FS API.
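
A minimal sketch of that FS API approach (the class name, argument
handling and buffer size here are illustrative assumptions, not from
this thread): fs.open() returns a seekable FSDataInputStream, so you
can seek() straight to the offset, and only the blocks covering the
requested range get read:

import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekAndCat {
    public static void main(String[] args) throws Exception {
        Path src = new Path(args[0]);          // HDFS source, e.g. /user/jm/big-file
        long offset = Long.parseLong(args[1]); // byte offset to start reading at
        long length = Long.parseLong(args[2]); // number of bytes to copy
        String dst = args[3];                  // local destination file

        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(src);
             OutputStream out = Files.newOutputStream(Paths.get(dst))) {
            in.seek(offset); // jump to the offset instead of streaming the leading bytes
            byte[] buf = new byte[8192];
            long remaining = length;
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) {
                    break; // EOF reached before `length` bytes were copied
                }
                out.write(buf, 0, n);
                remaining -= n;
            }
        }
    }
}

The same trick covers the tail case discussed below: seek to
fs.getFileStatus(src).getLen() - N and you read only the last N bytes,
instead of streaming the whole file through "tail -c".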

On Thu, Feb 21, 2013 at 1:14 AM, Jean-Marc Spaggiari
<[EMAIL PROTECTED]> wrote:
> Hi Harsh,
>
> My bad.
>
> I read the example quickly and I don't know why I thought you used tail
> and not head.
>
> head will work perfectly. But tail will not, since it will need to read
> the entire file. My comment was for tail, not for head, and therefore
> not applicable to the example you gave.
>
>
> hadoop fs -cat 100-byte-dfs-file | tail -c 5 > 5-byte-local-file
>
> Will have to download the entire file.
>
> Is there a way to "jump" into a certain position in a file and "cat" from there?
>
> JM
>
> 2013/2/20, Harsh J <[EMAIL PROTECTED]>:
>> Hi JM,
>>
>> I am not sure how "dangerous" it is, since we're using a pipe here,
>> and as you yourself note, it will only run until the last bytes
>> have been received and will then terminate.
>>
>> The -cat process will terminate because the process we're piping to
>> will terminate first, once it reaches its goal of -c <N bytes>; so
>> the "-cat" program will certainly not fetch the whole file down,
>> though it may fetch a few extra bytes over the wire due to the use
>> of read buffers (the extra data won't be written to the target file;
>> it is discarded).
>>
>> We can try it out and observe the "clienttrace" line logged at the
>> DN at the end of the -cat's read. Here's an example:
>>
>> I wrote ~1.6 MB into a file called "foo.jar"; see "bytes"
>> below, it's ~1.58 MB:
>>
>> 2013-02-20 23:55:19,777 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>> /127.0.0.1:58785, dest: /127.0.0.1:50010, bytes: 1658314, op:
>> HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_915204057_1, offset: 0,
>> srvID: DS-1092147940-192.168.2.1-50010-1349279636946, blockid:
>> BP-1461691939-192.168.2.1-1349279623549:blk_2568668834545125596_73870,
>> duration: 192289000
>>
>> I ran the command "hadoop fs -cat foo.jar | head -c 5 > foo.xml" to
>> store the first 5 bytes into a local file:
>>
>> Asserting that, post-command, we get 5 bytes:
>> ➜  ~ wc -c foo.xml
>>        5 foo.xml
>>
>> Asserting that the DN didn't IO-read the whole file: see the read op
>> below and its "bytes" parameter; it's only about 193 KB, not the
>> whole block of ~1.58 MB we wrote earlier:
>>
>> 2013-02-21 00:01:32,437 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>> /127.0.0.1:50010, dest: /127.0.0.1:58802, bytes: 198144, op:
>> HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-1698829178_1, offset: 0,
>> srvID: DS-1092147940-192.168.2.1-50010-1349279636946, blockid:
>> BP-1461691939-192.168.2.1-1349279623549:blk_2568668834545125596_73870,
>> duration: 19207000
>>
>> I don't see how this is any more dangerous than doing a
>> -copyToLocal/-get, which retrieves the whole file anyway?
>>
>> On Wed, Feb 20, 2013 at 9:25 PM, Jean-Marc Spaggiari
>> <[EMAIL PROTECTED]> wrote:
>>> But be careful.
>>>
>>> hadoop fs -cat will retrieve the entire file and will only finish
>>> when it has retrieved the last bytes you are looking for.
>>>
>>> If your file is many GB in size, this command will take a long time
>>> to complete and will put some pressure on your network.
>>>
>>> JM
>>>
>>> 2013/2/19, jamal sasha <[EMAIL PROTECTED]>:
>>>> Awesome thanks :)
>>>>
>>>>
>>>> On Tue, Feb 19, 2013 at 2:14 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> You can instead use 'fs -cat' and the 'head' coreutil, as one example:
>>>>>
>>>>> hadoop fs -cat 100-byte-dfs-file | head -c 5 > 5-byte-local-file
>>>>>
>>>>> On Wed, Feb 20, 2013 at 3:38 AM, jamal sasha <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>> > Hi,
>>>>> >   I was wondering, in the following command:
>>>>> >
>>>>> > bin/hadoop dfs -copyToLocal hdfspath localpath
>>>>> > can we specify to copy not the full file but, say, only x MBs of
>>>>> > it to the local drive?
>>>>> >
>>>>> > Is something like this possible?

Harsh J