Hadoop >> mail # user >> parallel cat

Thread:
  Rita - 2011-07-06, 10:08
  Steve Loughran - 2011-07-06, 11:35
  Rita - 2011-07-07, 07:22
  Steve Loughran - 2011-07-07, 09:35

Re: parallel cat
Thanks again, Steve.

I will try to implement it with Thrift.
On Thu, Jul 7, 2011 at 5:35 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:

> On 07/07/11 08:22, Rita wrote:
>
>> Thanks Steve. This is exactly what I was looking for. Unfortunately, I
>> don't see any example code for the implementation.
>>
>>
> No. I think I have access to Russ's source somewhere, but there'd be
> paperwork in getting it released. Russ said it wasn't too hard to do; he
> just had to patch the DFS client to offer up the entire list of block
> locations to the client, and let the client program make the decision. If
> you discussed this on the hdfs-dev list (via a JIRA), you may be able to get
> a patch for this accepted, though you'd have to do the code and tests
> yourself.
>
>
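For illustration only: the "offer up the entire list of block locations" half of that idea can already be seen through the public FileSystem API, which reports every replica host for every block of a file. The sketch below just prints that list so a client program could make its own choice; the class name is made up, and this is not Russ's actual DFSClient patch.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper: dump every block of a file and the datanodes
    // holding a replica, so a client program can decide where to read from.
    public class ListBlockReplicas {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));

        // One BlockLocation per block, each listing all replica hosts.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.printf("offset=%d len=%d replicas=%s%n",
              block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
      }
    }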
>> On Wed, Jul 6, 2011 at 7:35 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
>>
>>> On 06/07/11 11:08, Rita wrote:
>>>
>>>> I have many large files ranging from 2GB to 800GB and I use hadoop fs
>>>> -cat a lot to pipe to various programs.
>>>>
>>>> I was wondering if it's possible to prefetch the data for clients with
>>>> more bandwidth. Most of my clients have a 10g interface and the
>>>> datanodes are 1g.
>>>>
>>>> I was thinking: prefetch x blocks (even though it will cost extra
>>>> memory) while reading block y. After block y is read, read the
>>>> prefetched block and then throw it away.
>>>>
>>>> It should be used like this:
>>>>
>>>>
>>>> export PREFETCH_BLOCKS=2 #default would be 1
>>>> hadoop fs -pcat hdfs://namenode/verylarge file | program
>>>>
>>>> Any thoughts?
>>>>
>>>>
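As a rough illustration of the proposal above (there is no such -pcat in Hadoop), a prefetching cat might look like the sketch below: a small thread pool reads up to PREFETCH_BLOCKS blocks ahead with positional reads while the oldest fetched block is written to stdout. PREFETCH_BLOCKS is taken from the example above; the class name PCat and all other details are hypothetical.

    import java.io.OutputStream;
    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical "hadoop fs -pcat": cat a file while prefetching the next
    // PREFETCH_BLOCKS blocks in the background (one block of memory each).
    public class PCat {
      public static void main(String[] args) throws Exception {
        String env = System.getenv("PREFETCH_BLOCKS");
        int prefetch = Math.max(1, env == null ? 1 : Integer.parseInt(env));

        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);
        FileStatus status = fs.getFileStatus(file);
        long blockSize = status.getBlockSize();
        long fileLen = status.getLen();

        ExecutorService pool = Executors.newFixedThreadPool(prefetch);
        FSDataInputStream in = fs.open(file);
        OutputStream out = System.out;

        // Blocks currently being prefetched, oldest first.
        Queue<Future<byte[]>> inFlight = new ArrayDeque<Future<byte[]>>();
        long nextOffset = 0;

        while (nextOffset < fileLen || !inFlight.isEmpty()) {
          // Keep up to 'prefetch' block reads running ahead of the writer.
          while (inFlight.size() < prefetch && nextOffset < fileLen) {
            final long off = nextOffset;
            final int len = (int) Math.min(blockSize, fileLen - off);
            inFlight.add(pool.submit(() -> {
              byte[] buf = new byte[len];
              // Positional read; assumes preads on one stream may run
              // concurrently. If not, open one stream per prefetch thread.
              in.readFully(off, buf);
              return buf;
            }));
            nextOffset += len;
          }
          // Emit the oldest prefetched block, then discard it.
          out.write(inFlight.remove().get());
        }
        out.flush();
        in.close();
        pool.shutdown();
      }
    }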
>>> Look at Russ Perry's work on doing very fast fetches from an HDFS
>>> filestore:
>>> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
>>>
>>>
>>> Here the DFS client got some extra data on where every copy of every
>>> block was, and the client decided which machine to fetch it from. This
>>> made the best use of the entire cluster, by keeping each datanode busy.
>>>
>>>
>>> -steve
>>>
>>>
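A crude sketch of that client-side choice, again using only the public getFileBlockLocations call: each block is assigned to whichever of its replica hosts has the fewest fetches queued so far, which is one simple way of keeping every datanode busy. The least-loaded policy and the class name are only an illustration, not Russ's implementation.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical planner: spread block fetches across replica holders so
    // no single datanode becomes the bottleneck for a large cat.
    public class ReplicaScheduler {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        Map<String, Integer> queued = new HashMap<String, Integer>();
        for (BlockLocation block : blocks) {
          String best = null;
          for (String host : block.getHosts()) {
            // Pick the replica host with the fewest blocks assigned so far.
            if (best == null || queued.getOrDefault(host, 0) < queued.getOrDefault(best, 0)) {
              best = host;
            }
          }
          if (best == null) {
            continue; // no live replica reported for this block
          }
          queued.merge(best, 1, Integer::sum);
          System.out.println("offset " + block.getOffset() + " -> " + best);
        }
      }
    }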
--
--- Get your facts first, then you can distort them as you please.--