Re: Running queries using index on HDFS
To add to what Bobby said, you can get block locations with
fs.getFileBlockLocations() if you want to open files based on locality.
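
For example, something like this (just a rough sketch; the class name is arbitrary and the path comes from the command line):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path p = new Path(args[0]);      // e.g. one of the Rtree subtree files on HDFS
        FileSystem fs = p.getFileSystem(conf);
        FileStatus status = fs.getFileStatus(p);

        // Ask the NameNode which datanodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset " + b.getOffset() + ", length " + b.getLength()
                + ", hosts " + Arrays.toString(b.getHosts()));
        }
    }
}

The hosts reported for each block are where you would want to schedule the work for that file.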

-Joey

On Mon, Jul 25, 2011 at 3:00 PM, Robert Evans <[EMAIL PROTECTED]> wrote:
> Sofia,
>
> You can access any HDFS file from a normal Java application so long as your classpath and some configuration are set up correctly.  That is all the hadoop jar command does: it is a shell script that sets up the environment for Java to work with Hadoop.  Look at the example for the Tool class:
>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Tool.html
>
> If you delete the JobConf stuff, you can then just talk to the FileSystem by doing the following:
>
> Configuration conf = new Configuration();  // or getConf() if you keep the Tool/Configured setup
> Path p = new Path("URI OF FILE TO OPEN");
> FileSystem fs = p.getFileSystem(conf);
> InputStream in = fs.open(p);
>
> Now you can use the in stream to read your data.  Just be sure to close it when you are done.
>
> --Bobby Evans
>
>
>
> On 7/25/11 4:40 PM, "Sofia Georgiakaki" <[EMAIL PROTECTED]> wrote:
>
> Good evening,
>
> I have built an Rtree on HDFS in order to improve the query performance of high-selectivity spatial queries.
> The Rtree is composed of a number of HDFS files (each one created by one Reducer, so the number of files equals the number of reducers), where each file is a subtree of the root of the Rtree.
> I am investigating how to use the Rtree efficiently, with respect to the locality of each file on HDFS (data placement).
>
>
> I would like to ask if it is possible to read a file that is on HDFS from a plain Java application (not MapReduce).
> If this is not possible (as I believe), I should either download the files to the local filesystem (which is not a solution, since the files could be very large) or run the queries using Hadoop.
> In order to maximise the gain, I should probably process a batch of queries during each Job, and run each query on a node that is "near" the files involved in handling that specific query.
>
> Can I find the node where each file is located (or at least most of its blocks), and run on that node a reducer that handles these queries?  Could the function DFSClient.getBlockLocations() help?
>
> Thank you in advance,
> Sofia
>
>
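
For reference, Bobby's snippet above rolled into a small standalone program (the class name is arbitrary and the HDFS path is taken from the command line; the try/finally takes care of the close):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadHdfsFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml from the classpath
        Path p = new Path(args[0]);                // HDFS path passed on the command line
        FileSystem fs = p.getFileSystem(conf);

        FSDataInputStream in = fs.open(p);
        try {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                System.out.write(buf, 0, n);       // copy the file contents to stdout
            }
        } finally {
            in.close();                            // always close the stream when done
        }
        System.out.flush();
    }
}

You can run it either with hadoop jar, or directly with java as Bobby describes, as long as the Hadoop jars and configuration directory are on the classpath.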

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434