Re: Parts of a file as input
Hi Franc
        Adding on to Harsh's response: if you partition your data accordingly in Hive, you can easily switch full data scans on and off. Partitions and sub-partitions (multi-level partitions) help you hit only the required data set. How to partition depends entirely on your use cases and the queries intended for the data set. If you are looking at sampling, you may need to incorporate buckets as well.
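[Editorial sketch: Hive stores each partition as its own directory (e.g. `dt=2012-03-27/entity_group=a/`), so a query filtering on the partition keys skips whole directories without reading them. The layout and column names below are hypothetical, but the pruning mechanism is the one Bejoy describes.]

```python
# Toy model of Hive-style multi-level partition pruning.
# Directory names and columns (dt, entity_group) are illustrative only.
import os
import tempfile

def make_partition(root, dt, group, rows):
    """Create one partition directory and a data file inside it."""
    d = os.path.join(root, "dt=%s" % dt, "entity_group=%s" % group)
    os.makedirs(d)
    with open(os.path.join(d, "part-00000"), "w") as f:
        f.write("\n".join(rows))

def pruned_paths(root, want_dt):
    """Return only files under partitions matching want_dt, mimicking
    the pruning Hive does for WHERE dt = '...'."""
    out = []
    for dt_dir in sorted(os.listdir(root)):
        if dt_dir != "dt=%s" % want_dt:
            continue  # whole partition skipped; its data is never opened
        for sub, _, files in os.walk(os.path.join(root, dt_dir)):
            out.extend(os.path.join(sub, f) for f in files)
    return out

root = tempfile.mkdtemp()
make_partition(root, "2012-03-26", "a", ["e1,10"])
make_partition(root, "2012-03-27", "a", ["e2,20"])
paths = pruned_paths(root, "2012-03-27")
print(len(paths))  # only the matching partition's files are selected
```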
Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Franc Carter <[EMAIL PROTECTED]>
Date: Tue, 27 Mar 2012 17:26:49
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Re: Parts of a file as input

On Tue, Mar 27, 2012 at 5:22 PM, Franc Carter <[EMAIL PROTECTED]> wrote:

> On Tue, Mar 27, 2012 at 5:09 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>
>> Franc,
>>
>> With the given info, all we can tell is that it is possible but we
>> can't tell how as we have no idea how your data/dimensions/etc. are
>> structured. Being a little more specific would help.
>>
>
> Thanks, I'll go into more detail.
>
> We have data for a large number of entities (tens of millions) covering 15+
> years with fairly fine-grained timestamps (though day granularity would
> suffice).
>
> At the extremes, some queries will need a small number of entities across
> all 15 years, while others will need most of the entities for a small time
> range.
>
> Our current architecture (which we are reviewing) stores the data in 'day
> files' with a sort order that increases the chance that the data we want
> will be close together. We can then seek inside the files and only
> retrieve/process the parts we need.
>
> I'd like to avoid Hadoop having to read and process all of every file to
> answer queries that don't need all the data.
>
> Is that clearer?
>
I should also add that we know the entities and the time range we are
interested in at query submission time.
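[Editorial sketch: since the time range is known at submission, the job driver can hand Hadoop only the day files inside that range rather than the whole directory (`FileInputFormat` accepts an explicit list of input paths). The file-naming scheme below is an assumption based on the "day files" described above.]

```python
# Select only the day files inside a query's date range as job inputs.
# The "data/YYYY-MM-DD.dat" naming is hypothetical.
from datetime import date, timedelta

def day_files_for_range(start, end):
    """List the input paths for the inclusive range [start, end]."""
    d, out = start, []
    while d <= end:
        out.append("data/%s.dat" % d.isoformat())
        d += timedelta(days=1)
    return out

paths = day_files_for_range(date(2012, 3, 25), date(2012, 3, 27))
print(paths)  # three day files, nothing outside the range
```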
>
>
>> It is possible to select and pass the right set of inputs per job, and
>> to also implement record readers to only read what is needed
>> specifically. This all depends on how your files are structured.
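[Editorial sketch of the record-reader idea Harsh mentions: given a day file sorted by entity plus a small index of entity → (byte offset, length), a reader can seek straight to the wanted records instead of scanning the file. The format and index are illustrative, not Hadoop API code.]

```python
# Seek-based reading of one entity's records from a sorted day file.
# Index structure and record layout are hypothetical.
import io

def read_entity(f, index, entity):
    """Seek to and return only the records for one entity."""
    off, length = index[entity]
    f.seek(off)
    return f.read(length)

data = b"e1,10\ne2,20\ne2,21\ne3,30\n"
index = {"e1": (0, 6), "e2": (6, 12), "e3": (18, 6)}
f = io.BytesIO(data)
print(read_entity(f, index, "e2"))  # b'e2,20\ne2,21\n'
```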
>>
>> Taking a wild guess, Apache Hive with its columnar storage (RCFile)
>> format may also be what you are looking for.
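[Editorial sketch of the columnar idea behind RCFile: values of each column are stored together, so a query touching one dimension reads only that column's bytes. This toy version just regroups rows into per-column lists; real RCFile does this per row-group on disk.]

```python
# Row-oriented vs. column-oriented access, in miniature.
# Rows and column names are illustrative.
rows = [("e1", "2012-03-27", 10), ("e2", "2012-03-27", 20)]

# Column-oriented: one sequence per column.
columns = {
    "entity": [r[0] for r in rows],
    "dt":     [r[1] for r in rows],
    "value":  [r[2] for r in rows],
}

# A query needing only "value" touches 2 cells instead of all 6.
print(columns["value"])  # [10, 20]
```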
>>
>
> Thanks, I'll have a look into that.
>
> cheers
>
>
>>
>> On Tue, Mar 27, 2012 at 11:32 AM, Franc Carter
>> <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> >
>> > I'm very new to Hadoop and am working through how we may be able to
>> > apply it to our data set.
>> >
>> > One of the things that I am struggling with is understanding whether it
>> > is possible to tell Hadoop that only parts of the input file will be
>> > needed for a specific job. The reason I believe I may need this is that
>> > we have two big dimensions in our data set. Queries may want only one of
>> > these dimensions, and while some unneeded reading is unavoidable, there
>> > are cases where reading the entire data set presents a very significant
>> > overhead.
>> >
>> > Or have I just misunderstood something ;-(
>> >
>> > thanks
>> >
>> > --
>> >
>> > *Franc Carter* | Systems architect | Sirca Ltd
>> >  <[EMAIL PROTECTED]>
>> >
>> > [EMAIL PROTECTED] | www.sirca.org.au
>> >
>> > Tel: +61 2 9236 9118
>> >
>> > Level 9, 80 Clarence St, Sydney NSW 2000
>> >
>> > PO Box H58, Australia Square, Sydney NSW 1215
>>
>>
>>
>> --
>> Harsh J
>>
>
>
>
--

*Franc Carter* | Systems architect | Sirca Ltd
 <[EMAIL PROTECTED]>

[EMAIL PROTECTED] | www.sirca.org.au

Tel: +61 2 9236 9118

Level 9, 80 Clarence St, Sydney NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215
