Re: Table Wrapper
A few thoughts:

If you have a smaller file (on the order of MBs), have you considered a
map-only join (map join)?
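A minimal sketch of that idea, reusing the data, small, and offset names
from the query quoted further down; the MAPJOIN hint and the auto-convert
settings are standard Hive, but the threshold value shown is just the usual
default:

    -- Hint Hive to load the small table into memory and join map-side,
    -- which avoids the shuffle and reduce phases of a common join:
    SELECT /*+ MAPJOIN(small) */ data.*
    FROM data
    JOIN small
      ON data.BLOCK__OFFSET__INSIDE__FILE = small.offset;

    -- Or let Hive convert joins automatically when one side is small
    -- enough (25 MB is the usual default threshold):
    SET hive.auto.convert.join=true;
    SET hive.mapjoin.smalltable.filesize=25000000;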
Also, if you are interested in particular records from a table and do not
want to scan the entire table to find them, then partitioning + indexing
will be handy.
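A rough illustration with hypothetical table and column names (events, dt,
user_id); the compact index handler is the one that ships with Hive:

    -- Partitioning: queries that filter on dt read only the matching
    -- partition directories instead of the whole table.
    CREATE TABLE events (user_id BIGINT, payload STRING)
    PARTITIONED BY (dt STRING);

    -- Indexing: a compact index on a frequently filtered column lets
    -- Hive skip blocks rather than scan every record.
    CREATE INDEX events_user_idx ON TABLE events (user_id)
    AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    WITH DEFERRED REBUILD;
    ALTER INDEX events_user_idx ON events REBUILD;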

The ORC file format (still very new) can help you in this regard as well.
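For instance, something like the following; ORC only landed in Hive 0.11,
so treat this as a sketch rather than tested advice. ORC keeps lightweight
min/max statistics per stripe, which lets readers skip data that cannot
match a filter:

    -- Rewrite the (hypothetical) events table into ORC storage:
    CREATE TABLE events_orc (user_id BIGINT, payload STRING)
    STORED AS ORC;

    INSERT OVERWRITE TABLE events_orc
    SELECT user_id, payload FROM events;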
On Thu, Jun 27, 2013 at 2:16 PM, Peter Marron <[EMAIL PROTECTED]> wrote:

> Well, I’m not very good at keeping things brief, unfortunately.
> But I’ll have a go, trying to keep things simple.
>
> Suppose that I have a data table in Hive and it has many rows – say
> billions. I have another file stored in HDFS (it can be a Hive table too
> if it helps), and this file is small and contains file offsets into the
> data, stored as binary, 8 bytes per offset. Now suppose that I want to
> read the records from the data defined by the offsets in the small file,
> in the order defined in the small file.
>
> How can I do that?
>
> The obvious way is to turn the small file into a Hive table and provide
> a custom InputFormat which can read the binary. I’ve done that, that’s
> the easy part, and then I could form a query like this:
>
>     SELECT * FROM data JOIN small
>     ON data.BLOCK__OFFSET__INSIDE__FILE = small.offset;
>
> But, when it works, this performs awfully.
>
> The approach that I have taken is to create a “copy” of the data table
> which is “hacked” to use a custom input format which knows about the
> small file and which overrides the record reader to use the offsets as
> seeks before it reads the records. This is awkward, for various reasons,
> but it works well. I can avoid a full table scan, in fact I can suppress
> any Map/Reduce, and so the query runs very quickly.
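For concreteness, such a “hacked” copy might be declared roughly like this;
the class name com.example.OffsetSeekingInputFormat, the columns, the paths,
and the offsets.file property are all hypothetical stand-ins for the custom
pieces described above:

    -- An external table that reuses the original data files but swaps in
    -- an InputFormat whose record reader seeks to each stored offset:
    CREATE EXTERNAL TABLE data_by_offsets (id BIGINT, payload STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    STORED AS
      INPUTFORMAT 'com.example.OffsetSeekingInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION '/warehouse/data'
    TBLPROPERTIES ('offsets.file'='/tmp/offsets.bin');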
>
> So I was just trying to “wrap” the data table so that I didn’t have to
> create the copy.
>
> I hope that you don’t regret asking too much.
>
> Regards,
>
> Z
>
> *From:* Stephen Sprague [mailto:[EMAIL PROTECTED]]
> *Sent:* 25 June 2013 18:37
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: Table Wrapper
>
> Good luck, bro. :) May I ask why you are doing this to yourself? I think
> your instincts are correct: going down the path you describe sounds a tad
> more painful than just hitting yourself in the head with a hammer.
> Different strokes for different folks, though.
>
> So can we back up? What, briefly if possible, do you want to achieve
> with a “wrapper”? (I’m going to regret asking that, I know.)
>
> On Tue, Jun 25, 2013 at 7:29 AM, Peter Marron <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> Running Hive 0.11.0 over Hadoop 1.0.4.
>
> I would like to be able to “wrap” a Hive table.
>
> So, if I have table “X” which uses SerDe “s” and InputFormat “i”,
> then I would like to be able to create a table “Y” which has a
> SerDe “ws” which is a wrapper of “s” (and so can encapsulate an
> instance of “s”) and an InputFormat “wi” which is a wrapper of “i”
> (and similarly encapsulates an instance of “i”). So far I have done
> this by creating a table like this:
>
> CREATE TABLE Y (… copy of underlying table’s columns...)
> ROW FORMAT SERDE 'ws'
> WITH SERDEPROPERTIES (…
>   'wrapped.serde.name'='org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
>   'wrapped.inputformat.name'='TextInputFormat',
>   'serialization.format'='|',
>   'field.delim'='|'
> )
> STORED AS
>   INPUTFORMAT 'wi'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> TBLPROPERTIES (…);
>
> I have to add the names of the underlying classes “s” and “i”

Nitin Pawar