Re: Table Wrapper
Well, you got a few suggestions there, Peter. That in itself is reason to
celebrate!

And that was a good description; I fault you not for going into some
detail. The part about keeping it simple is always a challenge, I know. :)

I get your point, but I don't really have anything more to offer. If running
an MR job is a blocker, I'm not in a position to suggest anything.
On Thu, Jun 27, 2013 at 3:14 AM, Peter Marron <[EMAIL PROTECTED]> wrote:

> Hi,
>
> If you’re suggesting that I use something like
>
> SELECT * FROM data WHERE MyUdf(data.BLOCK__OFFSET__INSIDE__FILE);
>
> rather than
>
> SELECT * FROM data JOIN small ON data.BLOCK__OFFSET__INSIDE__FILE > small.offset;
>
> then, yes, I have thought of that. However, the fact is that reading the
> billions of records and filtering them is too slow compared to doing the seeks.
>
> Partitioning would help, but I can’t assume that the big data is
> partitioned in a way that suits this particular query.
>
> Z
>
> From: Jan Dolinár [mailto:[EMAIL PROTECTED]]
> Sent: 27 June 2013 10:59
> To: user
> Subject: Re: Table Wrapper
>
> A slightly less "hackish" way to do this without joins is to write a custom
> UDF that takes data.BLOCK__OFFSET__INSIDE__FILE as an input parameter and
> returns the corresponding data from the small file. If you mark it
> deterministic using @UDFType(deterministic = true), the performance
> should be quite good.
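
For illustration, a minimal sketch of how such a UDF might be wired in once
written (the jar path, class name, and function name below are hypothetical):

    -- assumes com.example.OffsetFilterUDF is a custom UDF compiled into a jar,
    -- marked with @UDFType(deterministic = true) as suggested above
    ADD JAR /tmp/offset-filter-udf.jar;
    CREATE TEMPORARY FUNCTION offset_wanted AS 'com.example.OffsetFilterUDF';

    -- the UDF returns true only for offsets that match the small file
    SELECT * FROM data WHERE offset_wanted(data.BLOCK__OFFSET__INSIDE__FILE);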
>
> To avoid the full table scan, partitioning is IMHO the best way to speed
> things up.
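
As a sketch of what a partitioned layout could look like (the table, column,
and partition names below are hypothetical):

    -- a date-partitioned copy of the data; a query that filters on dt
    -- reads only the matching partitions instead of the whole table
    CREATE TABLE data_part (key STRING, value STRING)
    PARTITIONED BY (dt STRING);

    SELECT * FROM data_part WHERE dt = '2013-06-27';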
>
> Best regards,
>
> J. Dolinar
>
> On Thu, Jun 27, 2013 at 11:18 AM, Peter Marron <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I have thought about a map-only join, but as I understand it this is still
> going to do a full table scan on my large data file. If this is billions of
> records then it’s still going to be slow, even if it only returns a handful
> of records.
>
> Also I don’t know of any way to get Hive to do a join without performing a
> Map/Reduce. And, as I mentioned before, just the overhead of setting up a
> Map/Reduce, even if it’s map-only and does practically nothing, makes the
> elapsed time too high. I want it to be interactive. (I guess that something
> like Tez, when it becomes available, might solve this problem…)
>
> My ugly “hack” approach works in seconds; the overhead of setting up a
> Map/Reduce takes this into minutes.
>
> Indexing looks promising but, as far as I can see, it can’t be done
> without a Map/Reduce.
>
> If I could find a way to perform a join or use indexing without a
> Map/Reduce, I would be happy to use that approach.
>
> Partitioning and ORC would be helpful, but I can’t assume anything about
> the original data format.
>
> Z
>
> From: Nitin Pawar [mailto:[EMAIL PROTECTED]]
> Sent: 27 June 2013 09:52
> To: [EMAIL PROTECTED]
> Subject: Re: Table Wrapper
>
> A few thoughts:
>
> If you have a smaller file (in the size of MBs), have you tried considering
> a map-only join?
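
For reference, a sketch of requesting a map-side join with the table names
from this thread (shown with an equality condition, since Hive's join ON
clauses have historically required equi-conditions):

    -- let Hive convert the join to a map-only join when "small" fits in memory
    SET hive.auto.convert.join=true;

    -- or request the map join explicitly with a hint
    SELECT /*+ MAPJOIN(small) */ data.*
    FROM data JOIN small
      ON data.BLOCK__OFFSET__INSIDE__FILE = small.offset;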
>
> Also, if you are interested in particular records from a table and do not
> want to go through the entire table to find them, then partitioning +
> indexing will be handy.
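
On the indexing side, a sketch using Hive's COMPACT index (the index and
column names are hypothetical; note that populating the index itself runs as
a Map/Reduce job, which is the limitation Peter notes in his reply above):

    CREATE INDEX data_key_idx ON TABLE data (key)
    AS 'COMPACT' WITH DEFERRED REBUILD;

    -- populate the index; this step launches a Map/Reduce job
    ALTER INDEX data_key_idx ON data REBUILD;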
>
> The ORCFile format (still very new) can help you in this regard as well.
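
For example, a sketch of rewriting an existing table into ORC (available from
Hive 0.11; the target table name is hypothetical):

    -- ORC's built-in indexes let readers skip row groups
    -- that cannot match a query predicate
    CREATE TABLE data_orc STORED AS ORC
    AS SELECT * FROM data;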
>
> On Thu, Jun 27, 2013 at 2:16 PM, Peter Marron <[EMAIL PROTECTED]> wrote:
>
> Well, I’m not very good at keeping things brief, unfortunately.
>
> But I’ll have a go, trying to keep things simple.
>
> Suppose that I have a data table in Hive and it has many rows – say
> billions.
>
> I have another file stored in HDFS (it can be a Hive table too if it helps)