Re: Large Scale Table Reprocess
I believe:

alter table _tablename_ set fileformat orc;

will do what you want. All future partitions will be written in ORC format (assuming you use insert to create them), or will be assumed to contain ORC data if you do alter table add partition.
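
A minimal HiveQL sketch of both paths (the table, column, and partition names here are hypothetical):

-- change the table-level default; existing partitions are untouched
alter table sales set fileformat orc;

-- a partition created by insert is written as ORC ...
insert overwrite table sales partition (dt='2013-07-27')
select col1, col2 from sales_staging where dt='2013-07-27';

-- ... while a partition added by DDL is assumed to already hold ORC data
alter table sales add partition (dt='2013-07-28')
location '/data/sales/dt=2013-07-28';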

As to whether ORC will change significantly enough to require a reprocess: we don't usually change file formats in non-backward-compatible ways, for exactly this reason. And I know ORC stores much of its control information in protobufs in order to support changes going forward.

Alan.

On Jul 26, 2013, at 3:17 PM, John Omernik wrote:

> More specifically, we have a table that is currently defined as RCFile; to do this, I'd like to define all new partitions as ORC. With the advent of ORC, these types of problems are going to come up for many folks, so any guidance would be appreciated ...
>
> Also, based on the strategic goals of ORC files, do you see ORC files changing significantly (i.e., to the point where we have to do another reprocess)?
>
>
>
> On Fri, Jul 26, 2013 at 5:09 PM, John Omernik <[EMAIL PROTECTED]> wrote:
> Can you give some examples of how to alter partitions for different input types? I'd appreciate it :)
>
>
> On Fri, Jul 26, 2013 at 3:29 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
> A table can definitely have partitions with different input formats/serdes.  We test this all the time.
>
> Assuming your old data doesn't stay forever and most of your queries are on more recent data (which is usually the case), I'd advise you not to reprocess any data; just alter the table to store new partitions in ORC. Then over time you'll slowly transition the table to ORC. This avoids all the issues you noted. And since most queries probably only access recent data, you'll see speedups soon after the switch.
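
A minimal sketch of the mixed-format state described above (table and partition names are hypothetical): the table-level switch only changes the default for new partitions, while each existing partition keeps the format recorded in its own metadata, which can also be set explicitly per partition.

-- new partitions default to ORC after the table-level switch ...
alter table logs set fileformat orc;

-- ... while an existing partition keeps its original format,
-- which can also be pinned explicitly at the partition level
alter table logs partition (dt='2013-01-01') set fileformat rcfile;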
>
> Alan.
>
> On Jul 25, 2013, at 4:45 PM, John Omernik wrote:
>
> > Just finishing up testing with Hive 11 and ORC. Thank you to Owen and all those who have put hard work into this. ORC files alone, compared to RC files in Hive 9, 10, and 11, showed a huge increase in performance; it was amazing. That said, now we've got to reprocess.
> >
> >
> > We have a large table with lots of partitions. I'd love to be able to reprocess into a new table, like table_orc, and then at the end of it all just drop the original table. That said, I see it being hard to do from a space perspective, and I will have to do a partition at a time. But then there are production issues: if I update a partition with insert overwrite into the ORC table, then I have to delete the original, and production users will be missing data.... decisions, decisions.
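
A rough sketch of the partition-at-a-time reprocess being weighed here (all table, column, and partition names are hypothetical):

-- one-time: a parallel table with the same schema, stored as ORC
create table events_orc (col1 string, col2 bigint)
partitioned by (dt string)
stored as orc;

-- per partition: rewrite the data, then reclaim the old copy
insert overwrite table events_orc partition (dt='2013-07-01')
select col1, col2 from events where dt='2013-07-01';

alter table events drop partition (dt='2013-07-01');

Dropping the source partition only after the ORC copy lands keeps the data queryable throughout, at the cost of temporarily holding both copies of that partition.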
> >
> > So any ideas? Can a table have some partitions in one file type and other partitions in another? That sounds scary.  Anywho, a good problem to have... that performance will be worth it.
> >
> >
>
>
>