Hadoop, mail # user - has bzip2 compression been deprecated?

Re: has bzip2 compression been deprecated?
Bejoy Ks 2012-01-10, 18:01
Hi Tony
      Please find responses inline

So, to summarise: when I CREATE EXTERNAL TABLE in Hive, the STORED AS, ROW
FORMAT and other parameters you mention are telling Hive what to expect
when it reads the data I want to analyse, despite not checking the data to
see if it meets these criteria?

[Bejoy] Yes, no data format validation is performed on CREATE TABLE. Any
data issues surface only when you QUERY the table.

Do these guidelines still apply if the table is not EXTERNAL?

[Bejoy] Yes. EXTERNAL tables are not very different from Hive managed
tables (normal tables). The basic difference is that with CREATE TABLE the
data directory is created under /user/hive/warehouse (in the default
configuration), whereas with EXTERNAL TABLES you can point to any directory
in HDFS as the data directory. The main difference to keep in mind is that
if you DROP an EXTERNAL TABLE, the data directory in HDFS is not deleted,
whereas for normal tables it is deleted (you lose the data completely).
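The DROP behavior described above can be sketched as a toy Python model. This is plain stdlib Python, not Hive code; the metastore dict and helper functions (`create_table`, `drop_table`) are hypothetical names invented for illustration:

```python
import shutil
import tempfile
from pathlib import Path

# Toy model: a "table" is a metadata entry plus a data directory,
# mirroring the managed-vs-external behavior described above.
metastore = {}

def create_table(name, data_dir, external=False):
    # Managed tables get a directory under the warehouse; external
    # tables simply point at a directory the user chose.
    Path(data_dir).mkdir(parents=True, exist_ok=True)
    metastore[name] = {"dir": Path(data_dir), "external": external}

def drop_table(name):
    entry = metastore.pop(name)
    # DROP on a managed table deletes the data directory; DROP on an
    # EXTERNAL table removes only the metadata entry.
    if not entry["external"]:
        shutil.rmtree(entry["dir"])
    return entry["dir"]

warehouse = Path(tempfile.mkdtemp())
create_table("managed_t", warehouse / "managed_t")
create_table("external_t", warehouse / "ext_data", external=True)
(warehouse / "ext_data" / "part-00000").write_text("some rows\n")

managed_dir = drop_table("managed_t")
external_dir = drop_table("external_t")
print(managed_dir.exists())   # False -- managed data dir is gone
print(external_dir.exists())  # True -- external data dir survives
```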


On Tue, Jan 10, 2012 at 5:12 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> Tony,
> Sorry for being ambiguous, I was too lazy to search at the time. This has
> been the case since release 0.18.0. See
> https://issues.apache.org/jira/browse/HADOOP-2095 for more information.
> On 10-Jan-2012, at 4:18 PM, Tony Burton wrote:
> > Thanks all for advice - one more question on re-reading Harsh's helpful
> reply. " Intermediate (M-to-R) files use a custom IFile format these days".
> How recently is "these days", and can this addition be pinned down to any
> one version of Hadoop?
> >
> > Tony
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Harsh J [mailto:[EMAIL PROTECTED]]
> > Sent: 09 January 2012 16:50
> > Subject: Re: has bzip2 compression been deprecated?
> >
> > Tony,
> >
> > * Yeah, SequenceFiles aren't human-readable, but "fs -text" can read
> them out (instead of a plain "fs -cat"). But if you are going to export
> your files into a system you do not have much control over, it is probably
> best for the resultant files not to be in SequenceFile/Avro-DataFile format.
> > * Intermediate (M-to-R) files use a custom IFile format these days,
> which is built purely for that purpose.
> > * Hive can use SequenceFiles very well. There is also documented info on
> this in Hive's wiki pages (check the DDL pages, IIRC).
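The "fs -text" vs "fs -cat" distinction in the first point can be illustrated with a rough stdlib analogue for a bz2-compressed text file. This is plain Python, not Hadoop's codec path, and the file name is a made-up stand-in for a job's output:

```python
import bz2
import os
import tempfile

# Write a small bz2-compressed "output file", standing in for a job's
# compressed output (a toy file, not real MapReduce output).
path = os.path.join(tempfile.mkdtemp(), "part-00000.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    f.write("key1\tvalue1\nkey2\tvalue2\n")

# "fs -cat" equivalent: raw bytes -- not human-readable.
with open(path, "rb") as f:
    raw = f.read()
print(raw[:3])  # b'BZh' -- the bzip2 magic number

# "fs -text" equivalent: decode first, then print.
with bz2.open(path, "rt", encoding="utf-8") as f:
    print(f.read(), end="")
```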
> >
> > On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
> >
> >> Thanks for the quick reply and the clarification about the
> documentation.
> >>
> >> Regarding sequence files: am I right in thinking that they're a good
> choice for intermediate steps in chained MR jobs, or for file transfer
> between the Map and the Reduce phases of a job; but they shouldn't be used
> for human-readable files at the end of one or more MapReduce jobs? How
> about if the only use a job's output is analysis via Hive - can Hive create
> tables from sequence files?
> >>
> >> Tony
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Harsh J [mailto:[EMAIL PROTECTED]]
> >> Sent: 09 January 2012 15:34
> >> Subject: Re: has bzip2 compression been deprecated?
> >>
> >> Bzip2 is pretty slow. You probably do not want to use it, even though
> it supports file splits (a feature not available in the stable line of
> 0.20.x/1.x, but available in 0.22+).
> >>
> >> To answer your question though, bzip2 was removed from that document
> because it isn't a native library (it's pure Java). I think bzip2 was
> added earlier due to an oversight, as even 0.20 did not have a native
> bzip2 library. This change in the docs does not mean that bzip2 is
> deprecated -- it is still fully supported and available in trunk as well.
> See https://issues.apache.org/jira/browse/HADOOP-6292 for the doc update
> changes that led to this.
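The speed trade-off mentioned above can be seen with Python's stdlib codecs (a rough illustration only; these are not Hadoop's codec implementations, and the payload is a made-up repetitive sample):

```python
import bz2
import gzip
import time

# Compare gzip and bzip2 on the same repetitive payload.
data = b"key\tvalue\n" * 200_000

t0 = time.perf_counter()
gz = gzip.compress(data)
gz_time = time.perf_counter() - t0

t0 = time.perf_counter()
bz = bz2.compress(data)
bz_time = time.perf_counter() - t0

print(f"gzip:  {len(gz):>8} bytes in {gz_time:.3f}s")
print(f"bzip2: {len(bz):>8} bytes in {bz_time:.3f}s")

# Both round-trip losslessly; bzip2 typically compresses tighter
# but takes noticeably longer, which is Harsh's point above.
assert gzip.decompress(gz) == data
assert bz2.decompress(bz) == data
```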
> >>
> >> The best way would be to use either:
> >>
> >> (a) Hadoop sequence files with any compression codec of choice (best
> would be lzo, gz, maybe even snappy). This file format is built for HDFS