Hadoop >> mail # user >> has bzip2 compression been deprecated?


Re: has bzip2 compression been deprecated?
Yes. Hive doesn't format data when you load it. The only exception is if you do an INSERT OVERWRITE ... .

-Joey
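
A minimal HiveQL sketch of the behaviour discussed in this thread. The table names, column names, the tab delimiter, and the location path are hypothetical, and it assumes the SequenceFile values are delimited text rows. It is an illustration under those assumptions, not the poster's actual schema.

```sql
-- Hypothetical names and path; adjust to your own data.

-- 1. Describe existing SequenceFiles in place. This only records metadata:
--    no data is read, moved, or validated when the DDL runs.
CREATE EXTERNAL TABLE job_output_seq (col1 INT, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION '/path/to/job/output';

-- 2. A second, managed table declared with a different storage format.
CREATE TABLE job_output_text (col1 INT, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- 3. INSERT OVERWRITE is the exception: it runs a job that reads rows using
--    the source table's declared format and writes them in the target's format.
INSERT OVERWRITE TABLE job_output_text
SELECT col1, col2 FROM job_output_seq;
```

If the files under the location do not match the declared format, the CREATE statement still succeeds; the mismatch only shows up when a query reads the table.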

On Jan 10, 2012, at 6:08, Tony Burton <[EMAIL PROTECTED]> wrote:

> Thanks for this Bejoy, very helpful.
>
> So, to summarise: when I CREATE EXTERNAL TABLE in Hive, the STORED AS, ROW FORMAT and other parameters you mention are telling Hive what to expect when it reads the data I want to analyse, despite not checking the data to see if it meets these criteria?
>
> Do these guidelines still apply if the table is not EXTERNAL?
>
> Tony
>
>
>
> -----Original Message-----
> From: Bejoy Ks [mailto:[EMAIL PROTECTED]]
> Sent: 09 January 2012 19:00
> To: [EMAIL PROTECTED]
> Subject: Re: has bzip2 compression been deprecated?
>
> Hi Tony
>       As I understand your requirement, your MapReduce job produces a
> SequenceFile as output and you need to use this file as input to a Hive
> table.
>        When you CREATE an EXTERNAL table in Hive you specify a location
> where your data is stored and also the format of that data (like
> the field delimiter, row delimiter, file type etc. of your data). You are
> not actually loading data anywhere when you create a Hive external
> table (issue the DDL), just specifying where the data lies in the file
> system. In fact, no validation is performed at that time to check the
> data quality. When you query/retrieve your data through HiveQL, the
> parameters specified along with CREATE TABLE, such as ROW FORMAT, FIELDS
> TERMINATED BY, STORED AS etc., are used to execute the right MapReduce job(s).
>
>     In short, STORED AS refers to the type of files that a table's data
> directory holds.
>
> For details
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
>
> Hope it helps!..
>
> Regards
> Bejoy.K.S
>
> On Mon, Jan 9, 2012 at 11:32 PM, Tony Burton <[EMAIL PROTECTED]> wrote:
>
>> Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was
>> under the impression that the STORED AS part of a CREATE TABLE in Hive
>> refers to how the data in the table will be stored once the table is
>> created, rather than the compression format of the data used to populate
>> the table. Can you clarify which is the correct interpretation? If it's the
>> latter, how would I read a sequence file into a Hive table?
>>
>> Thanks,
>>
>> Tony
>>
>>
>>
>>
>> -----Original Message-----
>> From: Bejoy Ks [mailto:[EMAIL PROTECTED]]
>> Sent: 09 January 2012 17:33
>> To: [EMAIL PROTECTED]
>> Subject: Re: has bzip2 compression been deprecated?
>>
>> Hi Tony
>>      Adding on to Harsh's comments: if you want the generated sequence
>> files to be used by a Hive table, define your Hive table as
>>
>> CREATE EXTERNAL TABLE tableName(col1 INT, col2 STRING)
>> ...
>> ...
>> ....
>> STORED AS SEQUENCEFILE;
>>
>>
>> Regards
>> Bejoy.K.S
>>
>> On Mon, Jan 9, 2012 at 10:32 PM, alo.alt <[EMAIL PROTECTED]> wrote:
>>
>>> Tony,
>>>
>>> snappy is also available:
>>> http://code.google.com/p/hadoop-snappy/
>>>
>>> best,
>>> Alex
>>>
>>> --
>>> Alexander Lorenz
>>> http://mapredit.blogspot.com
>>>
>>> On Jan 9, 2012, at 8:49 AM, Harsh J wrote:
>>>
>>>> Tony,
>>>>
>>>> * Yeah, SequenceFiles aren't human-readable, but "fs -text" can read them
>>>> out (instead of a plain "fs -cat"). But if you are gonna export your
>>>> files into a system you do not have much control over, probably best to
>>>> have the resultant files not be in SequenceFile/Avro-DataFile format.
>>>> * Intermediate (M-to-R) files use a custom IFile format these days,
>>>> which is built purely for that purpose.
>>>> * Hive can use SequenceFiles very well. There is also documented info on
>>>> this in Hive's wiki pages (check the DDL pages, IIRC).
>>>>
>>>> On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
>>>>
>>>>> Thanks for the quick reply and the clarification about the
>>>>> documentation.
>>>>>
>>>>> Regarding sequence files: am I right in thinking that they're a good