Hadoop, mail # user - has bzip2 compression been deprecated?


Re: has bzip2 compression been deprecated?
Joey Echeverria 2012-01-10, 11:27
Yes. Hive doesn't format data when you load it. The only exception is if you do an INSERT OVERWRITE ... .
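A minimal HiveQL sketch of the difference (table and path names are invented for illustration):

  -- LOAD DATA only moves the source files into the table's directory,
  -- byte-for-byte unchanged; no format conversion happens
  LOAD DATA INPATH '/user/tony/raw/logs.tsv' INTO TABLE logs_text;

  -- INSERT OVERWRITE actually runs a job and rewrites the rows in the
  -- target table's declared storage format (e.g. STORED AS SEQUENCEFILE)
  INSERT OVERWRITE TABLE logs_seq SELECT * FROM logs_text;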

-Joey

On Jan 10, 2012, at 6:08, Tony Burton <[EMAIL PROTECTED]> wrote:

> Thanks for this Bejoy, very helpful.
>
> So, to summarise: when I CREATE EXTERNAL TABLE in Hive, the STORED AS, ROW FORMAT and other parameters you mention are telling Hive what to expect when it reads the data I want to analyse, despite not checking the data to see if it meets these criteria?
>
> Do these guidelines still apply if the table is not EXTERNAL?
>
> Tony
>
>
>
> -----Original Message-----
> From: Bejoy Ks [mailto:[EMAIL PROTECTED]]
> Sent: 09 January 2012 19:00
> To: [EMAIL PROTECTED]
> Subject: Re: has bzip2 compression been deprecated?
>
> Hi Tony
>       As I understand your requirement, your MapReduce job produces a
> Sequence File as output and you need to use this file as the input to a
> Hive table.
>       When you CREATE an EXTERNAL TABLE in Hive, you specify a location
> where your data is stored and also the format of that data (the field
> delimiter, row delimiter, file type, etc.). You are not actually loading
> data anywhere when you create a Hive external table (issue DDL); you are
> just specifying where the data lies in the file system. In fact, no
> validation is performed at that time to check the data quality. When you
> query/retrieve your data through Hive QL, the parameters specified along
> with CREATE TABLE, such as ROW FORMAT, FIELDS TERMINATED BY and STORED
> AS, are used to execute the right MapReduce job(s).
>
>     In short, STORED AS refers to the type of files that a table's data
> directory holds.
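>
> A minimal sketch of such a definition (column names, delimiter and path
> are invented for illustration):
>
> CREATE EXTERNAL TABLE page_views (user_id INT, url STRING)
> ROW FORMAT DELIMITED
>   FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE
> LOCATION '/user/tony/page_views/';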
>
> For details
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
>
> Hope it helps!
>
> Regards
> Bejoy.K.S
>
> On Mon, Jan 9, 2012 at 11:32 PM, Tony Burton <[EMAIL PROTECTED]>wrote:
>
>> Thanks Bejoy - I'm fairly new to Hive so may be wrong here, but I was
>> under the impression that the STORED AS part of a CREATE TABLE in Hive
>> refers to how the data in the table will be stored once the table is
>> created, rather than the compression format of the data used to populate
>> the table. Can you clarify which is the correct interpretation? If it's the
>> latter, how would I read a sequence file into a Hive table?
>>
>> Thanks,
>>
>> Tony
>>
>>
>>
>>
>> -----Original Message-----
>> From: Bejoy Ks [mailto:[EMAIL PROTECTED]]
>> Sent: 09 January 2012 17:33
>> To: [EMAIL PROTECTED]
>> Subject: Re: has bzip2 compression been deprecated?
>>
>> Hi Tony
>>      Adding on to Harsh's comments: if you want the generated sequence
>> files to be utilized by a Hive table, define your Hive table as
>>
>> CREATE EXTERNAL TABLE tableName(col1 INT, col2 STRING)
>> ...
>> ...
>> ....
>> STORED AS SEQUENCEFILE;
>>
>>
>> Regards
>> Bejoy.K.S
>>
>> On Mon, Jan 9, 2012 at 10:32 PM, alo.alt <[EMAIL PROTECTED]> wrote:
>>
>>> Tony,
>>>
>>> snappy is also available:
>>> http://code.google.com/p/hadoop-snappy/
>>>
>>> best,
>>> Alex
>>>
>>> --
>>> Alexander Lorenz
>>> http://mapredit.blogspot.com
>>>
>>> On Jan 9, 2012, at 8:49 AM, Harsh J wrote:
>>>
>>>> Tony,
>>>>
>>>> * Yeah, SequenceFiles aren't human-readable, but "fs -text" can read it
>>> out (instead of a plain "fs -cat"). But if you are gonna export your
>> files
>>> into a system you do not have much control over, probably best to have
>> the
>>> resultant files not be in SequenceFile/Avro-DataFile format.
>>>> * Intermediate (M-to-R) files use a custom IFile format these days,
>>> which is built purely for that purpose.
>>>> * Hive can use SequenceFiles very well. There is also documented info
>> on
>>> this in the Hive's wiki pages (Check the DDL pages, IIRC).
>>>>
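>>>> For example, to inspect a SequenceFile from the shell (path invented
>>>> for illustration):
>>>>
>>>>   hadoop fs -text /user/tony/output/part-00000 | head
>>>>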
>>>> On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
>>>>
>>>>> Thanks for the quick reply and the clarification about the
>>> documentation.
>>>>>
>>>>> Regarding sequence files: am I right in thinking that they're a good