Re: has bzip2 compression been deprecated?
Hi Tim
       When you say that in Hive a table's data is compressed using LZO or some other codec, it means the files/blocks in HDFS that contain the records are compressed with LZO. The compressed size is simply the size of those files/blocks in HDFS; it is not as if records are stored as individual blocks in Hive. Hive is just a query parser that turns SQL-like queries into MR jobs and runs them on the data that lies in HDFS.
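
For example (just a sketch - hive.exec.compress.output and mapred.output.compression.codec are the pre-YARN property names, the LzopCodec class assumes the hadoop-lzo libraries are installed on the cluster, and the table names below are placeholders), compressing the files a query writes looks like:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
INSERT OVERWRITE TABLE my_compressed_table
SELECT * FROM source_table;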
When you have larger chained jobs generated from multiple QLs, you may end up with a large number of small files. There you can enable merge in Hive to get sufficiently large files, by merging the smaller files as the final output of your queries. That is better both for subsequent MR jobs that operate on the output and for storage.
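
If you want to try that, these are the relevant knobs (again only a sketch; defaults and exact property names can differ between Hive versions):

-- merge small files produced by map-only jobs
SET hive.merge.mapfiles=true;
-- merge small files produced by full map-reduce jobs
SET hive.merge.mapredfiles=true;
-- target size (in bytes) for the merged files
SET hive.merge.size.per.task=256000000;
-- trigger a merge pass when the average output file is smaller than this (bytes)
SET hive.merge.smallfiles.avgsize=16000000;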

Hope it helps!..

Regards
Bejoy K S

-----Original Message-----
From: Tim Broberg <[EMAIL PROTECTED]>
Date: Mon, 9 Jan 2012 12:27:47
To: [EMAIL PROTECTED]<[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: RE: has bzip2 compression been deprecated?

Out of curiosity, when Hive records are compressed, how large is a typical compressed record?

Do you have issues where the block size is too small to be compressed efficiently?

More generally, I wonder what the smallest desirable compressed record size is in the hadoop universe.

    - Tim.

________________________________________
From: Tony Burton [[EMAIL PROTECTED]]
Sent: Monday, January 09, 2012 10:02 AM
To: [EMAIL PROTECTED]
Subject: RE: has bzip2 compression been deprecated?

Thanks Bejoy - I'm fairly new to Hive so I may be wrong here, but I was under the impression that the STORED AS clause of a CREATE TABLE in Hive refers to how the data in the table will be stored once the table is created, rather than to the compression format of the data used to populate the table. Can you clarify which interpretation is correct? If it's the latter, how would I read a sequence file into a Hive table?

Thanks,

Tony
-----Original Message-----
From: Bejoy Ks [mailto:[EMAIL PROTECTED]]
Sent: 09 January 2012 17:33
To: [EMAIL PROTECTED]
Subject: Re: has bzip2 compression been deprecated?

Hi Tony
       Adding on to Harsh's comments: if you want the generated sequence
files to be utilized by a Hive table, define your Hive table as

CREATE EXTERNAL TABLE tableName (col1 INT, col2 STRING)
...
...
....
STORED AS SEQUENCEFILE;
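
If the sequence files were produced by an earlier job and already sit in HDFS, you can also point the external table straight at that directory, for example (the table name, column names and path here are just placeholders):

CREATE EXTERNAL TABLE my_seq_table (col1 INT, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION '/user/hadoop/output/my_job';

As far as I know, Hive ignores the key of each SequenceFile record and parses only the value using the row format you declare.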
Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 10:32 PM, alo.alt <[EMAIL PROTECTED]> wrote:

> Tony,
>
> snappy is also available:
> http://code.google.com/p/hadoop-snappy/
>
> best,
>  Alex
>
> --
> Alexander Lorenz
> http://mapredit.blogspot.com
>
> On Jan 9, 2012, at 8:49 AM, Harsh J wrote:
>
> > Tony,
> >
> > * Yeah, SequenceFiles aren't human-readable, but "fs -text" can read them
> out (instead of a plain "fs -cat"). But if you are gonna export your files
> into a system you do not have much control over, probably best to have the
> resultant files not be in SequenceFile/Avro-DataFile format.
> > * Intermediate (M-to-R) files use a custom IFile format these days,
> which is built purely for that purpose.
> > * Hive can use SequenceFiles very well. There is also documented info on
> this in Hive's wiki pages (check the DDL pages, IIRC).
> >
> > On 09-Jan-2012, at 9:44 PM, Tony Burton wrote:
> >
> >> Thanks for the quick reply and the clarification about the
> documentation.
> >>
> >> Regarding sequence files: am I right in thinking that they're a good
> choice for intermediate steps in chained MR jobs, or for file transfer
> between the Map and the Reduce phases of a job; but they shouldn't be used
> for human-readable files at the end of one or more MapReduce jobs? How
> about if the only use of a job's output is analysis via Hive - can Hive create
> tables from sequence files?
> >>
> >> Tony
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Harsh J [mailto:[EMAIL PROTECTED]]
> >> Sent: 09 January 2012 15:34
> >> To: [EMAIL PROTECTED]
