Hive >> mail # user >> ORC vs TEXT file


Re: ORC vs TEXT file
Hi Owen,

Thanks for your response.

My table structure is as follows:

a) Textfile:
CREATE EXTERNAL TABLE test_textfile (
    COL1 BIGINT,
    COL2 STRING,
    COL3 BIGINT,
    COL4 STRING,
    COL5 STRING,
    COL6 BIGINT,
    COL7 BIGINT,
    COL8 BIGINT,
    COL9 BIGINT,
    COL10 BIGINT,
    COL11 BIGINT,
    COL12 STRING,
    COL13 STRING,
    COL14 STRING,
    COL15 BIGINT,
    COL16 STRING,
    COL17 DOUBLE,
    COL18 DOUBLE,
    COL19 DOUBLE,
    COL20 DOUBLE,
    COL21 DOUBLE,
    COL22 DOUBLE,
    COL23 DOUBLE,
    COL24 DOUBLE,
    COL25 DOUBLE,
    COL26 DOUBLE,
    COL27 DOUBLE,
    COL28 DOUBLE,
    COL29 DOUBLE,
    COL30 DOUBLE,
    COL31 DOUBLE,
    COL32 DOUBLE,
    COL33 STRING,
    COL34 STRING,
    COL35 DOUBLE,
    COL36 DOUBLE,
    COL37 DOUBLE,
    COL38 DOUBLE,
    COL39 DOUBLE,
    COL40 DOUBLE,
    COL41 DOUBLE,
    COL42 DOUBLE,
    COL43 DOUBLE,
    COL44 DOUBLE,
    COL45 DOUBLE,
    COL46 DOUBLE,
    COL47 DOUBLE,
    COL48 DOUBLE,
    COL49 DOUBLE,
    COL50 DOUBLE,
    COL51 DOUBLE,
    COL52 DOUBLE,
    COL53 DOUBLE,
    COL54 DOUBLE,
    COL55 DOUBLE,
    COL56 STRING,
    COL57 DOUBLE,
    COL58 DOUBLE,
    COL59 DOUBLE,
    COL60 DOUBLE,
    COL61 STRING,
    COL62 STRING,
    COL63 STRING,
    COL64 STRING,
    COL65 STRING,
    COL66 STRING,
    COL67 STRING,
    COL68 STRING,
    COL69 STRING,
    COL70 STRING,
    COL71 STRING,
    COL72 STRING,
    COL73 STRING,
    COL74 STRING
) PARTITIONED BY (
    COL75 STRING,
    COL76 STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 's3://test/textfile/';
I am using block-level compression with BZip2Codec for the output.

b) ORC: the same set of columns as above; I only changed the storage clause
to STORED AS ORC. I am not using any compression option.

c) Inserted 7256852 records into both tables.

d) Space occupied in S3:

ORC (3 files): 153.4 MB * 3 = 460.2 MB
TEXT (single file in bz2 format): 306 MB
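
For reference, the text output compression described in (a) is typically switched on with session settings along the following lines. This is only a sketch using the Hadoop 1.x property names; it is an assumption that these match the settings behind the numbers above, since the thread does not list them.

```sql
-- Sketch only: assumed session settings for bzip2-compressed text output.
-- Enable compression of the final job output written by Hive.
SET hive.exec.compress.output=true;
-- Codec named in the message (BZip2Codec).
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
-- BLOCK vs RECORD mainly matters for SequenceFile output; kept to match "block level".
SET mapred.output.compression.type=BLOCK;
```

With these in effect, an INSERT into test_textfile would be expected to write .bz2 files under the table location.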

I still need to check ORC with compression enabled.

Please let me know if I have missed anything.
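
A minimal sketch of what that check could look like, with the codec set explicitly through the `orc.compress` table property. The table name `test_orc`, the location `s3://test/orc/`, and the shortened column list are hypothetical and only for illustration; the real table would carry all 74 columns.

```sql
-- Hypothetical ORC counterpart of test_textfile (column list abbreviated).
-- orc.compress accepts NONE, ZLIB or SNAPPY.
CREATE EXTERNAL TABLE test_orc (
    COL1 BIGINT,
    COL2 STRING,
    COL17 DOUBLE
) PARTITIONED BY (
    COL75 STRING,
    COL76 STRING
)
STORED AS ORC
LOCATION 's3://test/orc/'
TBLPROPERTIES ("orc.compress"="ZLIB");

-- Both partition columns are filled dynamically from the source table.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition columns must come last in the SELECT list.
INSERT OVERWRITE TABLE test_orc PARTITION (COL75, COL76)
SELECT COL1, COL2, COL17, COL75, COL76
FROM test_textfile;
```

Comparing the resulting size in S3 for `ZLIB`, `SNAPPY` and `NONE` against the 306 MB bz2 text file would show whether the ORC files in (d) were actually being compressed.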

Thanks,
On Mon, Aug 12, 2013 at 8:50 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

> Pandees,
>   I've never seen a table that was larger with ORC than with text. Can you
> share your text file's schema with us? Is the table very small? How many
> rows and GB are the tables? The overhead for ORC is typically small, but as
> Ed says it is possible in rare cases for the overhead to dominate the data
> size itself.
>
> -- Owen
>
>
> On Mon, Aug 12, 2013 at 6:52 AM, pandees waran <[EMAIL PROTECTED]> wrote:
>
>> Thanks Edward. I shall try compression with ORC and let you know. Also,
>> it looks like the CPU usage is lower when querying ORC than when querying
>> the text file.
>> But the total time taken by the query is slightly higher with ORC than
>> with the text file. Could you please explain the difference between
>> cumulative CPU time and the total time taken (usually shown in the last
>> line, in seconds)? Which one should we give preference to?
>> On Aug 12, 2013 7:01 PM, "Edward Capriolo" <[EMAIL PROTECTED]> wrote:
>>
>>> Columnar formats do not always beat row-wise storage. Many times gzip
>>> plus block storage will compress something better than columnar storage,
>>> especially when you have repeated data in different columns.
>>>
>>> Based on what you are saying, it is possible that you missed a
>>> setting and the ORC files are not compressed.
>>>
>>>
>>> On Monday, August 12, 2013, pandees waran <[EMAIL PROTECTED]> wrote:
>>> > Hi,
>>> >
>>> > Currently, we use the TEXTFILE format in Hive 0.8 when creating
>>> > external tables for intermediate processing.
>>> > I have read about ORC in 0.11. I have created the same table in 0.11
>>> > with the ORC format.
>>> > Without any compression, the ORC files (3 files in total) occupied
>>> > twice as much space as the TEXTFILE (only one file).
>>> > Also, when I query the data from ORC:
>>> > Select count(*) from orc_table
>>> >
>>> > it took more time than the same query against the text file.
Thanks,
Pandeeswaran