Hive >> mail # user >> RCFile vs SequenceFile vs text files


Re: RCFile vs SequenceFile vs text files
Thanks Edward. I'm actually populating this table periodically from another
temporary table, so ORC sounds like a good fit. Unfortunately, we are
stuck with Hive 0.9.
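For concreteness, a periodic load from a temporary table into an RCFile-backed table (RCFile rather than ORC, given Hive 0.9) might look like the sketch below. All table, column, and partition names are illustrative assumptions, as is the codec choice:

```sql
-- Hypothetical staging-to-RCFile load; names are illustrative, not from the thread.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

CREATE TABLE IF NOT EXISTS events_rc (
  user_id BIGINT,
  event_type STRING,
  payload STRING
)
PARTITIONED BY (dt STRING)
STORED AS RCFILE;

-- Periodic load from the row-oriented temporary table into one date partition.
INSERT OVERWRITE TABLE events_rc PARTITION (dt = '2014-01-27')
SELECT user_id, event_type, payload
FROM tmp_events
WHERE dt = '2014-01-27';
```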

I wonder how easy or hard it would be to use the data stored as RCFile or
ORC from Java MapReduce?
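On the Java MapReduce side: RCFile ships an old-style InputFormat in hive-exec, so a plain `org.apache.hadoop.mapred` job can read the files directly. The sketch below is an assumption-laden illustration, not a drop-in program: class names and paths are hypothetical, the hive-exec/hive-serde jars are assumed to be on the job classpath, and the column-projection property name is the one used by Hive 0.9's ColumnProjectionUtils:

```java
// Sketch: counting values of one column in an RCFile-backed table
// with a plain MapReduce job (old mapred API, as in Hive 0.9's RCFileInputFormat).
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.RCFileInputFormat;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class RcFileColumnCount {

  // RCFileInputFormat yields one BytesRefArrayWritable per row;
  // each element holds the raw bytes of one column.
  public static class CountMapper extends MapReduceBase
      implements Mapper<LongWritable, BytesRefArrayWritable, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    public void map(LongWritable rowId, BytesRefArrayWritable row,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      BytesRefWritable col = row.get(0); // the one projected column
      String value = new String(col.getData(), col.getStart(), col.getLength());
      out.collect(new Text(value), ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(RcFileColumnCount.class);
    conf.setJobName("rcfile-column-count");
    conf.setInputFormat(RCFileInputFormat.class);
    // Column projection: materialize only column 0 instead of whole rows.
    conf.set("hive.io.file.readcolumn.ids", "0");
    conf.setMapperClass(CountMapper.class);
    conf.setReducerClass(LongSumReducer.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```

The column-projection step is the practical difference from SequenceFiles: a SequenceFile value deserializes to a whole row, so there is nothing analogous to prune.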

thanks,
Thilina
On Mon, Jan 27, 2014 at 3:09 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote:

> The thing about ORC is that it is great for tables created from other
> tables (like the other columnar formats), but if you are logging directly
> to HDFS, a columnar format is not easy (or even possible) to write directly.
> Normally people store data in a very direct, row-oriented form and then
> their first MapReduce job buckets/partitions/columnar-izes it.
>
>
> On Mon, Jan 27, 2014 at 2:44 PM, Thilina Gunarathne <[EMAIL PROTECTED]>wrote:
>
>> Thanks Eric and Sharath for the pointers to ORC. Unfortunately ORC would
>> not be an option for us as our cluster still runs Hive 0.9 and we won't be
>> migrating any time soon.
>>
>> thanks,
>> Thilina
>>
>>
>> On Mon, Jan 27, 2014 at 2:35 PM, Sharath Punreddy <[EMAIL PROTECTED]>wrote:
>>
>>> Quick insights:
>>>
>>>
>>> http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
>>>
>>>
>>>
>>>
>>> On Mon, Jan 27, 2014 at 1:29 PM, Eric Hanson (BIG DATA) <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>>  It sounds like ORC would be best.
>>>>
>>>>
>>>>
>>>>                 -Eric
>>>>
>>>>
>>>>
>>>> *From:* Thilina Gunarathne [mailto:[EMAIL PROTECTED]]
>>>> *Sent:* Monday, January 27, 2014 11:05 AM
>>>> *To:* [EMAIL PROTECTED]
>>>> *Subject:* RCFile vs SequenceFile vs text files
>>>>
>>>>
>>>>
>>>> Dear all,
>>>>
>>>> We are trying to pick the right data storage format for the Hive table
>>>> with the following requirement and would really appreciate any insights you
>>>> can provide to help our decision.
>>>>
>>>> 1. ~50 billion records per month, ~14 columns per record, and each record
>>>> is ~100 bytes. The table is partitioned by date and gets populated
>>>> periodically from another Hive query.
>>>>
>>>> 2. The columns are dense, so I'm not sure whether we'll get any space
>>>> savings by using RCFiles.
>>>>
>>>> 3. Data needs to be compressed.
>>>>
>>>> 4. We will be doing a lot of aggregation queries for selected columns.
>>>> There will be ad-hoc queries for whole records as well.
>>>>
>>>> 5. We need the ability to run Java MapReduce programs on the underlying
>>>> data. We have existing programs which use custom InputFormats with
>>>> compressed text files as input, and we are willing to port them to other
>>>> formats. (How easy is it to use Java MapReduce with RCFiles vs SequenceFiles?)
>>>>
>>>> 6. Ability to use hive indexing.
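On requirement 6, Hive's compact indexes predate 0.9 and can be declared on an RCFile-backed table. A hedged sketch, with hypothetical table and column names:

```sql
-- Hypothetical compact index on an RCFile table; names are illustrative.
CREATE INDEX events_user_idx
ON TABLE events_rc (user_id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Rebuild after each periodic load so the index reflects new partitions.
ALTER INDEX events_user_idx ON events_rc REBUILD;
```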
>>>>
>>>> thanks a ton in advance,
>>>>
>>>> Thilina
>>>>
>>>>
>>>>
>>>> --
>>>> https://www.cs.indiana.edu/~tgunarat/
>>>> http://www.linkedin.com/in/thilina
>>>>
>>>> http://thilina.gunarathne.org
>>>>
>>>
>>>
>>>
>>> --
>>> Thank you
>>>
>>> Sharath Punreddy
>>> 1201 Golden gate Dr,
>>> Southlake TX 76092.
>>> Phone:626-470-7867
>>>
>>
>>
>>
>> --
>> https://www.cs.indiana.edu/~tgunarat/
>> http://www.linkedin.com/in/thilina
>> http://thilina.gunarathne.org
>>
>
>
--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org

 