Hive >> mail # user >> RCFile vs SequenceFile vs text files


Re: RCFile vs SequenceFile vs text files
In general, use SequenceFiles with GZip or Snappy compression.
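
As a rough sketch of what that could look like on Hive 0.9: the table names (events_seq, events_raw), the columns, and the partition value below are made up for illustration, and the Snappy codec assumes the native Snappy libraries are installed on the cluster. If they are not, org.apache.hadoop.io.compress.GzipCodec could be swapped in.

```sql
-- Hypothetical table and columns, for illustration only.
CREATE TABLE events_seq (
  user_id    BIGINT,
  event_type STRING,
  amount     DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS SEQUENCEFILE;

-- Compress the output of the populating query.
SET hive.exec.compress.output=true;
-- BLOCK compresses batches of records together rather than each record alone.
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Populate one date partition from another (hypothetical) Hive table.
INSERT OVERWRITE TABLE events_seq PARTITION (dt='2014-01-27')
SELECT user_id, event_type, amount
FROM events_raw
WHERE dt='2014-01-27';
```

Block-compressed SequenceFiles stay splittable with either codec, so MapReduce jobs should still be able to process a large file in parallel.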
On Mon, Jan 27, 2014 at 2:44 PM, Thilina Gunarathne <[EMAIL PROTECTED]> wrote:

> Thanks Eric and Sharath for the pointers to ORC. Unfortunately ORC would
> not be an option for us as our cluster still runs Hive 0.9 and we won't be
> migrating any time soon.
>
> thanks,
> Thilina
>
>
> On Mon, Jan 27, 2014 at 2:35 PM, Sharath Punreddy <[EMAIL PROTECTED]> wrote:
>
>> Quick insights:
>>
>>
>> http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
>>
>>
>>
>>
>> On Mon, Jan 27, 2014 at 1:29 PM, Eric Hanson (BIG DATA) <
>> [EMAIL PROTECTED]> wrote:
>>
>>>  It sounds like ORC would be best.
>>>
>>>
>>>
>>>                 -Eric
>>>
>>>
>>>
>>> *From:* Thilina Gunarathne [mailto:[EMAIL PROTECTED]]
>>> *Sent:* Monday, January 27, 2014 11:05 AM
>>> *To:* [EMAIL PROTECTED]
>>> *Subject:* RCFile vs SequenceFile vs text files
>>>
>>>
>>>
>>> Dear all,
>>>
>>> We are trying to pick the right data storage format for the Hive table
>>> with the following requirement and would really appreciate any insights you
>>> can provide to help our decision.
>>>
>>> 1. ~50 billion records per month, ~14 columns per record, and each record
>>> is ~100 bytes. The table is partitioned by date and gets populated
>>> periodically from another Hive query.
>>>
>>> 2. The columns are dense, so I'm not sure whether we'll get any space
>>> savings by using RCFiles.
>>>
>>> 3. Data needs to be compressed.
>>>
>>> 4. We will be doing a lot of aggregation queries on selected columns.
>>> There will be ad-hoc queries for whole records as well.
>>>
>>> 5. We need the ability to run Java MapReduce programs on the underlying
>>> data. We have existing programs which use custom InputFormats with
>>> compressed text files as input, and we are willing to port them to use other
>>> formats. (How easy is it to use Java MapReduce with RCFiles vs. SequenceFiles?)
>>>
>>> 6. Ability to use Hive indexing.
>>>
>>> thanks a ton in advance,
>>>
>>> Thilina
>>>
>>>
>>>
>>> --
>>> https://www.cs.indiana.edu/~tgunarat/
>>> http://www.linkedin.com/in/thilina
>>>
>>> http://thilina.gunarathne.org
>>>
>>
>>
>>
>> --
>> Thank you
>>
>> Sharath Punreddy
>> 1201 Golden gate Dr,
>> Southlake TX 76092.
>> Phone:626-470-7867
>>
>
>
>
> --
> https://www.cs.indiana.edu/~tgunarat/
> http://www.linkedin.com/in/thilina
> http://thilina.gunarathne.org
>