Re: RCFile vs SequenceFile vs text files
The thing about ORC is that it is great for tables created from other
tables (like the other columnar formats), but if you are logging directly
to HDFS, a columnar format is not easy (or even possible) to write directly.
Normally people store data in a very direct row-oriented form, and then
their first MapReduce job buckets, partitions, and columnar-izes it.
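
To make that concrete, here is a rough sketch in plain Python of what "columnar-izing" a batch of row-oriented log lines looks like. This is not the actual ORC or RCFile writer; the function name `rows_to_column_groups`, the `group_size` parameter, and the sample log lines are all made up for illustration. The point it tries to show is that a columnar layout can only be produced once a whole group of rows is in hand, which is why appending one log record at a time does not fit the format well.

```python
def rows_to_column_groups(rows, group_size):
    """Pivot row-oriented records into column-oriented row groups.

    Hypothetical helper: each group holds up to `group_size` rows,
    stored as one list per column instead of one list per row.
    """
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # zip(*chunk) transposes the chunk, yielding one tuple per column.
        groups.append([list(column) for column in zip(*chunk)])
    return groups


# Made-up, row-oriented log lines as they might be appended to HDFS.
raw_lines = [
    "2014-01-27,alice,200",
    "2014-01-27,bob,404",
    "2014-01-27,carol,200",
]
rows = [line.split(",") for line in raw_lines]

# A later batch job (the "first MapReduce job" above) does the pivot.
groups = rows_to_column_groups(rows, group_size=2)
for group in groups:
    print(group)
```

An aggregation over a single column then only needs to read that column's list from each group, rather than every full record.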
On Mon, Jan 27, 2014 at 2:44 PM, Thilina Gunarathne <[EMAIL PROTECTED]> wrote:

> Thanks Eric and Sharath for the pointers to ORC. Unfortunately ORC would
> not be an option for us as our cluster still runs Hive 0.9 and we won't be
> migrating any time soon.
>
> thanks,
> Thilina
>
>
> On Mon, Jan 27, 2014 at 2:35 PM, Sharath Punreddy <[EMAIL PROTECTED]> wrote:
>
>> Quick insights:
>>
>>
>> http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
>>
>>
>>
>>
>> On Mon, Jan 27, 2014 at 1:29 PM, Eric Hanson (BIG DATA) <
>> [EMAIL PROTECTED]> wrote:
>>
>>>  It sounds like ORC would be best.
>>>
>>>
>>>
>>>                 -Eric
>>>
>>>
>>>
>>> *From:* Thilina Gunarathne [mailto:[EMAIL PROTECTED]]
>>> *Sent:* Monday, January 27, 2014 11:05 AM
>>> *To:* [EMAIL PROTECTED]
>>> *Subject:* RCFile vs SequenceFile vs text files
>>>
>>>
>>>
>>> Dear all,
>>>
>>> We are trying to pick the right data storage format for a Hive table
>>> with the following requirements and would really appreciate any insights you
>>> can provide to help our decision.
>>>
>>> 1. ~50 billion records per month, ~14 columns per record, and each record
>>> is ~100 bytes. The table is partitioned by date and gets populated
>>> periodically from another Hive query.
>>>
>>> 2. The columns are dense, so I'm not sure whether we'll get any space
>>> savings by using RCFiles.
>>>
>>> 3. Data needs to be compressed.
>>>
>>> 4. We will be doing a lot of aggregation queries on selected columns.
>>> There will be ad-hoc queries for whole records as well.
>>>
>>> 5. We need the ability to run Java MapReduce programs on the underlying
>>> data. We have existing programs which use custom InputFormats with
>>> compressed text files as input, and we are willing to port them to use other
>>> formats. (How easy is it to use Java MapReduce with RCFiles vs SequenceFiles?)
>>>
>>> 6. Ability to use Hive indexing.
>>>
>>> thanks a ton in advance,
>>>
>>> Thilina
>>>
>>>
>>>
>>> --
>>> https://www.cs.indiana.edu/~tgunarat/
>>> http://www.linkedin.com/in/thilina
>>>
>>> http://thilina.gunarathne.org
>>>
>>
>>
>>
>> --
>> Thank you
>>
>> Sharath Punreddy
>> 1201 Golden gate Dr,
>> Southlake TX 76092.
>> Phone:626-470-7867
>>
>
>
>
>