Re: RCFile vs SequenceFile vs text files
Thanks Edward. I'm actually populating this table periodically from another
temporary table, so ORC sounds like a good fit. Unfortunately, we are stuck
with Hive 0.9.

I wonder how easy or hard it would be to use data stored as RCFile or ORC
from plain Java MapReduce?
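
Would something like the following be the right shape for reading an
RCFile-backed partition directly? This is just a rough sketch assuming the
old mapred API and the Hive 0.9-era classes in the hive-exec jar
(RCFileInputFormat, BytesRefArrayWritable); the column index and paths are
placeholders, not tested code.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.RCFileInputFormat;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class RcFileReadSketch {

  // Each map value is one row; columns are addressed by position, not name.
  public static class ProjectMapper extends MapReduceBase
      implements Mapper<LongWritable, BytesRefArrayWritable, Text, Text> {
    public void map(LongWritable rowId, BytesRefArrayWritable row,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      BytesRefWritable col = row.get(3); // 3 = hypothetical column of interest
      String value =
          new String(col.getData(), col.getStart(), col.getLength(), "UTF-8");
      out.collect(new Text("col3"), new Text(value));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(RcFileReadSketch.class);
    job.setJobName("rcfile-read-sketch");
    job.setInputFormat(RCFileInputFormat.class); // reads the RCFile blocks directly
    job.setMapperClass(ProjectMapper.class);
    job.setNumReduceTasks(0);                    // map-only for the sketch
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));  // e.g. one partition dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}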

thanks,
Thilina
On Mon, Jan 27, 2014 at 3:09 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:

> The thing about ORC is that it is great for tables created from other
> tables (like the other columnar formats), but if you are logging directly
> to HDFS, a columnar format is not easy (or even possible) to write directly.
> Normally people store data in a very direct row-oriented form and then
> their first MapReduce job buckets/partitions/columnar-izes it.
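>
> Something like the following (untested) is roughly the shape of that first
> columnar-izing job: read the row-oriented text and write it back out through
> RCFileOutputFormat, with bucketing/partitioning left out for brevity. It
> assumes the old mapred API and the Hive 0.9-era classes in hive-exec; the
> tab delimiter is an assumption, and the column count just matches the ~14
> columns described below.
>
> import java.io.IOException;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hive.ql.io.RCFileOutputFormat;
> import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
> import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.*;
>
> public class RowsToRcFile {
>   private static final int NUM_COLUMNS = 14;  // matches the table described below
>
>   public static class ColumnarizeMapper extends MapReduceBase
>       implements Mapper<LongWritable, Text, NullWritable, BytesRefArrayWritable> {
>     public void map(LongWritable offset, Text line,
>                     OutputCollector<NullWritable, BytesRefArrayWritable> out,
>                     Reporter reporter) throws IOException {
>       // Split one row-oriented input line into its columns (tab-delimited here).
>       String[] fields = line.toString().split("\t", -1);
>       BytesRefArrayWritable row = new BytesRefArrayWritable(NUM_COLUMNS);
>       for (int i = 0; i < NUM_COLUMNS; i++) {
>         byte[] bytes = (i < fields.length ? fields[i] : "").getBytes("UTF-8");
>         row.set(i, new BytesRefWritable(bytes, 0, bytes.length));
>       }
>       out.collect(NullWritable.get(), row);
>     }
>   }
>
>   public static void main(String[] args) throws IOException {
>     JobConf job = new JobConf(RowsToRcFile.class);
>     job.setJobName("rows-to-rcfile");
>     RCFileOutputFormat.setColumnNumber(job, NUM_COLUMNS);  // the writer needs the column count
>     job.setInputFormat(TextInputFormat.class);
>     job.setOutputFormat(RCFileOutputFormat.class);
>     job.setMapperClass(ColumnarizeMapper.class);
>     job.setNumReduceTasks(0);
>     job.setOutputKeyClass(NullWritable.class);
>     job.setOutputValueClass(BytesRefArrayWritable.class);
>     FileInputFormat.addInputPath(job, new Path(args[0]));
>     FileOutputFormat.setOutputPath(job, new Path(args[1]));
>     JobClient.runJob(job);
>   }
> }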
>
>
> On Mon, Jan 27, 2014 at 2:44 PM, Thilina Gunarathne <[EMAIL PROTECTED]> wrote:
>
>> Thanks Eric and Sharath for the pointers to ORC. Unfortunately ORC would
>> not be an option for us as our cluster still runs Hive 0.9 and we won't be
>> migrating any time soon.
>>
>> thanks,
>> Thilina
>>
>>
>> On Mon, Jan 27, 2014 at 2:35 PM, Sharath Punreddy <[EMAIL PROTECTED]> wrote:
>>
>>> Quick insights:
>>>
>>>
>>> http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
>>>
>>>
>>>
>>>
>>> On Mon, Jan 27, 2014 at 1:29 PM, Eric Hanson (BIG DATA) <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>>  It sounds like ORC would be best.
>>>>
>>>>
>>>>
>>>>                 -Eric
>>>>
>>>>
>>>>
>>>> *From:* Thilina Gunarathne [mailto:[EMAIL PROTECTED]]
>>>> *Sent:* Monday, January 27, 2014 11:05 AM
>>>> *To:* [EMAIL PROTECTED]
>>>> *Subject:* RCFile vs SequenceFile vs text files
>>>>
>>>>
>>>>
>>>> Dear all,
>>>>
>>>> We are trying to pick the right data storage format for a Hive table with
>>>> the following requirements, and would really appreciate any insights you
>>>> can provide to help with our decision.
>>>>
>>>> 1. ~50 billion records per month, ~14 columns per record, and each record
>>>> is ~100 bytes. The table is partitioned by date and gets populated
>>>> periodically from another Hive query.
>>>>
>>>> 2. The columns are dense, so I'm not sure whether we'll get any space
>>>> savings by using RCFiles.
>>>>
>>>> 3. Data needs to be compressed.
>>>>
>>>> 4. We will be doing a lot of aggregation queries over selected columns.
>>>> There will be ad-hoc queries over whole records as well.
>>>>
>>>> 5. We need the ability to run Java MapReduce programs on the underlying
>>>> data. We have existing programs which use custom InputFormats with
>>>> compressed text files as input, and we are willing to port them to other
>>>> formats. (How easy is it to use Java MapReduce with RCFiles vs.
>>>> SequenceFiles? See the sketch after this list for the SequenceFile side.)
>>>>
>>>> 6. Ability to use Hive indexing.
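>>>>
>>>> To make question 5 concrete, the SequenceFile side of what we have in mind
>>>> is roughly the job below: writing block-compressed SequenceFiles with the
>>>> old mapred API. The identity mapper and the gzip codec are placeholders
>>>> for illustration only, not what we actually run.
>>>>
>>>> import org.apache.hadoop.fs.Path;
>>>> import org.apache.hadoop.io.LongWritable;
>>>> import org.apache.hadoop.io.SequenceFile;
>>>> import org.apache.hadoop.io.Text;
>>>> import org.apache.hadoop.io.compress.GzipCodec;
>>>> import org.apache.hadoop.mapred.*;
>>>> import org.apache.hadoop.mapred.lib.IdentityMapper;
>>>>
>>>> public class TextToCompressedSeqFile {
>>>>   public static void main(String[] args) throws Exception {
>>>>     JobConf job = new JobConf(TextToCompressedSeqFile.class);
>>>>     job.setJobName("text-to-compressed-seqfile");
>>>>     job.setInputFormat(TextInputFormat.class);
>>>>     job.setOutputFormat(SequenceFileOutputFormat.class);
>>>>     job.setMapperClass(IdentityMapper.class);  // pass rows through unchanged
>>>>     job.setNumReduceTasks(0);
>>>>     job.setOutputKeyClass(LongWritable.class);
>>>>     job.setOutputValueClass(Text.class);
>>>>     // Block compression groups many records per compressed block, which
>>>>     // usually compresses small, dense rows like ours best.
>>>>     FileOutputFormat.setCompressOutput(job, true);
>>>>     SequenceFileOutputFormat.setOutputCompressionType(job,
>>>>         SequenceFile.CompressionType.BLOCK);
>>>>     FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
>>>>     FileInputFormat.addInputPath(job, new Path(args[0]));
>>>>     FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>>     JobClient.runJob(job);
>>>>   }
>>>> }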
>>>>
>>>> thanks a ton in advance,
>>>>
>>>> Thilina
>>>>
>>>>
>>>>
>>>> --
>>>> https://www.cs.indiana.edu/~tgunarat/
>>>> http://www.linkedin.com/in/thilina
>>>>
>>>> http://thilina.gunarathne.org
>>>>
>>>
>>>
>>>
>>> --
>>> Thank you
>>>
>>> Sharath Punreddy
1201 Golden Gate Dr,
Southlake, TX 76092
Phone: 626-470-7867
>>>
>>
>>
>>
>> --
>> https://www.cs.indiana.edu/~tgunarat/
>> http://www.linkedin.com/in/thilina
>> http://thilina.gunarathne.org
>>
>
>
--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org