Hive >> mail # user >> ORC Tuning - Examples?


Re: ORC Tuning - Examples?
Hi John,

Here is my experience with the stripe size. For a given table, increasing
the stripe size also increases the size of each column within a stripe, so
the ORC reader can read a column from disk more efficiently: it can read
more data sequentially (assuming the reader and the HDFS block are
co-located). However, a larger stripe size may reduce the number of
concurrent Map tasks reading an ORC file, because a Map task needs to
process at least one stripe (it seems a stripe is not splittable right now).
If you can still get a sufficient degree of parallelism, I think increasing
the stripe size generally gives you better read efficiency within a task.

On HDDs, though, the gain in per-task read efficiency diminishes as the
stripe size grows. So, for a table with only a few columns (assuming a
single ORC file is used), a smaller stripe size may not significantly hurt
read efficiency in a task, and it potentially lets more concurrent tasks
read the file. So I think you need to trade off the read efficiency of a
single task (larger stripe size -> better read efficiency in a task)
against the degree of parallelism (smaller stripe size -> more concurrent
tasks reading an ORC file) when determining the right stripe size.
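
To make the trade-off concrete, here is a rough toy model in Python. It is
only a sketch under simplifying assumptions that are mine, not measurements:
all columns are the same size, reading one column's chunk from a stripe costs
one disk seek plus a sequential transfer, and the 10 ms seek time and
100 MB/s bandwidth are hypothetical HDD figures. The function names are made
up for illustration.

```python
import math


def max_parallel_tasks(file_size_mb, stripe_size_mb):
    """Upper bound on concurrent Map tasks for one ORC file, assuming
    each task must process at least one whole stripe."""
    return math.ceil(file_size_mb / stripe_size_mb)


def column_read_efficiency(stripe_size_mb, num_columns,
                           seek_ms=10.0, bandwidth_mb_s=100.0):
    """Fraction of the time spent transferring data (rather than seeking)
    when reading one column's chunk from one stripe.

    Assumes equally sized columns; seek_ms and bandwidth_mb_s are
    hypothetical HDD figures, not measured values."""
    chunk_mb = stripe_size_mb / num_columns
    transfer_ms = chunk_mb / bandwidth_mb_s * 1000.0
    return transfer_ms / (seek_ms + transfer_ms)


if __name__ == "__main__":
    file_size_mb = 1024  # hypothetical single ORC file
    for num_columns in (4, 50):
        print(f"columns = {num_columns}")
        for stripe_mb in (64, 128, 256, 512):
            tasks = max_parallel_tasks(file_size_mb, stripe_mb)
            eff = column_read_efficiency(stripe_mb, num_columns)
            print(f"  stripe = {stripe_mb:>3} MB  "
                  f"max tasks = {tasks:>2}  "
                  f"read efficiency = {eff:.2f}")
```

In this model, doubling the stripe size halves the maximum number of tasks,
while the efficiency gain from each doubling keeps shrinking. With only a few
columns the efficiency is already high at small stripe sizes, which is why a
smaller stripe may be the better choice there.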

By the way, I have a paper studying file formats that has some related
content. Here is the link:
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-13-5.pdf

Thanks,

Yin
On Tue, Nov 12, 2013 at 8:51 PM, Lefty Leverenz <[EMAIL PROTECTED]> wrote:

> If you get some useful advice, let's improve the doc.
>
> -- Lefty
>
>
> On Tue, Nov 12, 2013 at 6:15 PM, John Omernik <[EMAIL PROTECTED]> wrote:
>
>> I am looking for guidance (read: examples) on tuning ORC settings for my
>> data.  I see the documentation that shows the defaults, as well as a brief
>> description of what each setting is.  What I am looking for is some
>> examples of things to try.  *Note: I understand that nobody wants to make
>> sweeping declarations like "set this setting" without knowing the data.*
>> That said, I would love to see some examples, specifically around:
>>
>> orc.row.index.stride
>>
>> orc.compress.size
>>
>> orc.stripe.size
>>
>>
>> For example, I'd love to see some statements like:
>>
>>
>> If your data has lots of columns of small data, and you'd like better x,
>> try changing y setting because this allows Hive to do z when querying.
>>
>>
>> If your data has few columns of large data, try changing y and this
>> allows Hive to do z while querying.
>>
>>
>> It would be really neat to see some examples so we can get in and tune
>> our data. Right now, everything is a crapshoot for me, and I don't know if
>> there are detrimental effects that may make themselves known later.
>>
>>
>> Any input would be welcome.
>>
>
>