I have not played with the stride length. Based on my understanding of the
code, since the stride length determines the number of rows between index
entries, decreasing the stride length gives you more fine-grained indexes,
which can potentially help you skip more unnecessary rows (with predicate
pushdown). I think the benefit of a smaller stride length is pretty
workload dependent. For example, suppose you have a query like this:
SELECT c1 FROM tbl WHERE c1 > 10 AND c1 < 20;
If tbl is sorted by c1, you may observe that less data is read from HDFS
when you decrease the stride length. However, if tbl is not sorted by c1,
in the worst case, even if you set the stride length to its minimum value
(i.e. 1000), you will still see the entire c1 column loaded.
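For example, a smaller stride can be set per table when the table is created (a sketch, not something I have benchmarked; it assumes the orc.row.index.stride table property and the hive.optimize.index.filter setting that enables using the row indexes for predicate pushdown):

```sql
-- Sketch: decrease the row index stride from its default (10,000 rows)
-- so the row-group indexes are more fine-grained.
CREATE TABLE tbl (c1 INT, c2 STRING)
STORED AS ORC
TBLPROPERTIES ("orc.row.index.stride"="1000");

-- Enable predicate pushdown so the ORC reader can use the indexes
-- to skip row groups that cannot match the predicate.
SET hive.optimize.index.filter=true;
SELECT c1 FROM tbl WHERE c1 > 10 AND c1 < 20;
```

Again, whether the finer index pays off depends on whether the data is clustered on c1.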
On Wed, Nov 13, 2013 at 4:44 PM, John Omernik <[EMAIL PROTECTED]> wrote:
> Yin -
> Fantastic! That is exactly the type of explanation of settings I'd like to
> see. More than just what it does, but the tradeoffs, and how things are
> applied in the real world. Have you played with the stride length at all?
> On Wed, Nov 13, 2013 at 1:13 PM, Yin Huai <[EMAIL PROTECTED]> wrote:
>> Hi John,
>> Here is my experience on the stripe size. For a given table, when the
>> stripe size is increased, the size of a column in a stripe increases, which
>> means the ORC reader can read a column from disks in a more efficient way
>> because the reader can sequentially read more data (assuming the reader and
>> the HDFS block are co-located). But, a larger stripe size may decrease the
>> number of concurrent Map tasks reading an ORC file because a Map task needs
>> to process at least one stripe (seems a stripe is not splitable right now).
>> If you can get enough degree of parallelism, I think increasing the stripe
>> size generally gives you better data reading efficiency in one task.
>> However, on HDDs, the benefit from increasing the stripe size on data
>> reading efficiency in a Map task is getting smaller with the increase of
>> the stripe size. So, for a table with only a few columns (assuming a single
>> ORC file is used), using a smaller stripe size may not significantly affect
>> data reading efficiency in a task, and you can potentially have more
>> concurrent tasks to read this ORC file. So, I think you need to tradeoff
>> the data reading efficiency in a single task (larger stripe size -> better
>> data reading efficiency in a task) and the degree of parallelism (smaller
>> stripe size -> more concurrent tasks to read an ORC file) when determining
>> the right stripe size.
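>> For example, the stripe size can be set per table via TBLPROPERTIES when
>> the table is created (a sketch only; it assumes the orc.stripe.size
>> property, which takes a value in bytes):

```sql
-- Sketch: use 64MB stripes instead of the 256MB default, trading some
-- sequential-read efficiency per task for more concurrent Map tasks
-- reading a single ORC file.
CREATE TABLE tbl_orc (c1 INT, c2 STRING)
STORED AS ORC
TBLPROPERTIES ("orc.stripe.size"="67108864");
```

>> As above, I would lean toward the larger default unless you are starved
>> for parallelism on a narrow table.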
>> btw, I have a paper studying file formats, and it has some related
>> content. Here is the link:
>>> On Tue, Nov 12, 2013 at 8:51 PM, Lefty Leverenz <[EMAIL PROTECTED]> wrote:
>>> If you get some useful advice, let's improve the doc.
>>> -- Lefty
>>> On Tue, Nov 12, 2013 at 6:15 PM, John Omernik <[EMAIL PROTECTED]> wrote:
>>>> I am looking for guidance (read: examples) on tuning ORC settings for my
>>>> data. I see the documentation that shows the defaults, as well as a brief
>>>> description of what each setting does. What I am looking for is some
>>>> examples of things to try. *Note: I understand that nobody wants to make
>>>> sweeping declarations like "set this setting" without knowing the data.*
>>>> That said, I would love to see some examples, specifically around:
>>>> For example, I'd love to see some statements like:
>>>> If your data has lots of columns of small data, and you'd like better
>>>> x, try changing setting y, because this allows Hive to do z when querying.
>>>> If your data has few columns of large data, try changing y, and this
>>>> allows Hive to do z while querying.
>>>> It would be really neat to see some examples so we can get in and tune