Re: Hive sort by using a single reducer
Ruslan Al-Fakikh 2012-09-04, 18:55
SORT BY gives you only partially sorted results if you have more
than one reducer: each reducer's output is sorted, but the combined
output is not.
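
To make the point concrete, here is a toy model in plain Python (not Hive code). It assumes rows are dealt round-robin to the reducers, which is a simplification chosen so the result is deterministic; the function names sort_by and top_n are made up for this sketch. Each simulated reducer sorts only its own share, so a global top-N still needs one final merge step over the per-reducer results.

```python
import heapq
from itertools import islice


def sort_by(rows, num_reducers):
    """Toy model of SORT BY: each reducer sorts only the rows it receives."""
    buckets = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        # Assumption: round-robin distribution, purely for illustration.
        buckets[i % num_reducers].append(row)
    # Every reducer produces its own descending-sorted output.
    return [sorted(bucket, reverse=True) for bucket in buckets]


def top_n(sorted_buckets, n):
    """Global top n: keep n rows per reducer, then merge them in one place."""
    candidates = [bucket[:n] for bucket in sorted_buckets]
    # heapq.merge expects each input to be sorted in the requested order.
    return list(islice(heapq.merge(*candidates, reverse=True), n))


ranks = list(range(1, 101))
buckets = sort_by(ranks, 4)
print(buckets[0][:3])      # start of one reducer's output: [97, 93, 89]
print(top_n(buckets, 5))   # [100, 99, 98, 97, 96]
```

Concatenating the buckets gives 97, 93, ... 1, 98, 94, ... which is sorted within each reducer but not overall. Only the last merge step is single-threaded here, and it sees at most n rows per reducer rather than the whole data set.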
On Mon, Sep 3, 2012 at 1:38 AM, Binesh Gummadi <[EMAIL PROTECTED]> wrote:
> Thanks for your quick reply. Rank is a column which has integer data. I am
> writing to a DynamoDB database, though. I am not sure why only a single
> reducer is used. I will check the SQL with the EXPLAIN command again and
> will report my findings. I will check your implementation too.
> Binesh Gummadi
> On Sun, Sep 2, 2012 at 4:01 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
>> Sort by does not have the single-reducer restriction. Not sure which rank
>> you are using but any one should allow you to sort and rank if the query is
>> written correctly. Our implementation on my github.com/edwardcapriolo allows
>> On Sunday, September 2, 2012, Binesh Gummadi <[EMAIL PROTECTED]> wrote:
>> > I am trying to insert data into a table after selecting and sorting by a
>> > column. What I really want is to order by a column and select the top
>> > million rows. I am using Hive on Amazon EMR to process the data.
>> > Here is my query
>> > INSERT INTO TABLE ddb_table SELECT * FROM data_dump SORT BY rank DESC
>> > LIMIT 1000000;
>> > It creates two jobs. The first job runs rather quickly, but the second
>> > job's reducer runs forever because it is running with a single reducer.
>> > Here is my question on Stack Overflow
>> > (http://stackoverflow.com/questions/12233343/why-is-sort-by-always-using-single-reducer).
>> > According to the docs, the "order by" clause is limited to one reducer.
>> > Does "sort by" have the same limitation? Are there any other ways of
>> > meeting the above requirement?
>> > ________________________________
>> > Binesh Gummadi