Maximum Number of Hive Partitions = 256?
Time Less 2011-05-04, 01:51
I created a partitioned table, partitioned daily. If I query the earlier partitions, everything works. The later ones fail with error:
hive> select substr(user_name,1,1),count(*) from u_s_h_b where dtpartition='2010-10-24' group by substr(user_name,1,1) ;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
java.lang.ArrayIndexOutOfBoundsException: 0
        at org.apache.hadoop.mapred.FileInputFormat.identifyHosts(FileInputFormat.java:556)
        at org.apache.hadoop.mapred.FileInputFormat.getSplitHosts(FileInputFormat.java:524)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:235)
        ......snip.......
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Job Submission failed with exception 'java.lang.ArrayIndexOutOfBoundsException(0)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
It turns out that 2010-10-24 is 257 days from the very first partition in my dataset (2010-01-09):
| date_sub('2010-10-24',interval 257 day) |
+-----------------------------------------+
| 2010-02-09                              |
That seems like an interesting coincidence. But try as I might, the Great Googles will not show me a way to tune this, or even if it is tuneable, or expected. Has anyone else run into a 256-partition limit in Hive? How do you work around it? Why is that even the limit?! Shouldn't it be more like 32-bit maxint??!!
Thanks!
-- Tim Ellis Riot Games
RE: Maximum Number of Hive Partitions = 256?
Steven Wong 2011-05-04, 02:02
I have way more than 256 partitions per table. AFAIK, there is no partition limit.
From your stack trace, you have some host name issue somewhere.
Re: Maximum Number of Hive Partitions = 256?
Viral Bajaria 2011-05-04, 02:53
Same here: we have way more than 256 partitions in multiple tables. I am sure the issue has something to do with an empty string passed to the substr function. Can you validate that the table has no null/empty string for user_name, or try running the query with a filter such as length(user_name) > 1 (not sure about the exact query syntax)?
Re: Maximum Number of Hive Partitions = 256?
Time Less 2011-05-04, 17:53
> I am sure the issue has something to do with an empty string passed to the substr function.
We can rule out the substr() function. I get the same stack trace with any query like:
hive> select <anyColumn> from ushb where dtpartition='2010-10-25' limit 10;
But this query succeeds:
hive> select * from ushb where dtpartition='2010-10-25' limit 10 ;
So SOMETHING about the data makes Hive (Hadoop?) unhappy. More specifically, something about trying to select a particular column from the data on some days. I'm looking at the data to see if I can sort out what it is.
> I have way more than 256 partitions per table. AFAIK, there is no partition limit.
>
> From your stack trace, you have some host name issue somewhere.
I see why you'd think that from the stack trace, though I can't imagine why it'd have a "host name issue somewhere." The partition create statements have no hostname component. The query has no hostname component.
This is definitely a curious problem.
-- Tim Ellis Riot Games
Re: Maximum Number of Hive Partitions = 256?
Time Less 2011-05-04, 18:11
> This is definitely a curious problem.
It's data corruption. The file is tab-separated, so I created a quick Perl pipe to print out the number of tabs on a given line:
-bash-3.2$ hadoop fs -cat /user/hive/warehouse/ushb/2010-10-25/data-2010-10-25 | perl -pe 's/[^\t\n]//g' | perl -pe 's/\t/-/g' | sort | uniq -c
The STDOUT was slightly disturbing:
      1 --
1552318 -------
The STDERR more so:
11/05/04 11:07:49 INFO hdfs.DFSClient: No node available for block: blk_-1511269407958713809_10494 file=/user/hive/warehouse/ushb/2010-10-25/data-2010-10-25
11/05/04 11:07:49 INFO hdfs.DFSClient: Could not obtain block blk_-1511269407958713809_10494 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
11/05/04 11:07:52 INFO hdfs.DFSClient: No node available for block: blk_-1511269407958713809_10494 file=/user/hive/warehouse/ushb/2010-10-25/data-2010-10-25
11/05/04 11:07:52 INFO hdfs.DFSClient: Could not obtain block blk_-1511269407958713809_10494 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
11/05/04 11:07:58 WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_-1511269407958713809_10494 file=/user/hive/warehouse/ushb/2010-10-25/data-2010-10-25
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1977)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1784)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1932)
        at java.io.DataInputStream.read(DataInputStream.java:83)
        (...etc)
cat: Could not obtain block: blk_-1511269407958713809_10494 file=/user/hive/warehouse/ushb/2010-10-25/data-2010-10-25
-- Tim Ellis Riot Games
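For readers who would rather not chain Perl one-liners, the tab-count check in the message above can be sketched in Python. This is only an illustration of the same idea (count the tabs on each line, then tally how many lines share each count). The function name `tab_count_histogram` and the sample rows are made up for the example and are not from the original data.

```python
from collections import Counter
from typing import Iterable


def tab_count_histogram(lines: Iterable[str]) -> Counter:
    """Map 'tabs on a line' to 'number of lines with that many tabs'."""
    return Counter(line.count("\t") for line in lines)


# Hypothetical sample: three well-formed rows (7 tabs each) and one
# truncated row (2 tabs), mimicking the shape of the output above.
sample = [
    "a\tb\tc\td\te\tf\tg\th\n",
    "i\tj\tk\tl\tm\tn\to\tp\n",
    "q\tr\ts\n",
    "t\tu\tv\tw\tx\ty\tz\t0\n",
]

hist = tab_count_histogram(sample)
for tabs, n_lines in sorted(hist.items()):
    print(f"{n_lines:>8} line(s) with {tabs} tab(s)")
```

In practice the lines could come from `sys.stdin` with `hadoop fs -cat` piped in, as in the original command. Any line whose tab count differs from the majority is a candidate for a truncated or corrupt record.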
Re: Maximum Number of Hive Partitions = 256?
Time Less 2011-05-04, 17:58
> It turns out that 2010-10-24 is 257 days from the very first partition in my dataset (2010-01-09):
>
> | date_sub('2010-10-24',interval 257 day) |
> +-----------------------------------------+
> | 2010-02-09                              |
I just noticed that 257 days before 2010-10-24 is FEBRUARY 9th, not JANUARY 9th, as the output above shows. So there was never any 256-ness to this problem in the first place. The human brain tends to pay attention to beginnings and ends of strings, ignoring the middle.
-- Tim Ellis Riot Games
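The date arithmetic behind that correction can be double-checked with a short Python sketch. The two dates are the ones quoted in the thread; the variable names are chosen just for this example.

```python
from datetime import date, timedelta

query_partition = date(2010, 10, 24)  # partition that failed
first_partition = date(2010, 1, 9)    # first partition in the dataset

# Actual distance between the first partition and the failing one.
days_since_first = (query_partition - first_partition).days

# What the date_sub() call in the thread computed: 257 days back.
back_257 = query_partition - timedelta(days=257)

print(days_since_first)  # 288
print(back_257)          # 2010-02-09
```

The failing partition is 288 days after the first one, so nothing about it lines up with a 256 boundary.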