Hive, mail # user - Bug in Hive Split function (Tested on Hive 0.9 and 0.11)


Re: Bug in Hive Split function (Tested on Hive 0.9 and 0.11)
John Omernik 2013-10-09, 21:21
I opened a JIRA on this: https://issues.apache.org/jira/browse/HIVE-5506
On Wed, Oct 9, 2013 at 9:44 AM, John Omernik <[EMAIL PROTECTED]> wrote:

> Hello all, I think I have found a bug in the Hive split function:
>
> Summary: When split is called on a string, it returns all array items
> only if the last item has a value. For example, if I have a tab-delimited
> string with 7 columns where the first four are filled and the last three
> are blank, split returns only a 4-element array. If any number of
> "middle" columns are empty but the last item still has a value, it
> returns the proper number of elements. This was tested in Hive 0.9 and
> Hive 0.11.
>
> Data:
> (Note: \t represents a tab character (\x09). The line endings should be
> \n (UNIX style); I am not sure what email will do to them.) Basically, my
> data is 7 lines, each holding the first 7 letters of the alphabet
> separated by tabs. On some lines I've left out certain letters but kept
> the number of tabs exactly the same.
>
> input.txt
> a\tb\tc\td\te\tf\tg
> a\tb\tc\td\te\t\tg
> a\tb\t\td\t\tf\tg
> \t\t\td\te\tf\tg
> a\tb\tc\td\t\t\t
> a\t\t\t\te\tf\tg
> a\t\t\td\t\t\tg
>
> I then created a table with one column from that data:
>
>
> DROP TABLE tmp_jo_tab_test;
> CREATE TABLE tmp_jo_tab_test (message_line STRING)
> STORED AS TEXTFILE;
>
> LOAD DATA LOCAL INPATH '/tmp/input.txt'
> OVERWRITE INTO TABLE tmp_jo_tab_test;
>
>
> OK, just to validate, I created a Python counting script:
>
> #!/usr/bin/python
> import sys
>
> for line in sys.stdin:
>     # Strip only the trailing newline so trailing tabs are preserved.
>     line = line.rstrip("\n")
>     # Python's split keeps empty fields, including trailing ones.
>     out = line.split("\t")
>     print(len(out))
>
>
> The output there is:
>
> $ cat input.txt | ./cnt_tabs.py
> 7
> 7
> 7
> 7
> 7
> 7
> 7
>
>
> Based on that information, split on tab should return me 7 for each line
> as well:
>
>
> hive -e "select size(split(message_line, '\\t')) from tmp_jo_tab_test;"
>
> 7
> 7
> 7
> 7
> 4
> 7
> 7
>
>
> However, it does not. The line where only the first four letters are
> filled in (and the last three fields are blank) returns only 4 elements,
> where there should technically be 7: 4 for the letters included, plus
> three blanks.
>
>
> a\tb\tc\td\t\t\t
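
The counts reported above are consistent with a split that discards trailing empty fields. Java's String.split(regex), called without a limit argument, is documented to drop trailing empty strings; whether Hive's split relies on that call is an assumption here, not something confirmed in this thread. The sketch below only emulates the two behaviors in Python. The helper names split_keep_trailing and split_drop_trailing are hypothetical and exist purely for illustration.

```python
# Sample rows from the report, written with real tab characters.
ROWS = [
    "a\tb\tc\td\te\tf\tg",
    "a\tb\tc\td\te\t\tg",
    "a\tb\t\td\t\tf\tg",
    "\t\t\td\te\tf\tg",
    "a\tb\tc\td\t\t\t",
    "a\t\t\t\te\tf\tg",
    "a\t\t\td\t\t\tg",
]


def split_keep_trailing(line, sep="\t"):
    """Plain Python split: empty fields, including trailing ones, are kept."""
    return line.split(sep)


def split_drop_trailing(line, sep="\t"):
    """Hypothetical helper: split, then discard trailing empty fields.

    This approximates a splitter that trims trailing empties; it is not
    Hive's actual implementation.
    """
    parts = line.split(sep)
    while parts and parts[-1] == "":
        parts.pop()
    return parts


keep_counts = [len(split_keep_trailing(row)) for row in ROWS]
drop_counts = [len(split_drop_trailing(row)) for row in ROWS]

print(keep_counts)  # field counts when trailing empties are kept
print(drop_counts)  # field counts when trailing empties are dropped
```

Under the drop-trailing rule, only the fifth row (a\tb\tc\td\t\t\t) loses fields, which matches the single 4 in the Hive output; leading and middle empty fields are unaffected in both variants.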