Hive >> mail # user >> Bug in Hive Split function (Tested on Hive 0.9 and 0.11)


Bug in Hive Split function (Tested on Hive 0.9 and 0.11)
Hello all, I think I have found a bug in the Hive split function, outlined below:

Summary: When split is called on a string, it returns all array items only
if the last item has a value. For example, given a tab-delimited string
with 7 columns in which the first four are filled and the last three are
blank, split returns only a 4-element array. If any number of "middle"
columns are empty but the last item still has a value, it returns the
correct number of columns. This was tested in Hive 0.9 and Hive 0.11.
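
The rule in the summary can be modeled in a few lines of Python. This is only a sketch of the observed behavior, not Hive's implementation: the function names are made up for illustration, and the "drop trailing empty fields" step is an assumption chosen to match the symptom. (Java's String.split(regex) discards trailing empty strings by default, which may be related, but nothing in this message confirms that as the cause.)

```python
def split_keep_all(text, sep="\t"):
    """Plain Python split: keeps every field, including trailing empty ones."""
    return text.split(sep)


def split_drop_trailing_empty(text, sep="\t"):
    """Hypothetical model of the reported behavior: trailing empty fields are discarded."""
    fields = text.split(sep)
    while fields and fields[-1] == "":
        fields.pop()
    return fields


rows = {
    "last three blank": "a\tb\tc\td\t\t\t",
    "middle blank": "a\t\t\td\t\t\tg",
}
for label, row in rows.items():
    # Prints the full field count next to the count after dropping trailing empties.
    print(label, len(split_keep_all(row)), len(split_drop_trailing_empty(row)))
```

Under this model only the row with blank trailing columns loses fields; empty middle columns are counted either way.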

Data:
(Note: \t represents a tab character, \x09. The line endings should be \n
(UNIX style); I am not sure what email will do to them.) Basically, my data
is 7 lines, each holding the first 7 letters of the alphabet separated by
tabs. On some lines I have left out certain letters but kept the number of
tabs exactly the same.

input.txt
a\tb\tc\td\te\tf\tg
a\tb\tc\td\te\t\tg
a\tb\t\td\t\tf\tg
\t\t\td\te\tf\tg
a\tb\tc\td\t\t\t
a\t\t\t\te\tf\tg
a\t\t\td\t\t\tg
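
Since email may mangle the tabs and line endings, one way to recreate the file with real tab characters and \n line endings is a short script like the one below. It is a sketch only: the ROWS list is transcribed from the lines above, write_input is a name invented here, and the file goes to the system temp directory (typically /tmp on Linux, matching the path used in the LOAD DATA statement).

```python
import os
import tempfile

# Each row has exactly 7 fields; an empty string stands for a blank column.
ROWS = [
    ["a", "b", "c", "d", "e", "f", "g"],
    ["a", "b", "c", "d", "e", "", "g"],
    ["a", "b", "", "d", "", "f", "g"],
    ["", "", "", "d", "e", "f", "g"],
    ["a", "b", "c", "d", "", "", ""],
    ["a", "", "", "", "e", "f", "g"],
    ["a", "", "", "d", "", "", "g"],
]


def write_input(path):
    """Write ROWS as tab-separated lines with UNIX (\\n) line endings."""
    with open(path, "w", newline="\n") as handle:
        for row in ROWS:
            handle.write("\t".join(row) + "\n")


path = os.path.join(tempfile.gettempdir(), "input.txt")
write_input(path)
print(path)
```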

I then created a table with one column from that data:
DROP TABLE tmp_jo_tab_test;

CREATE TABLE tmp_jo_tab_test (message_line STRING)
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/input.txt'
OVERWRITE INTO TABLE tmp_jo_tab_test;
OK, just to validate, I created a Python counting script:

#!/usr/bin/python

import sys

for line in sys.stdin:
    # Drop only the trailing newline so that trailing tabs are preserved.
    line = line.rstrip("\n")
    # Count the tab-separated fields, empty ones included.
    out = line.split("\t")
    print(len(out))
The output there is:

$ cat input.txt | ./cnt_tabs.py
7
7
7
7
7
7
7
Based on that information, split on tab should also return 7 for each line:

hive -e "select size(split(message_line, '\\t')) from tmp_jo_tab_test;"

7
7
7
7
4
7
7
However, it does not. It appears that the line where only the first four
letters are filled in (and blanks are passed in for the last three) returns
only 4 splits, where there should technically be 7: 4 for the letters
included and 3 blanks.

a\tb\tc\td\t\t\t
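
As an independent check on the expected count, the number of fields in a delimited line is the number of delimiters plus one, regardless of which fields are empty. The sketch below applies that to the problem line; expected_field_count is a hypothetical helper name, not part of Hive or the script above.

```python
def expected_field_count(line, sep="\t"):
    """Fields in a delimited line: one more than the number of delimiters."""
    return line.count(sep) + 1


problem_line = "a\tb\tc\td\t\t\t"
print(expected_field_count(problem_line))  # 6 tabs, so 7 fields
```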