Hive >> mail # user >> Bug in Hive Split function (Tested on Hive 0.9 and 0.11)


John Omernik 2013-10-09, 14:44
Re: Bug in Hive Split function (Tested on Hive 0.9 and 0.11)
I opened a JIRA on this: https://issues.apache.org/jira/browse/HIVE-5506
On Wed, Oct 9, 2013 at 9:44 AM, John Omernik <[EMAIL PROTECTED]> wrote:

> Hello all, I think I have found a bug in the Hive split function:
>
> Summary: When split is called on a string, it returns all array items
> only if the last array item has a value. For example, if I have a string
> of tab-delimited text with 7 columns, and the first four are filled but
> the last three are blank, split returns only a 4-position array. If any
> number of "middle" columns are empty but the last item still has a value,
> it returns the proper number of columns. This was tested in Hive 0.9 and
> Hive 0.11.
>
> Data:
> (Note: \t represents a tab character, \x09. The line endings should be \n
> (UNIX style); I am not sure what email will do to them.) Basically, my
> data is 7 lines, each holding the first 7 letters of the alphabet
> separated by tabs. On some lines I've left out certain letters, but kept
> the number of tabs exactly the same.
>
> input.txt
> a\tb\tc\td\te\tf\tg
> a\tb\tc\td\te\t\tg
> a\tb\t\td\t\tf\tg
> \t\t\td\te\tf\tg
> a\tb\tc\td\t\t\t
> a\t\t\t\te\tf\tg
> a\t\t\td\t\t\tg
>
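Since email may mangle the tabs and line endings, the file can also be regenerated rather than copied. The following is a minimal sketch, assuming Python 3; the function name write_input is made up for illustration, and the rows are transcribed from the listing above with "\t" taken to mean a literal tab.

```python
# Rows transcribed from the message; "\t" is a literal tab in each string.
ROWS = [
    "a\tb\tc\td\te\tf\tg",
    "a\tb\tc\td\te\t\tg",
    "a\tb\t\td\t\tf\tg",
    "\t\t\td\te\tf\tg",
    "a\tb\tc\td\t\t\t",
    "a\t\t\t\te\tf\tg",
    "a\t\t\td\t\t\tg",
]


def write_input(path):
    """Write the rows to `path`, one per line, forcing UNIX (\\n) endings."""
    # newline="\n" stops Python from translating "\n" on platforms that
    # would otherwise write "\r\n".
    with open(path, "w", newline="\n") as handle:
        for row in ROWS:
            handle.write(row + "\n")
```

Calling write_input("/tmp/input.txt") would produce the file at the path used in the LOAD DATA statement in this message. Every row contains exactly six tabs, so any splitter that keeps empty fields should report 7 columns per line.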
> I then created a table with one column from that data:
>
> DROP TABLE tmp_jo_tab_test;
> CREATE TABLE tmp_jo_tab_test (message_line STRING)
> STORED AS TEXTFILE;
>
> LOAD DATA LOCAL INPATH '/tmp/input.txt'
> OVERWRITE INTO TABLE tmp_jo_tab_test;
>
>
> OK, just to validate, I created a Python counting script:
>
> #!/usr/bin/python
>
> import sys
>
> for line in sys.stdin:
>     # Drop the trailing newline, then count the tab-separated fields.
>     line = line.rstrip("\n")
>     out = line.split("\t")
>     print(len(out))
>
>
> The output there is:
>
> $ cat input.txt | ./cnt_tabs.py
> 7
> 7
> 7
> 7
> 7
> 7
> 7
>
>
> Based on that information, split on tab should return me 7 for each line
> as well:
>
>
> hive -e "select size(split(message_line, '\\t')) from tmp_jo_tab_test;"
>
> 7
> 7
> 7
> 7
> 4
> 7
> 7
>
>
> However, it does not. The line where only the first four letters are
> filled in (and blanks are passed for the last three) returns only 4
> splits, where there should technically be 7: 4 for the letters included,
> plus three blanks.
>
>
> a\tb\tc\td\t\t\t
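The counts above are consistent with a splitter that discards trailing empty strings, which is what Java's String.split does when called without a limit argument. Whether Hive's split is implemented that way is an assumption here, not something confirmed in this thread. The sketch below (Python 3, with made-up function names) compares a split that keeps every field against one that drops trailing empties, using the same seven rows.

```python
# Rows transcribed from the message; "\t" is a literal tab in each string.
ROWS = [
    "a\tb\tc\td\te\tf\tg",
    "a\tb\tc\td\te\t\tg",
    "a\tb\t\td\t\tf\tg",
    "\t\t\td\te\tf\tg",
    "a\tb\tc\td\t\t\t",
    "a\t\t\t\te\tf\tg",
    "a\t\t\td\t\t\tg",
]


def split_keep_all(line, sep="\t"):
    """Split on `sep`, keeping empty fields everywhere (Python's behavior)."""
    return line.split(sep)


def split_drop_trailing_empty(line, sep="\t"):
    """Split on `sep`, then discard empty strings at the end of the result.

    Assumption: this mimics what the Hive query appears to be doing.
    """
    parts = line.split(sep)
    while parts and parts[-1] == "":
        parts.pop()
    return parts


if __name__ == "__main__":
    for row in ROWS:
        # First column: expected size; second column: size with trailing
        # empty fields dropped.
        print(len(split_keep_all(row)), len(split_drop_trailing_empty(row)))
```

The first column is 7 for every row. The second column is 7, 7, 7, 7, 4, 7, 7, matching the Hive output: empty fields in the middle survive, and only the row ending in three blanks is shortened.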