Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> regexp_replace with unicode chars


Copy link to this message
-
Re: regexp_replace with unicode chars
The translate UDF does take care of non-ascii characters. It uses
codepoints instead of characters.

Here is the unit test to demonstrate that:
https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/udf_translate.q#L36

But you guys are right. It doesn't solve Tom's original problem
because it doesn't take ranges.

I created HIVE-4100 for improving regex_replace UDF.
Mark

On Fri, Mar 1, 2013 at 9:31 AM, Dean Wampler
<[EMAIL PROTECTED]> wrote:
> Anyone know if translate takes ranges, like some implementations? e.g.,
>
> translate ('[a-z]', '[A-Z]')
>
> Of course, that probably doesn't work for non-ascii characters.
>
>
> On Fri, Mar 1, 2013 at 11:24 AM, Tom Hall <[EMAIL PROTECTED]> wrote:
>>
>> Thanks Dean,
>>
>> I dont think translate would work as the set of things to remove is
>> massive.
>> Yeah, it's a one-off cleanup job while exporting to try redshift on our
>> datasets.
>> My guess is it's something about the way hive handles strings? Tried
>> "\\ufffd" as the replacement str but no joy either.
>>
>> Cheers again,
>> Tom
>>
>>
>>
>> On 1 March 2013 17:08, Dean Wampler <[EMAIL PROTECTED]>
>> wrote:
>>>
>>> I think this should work, but you might investigate using the translate
>>> function instead. I suspect it will provide much better performance than
>>> using regexps. Also, Are you planning to do this once to create your final
>>> tables? If so, the performance overhead won't matter much.
>>>
>>> dean
>>>
>>>
>>> On Fri, Mar 1, 2013 at 10:52 AM, Tom Hall <[EMAIL PROTECTED]>
>>> wrote:
>>>>
>>>> I would like to remove unicode chars that are outside the Basic
>>>> Multilingual Plane [1]
>>>>
>>>> I thought
>>>> select regexp_replace(some_column,"[^\\u0000-\\uffff]","\ufffd") from
>>>> my_table
>>>> would work but while the regexp does work the replacement str does not
>>>> (I can paste in the literal �, which you may or may not be able to see here
>>>> but it somehow did not fell right)
>>>>
>>>> I saw Deans previous post on using octals [2] but I think \ufffd is
>>>> outside the allowable range.
>>>>
>>>> Cheers,
>>>> Tom
>>>>
>>>>
>>>> [1]
>>>> http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane
>>>> [2]
>>>> http://grokbase.com/t/hive/dev/131a4n562y/unicode-character-as-delimiter
>>>
>>>
>>>
>>>
>>> --
>>> Dean Wampler, Ph.D.
>>> thinkbiganalytics.com
>>> +1-312-339-1330
>>>
>>
>
>
>
> --
> Dean Wampler, Ph.D.
> thinkbiganalytics.com
> +1-312-339-1330
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB