Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> regexp_replace with unicode chars


Copy link to this message
-
Re: regexp_replace with unicode chars
Anyone know if translate takes ranges, like some implementations? e.g.,

translate ('[a-z]', '[A-Z]')

Of course, that probably doesn't work for non-ascii characters.

On Fri, Mar 1, 2013 at 11:24 AM, Tom Hall <[EMAIL PROTECTED]> wrote:

> Thanks Dean,
>
> I dont think translate would work as the set of things to remove is
> massive.
> Yeah, it's a one-off cleanup job while exporting to try redshift on our
> datasets.
> My guess is it's something about the way hive handles strings? Tried
> "\\ufffd" as the replacement str but no joy either.
>
> Cheers again,
> Tom
>
>
>
> On 1 March 2013 17:08, Dean Wampler <[EMAIL PROTECTED]>wrote:
>
>> I think this should work, but you might investigate using the translate
>> function instead. I suspect it will provide much better performance than
>> using regexps. Also, Are you planning to do this once to create your final
>> tables? If so, the performance overhead won't matter much.
>>
>> dean
>>
>>
>> On Fri, Mar 1, 2013 at 10:52 AM, Tom Hall <[EMAIL PROTECTED]>wrote:
>>
>>> I would like to remove unicode chars that are outside the Basic
>>> Multilingual Plane [1]
>>>
>>> I thought
>>> select regexp_replace(some_column,"[^\\u0000-\\uffff]","\ufffd") from
>>> my_table
>>> would work but while the regexp does work the replacement str does not
>>> (I can paste in the literal �, which you may or may not be able to see here
>>> but it somehow did not fell right)
>>>
>>> I saw Deans previous post on using octals [2] but I think \ufffd is
>>> outside the allowable range.
>>>
>>> Cheers,
>>> Tom
>>>
>>>
>>> [1]
>>> http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane
>>> [2]
>>> http://grokbase.com/t/hive/dev/131a4n562y/unicode-character-as-delimiter
>>>
>>
>>
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>>
>
--
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330