Hive >> mail # user >> regexp_replace with unicode chars


Re: regexp_replace with unicode chars
I think this should work, but you might investigate using the translate
function instead. I suspect it will provide much better performance than
using regexps. Also, are you planning to do this once to create your final
tables? If so, the performance overhead won't matter much.
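
For checking the pattern outside Hive, here is a minimal Java sketch. It assumes that Hive's regexp_replace follows Java regular expression semantics, and it does not reproduce how Hive parses escapes inside a quoted string literal, which may be where \ufffd gets lost. The class name BmpFilter and the method replaceNonBmp are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Hive.
final class BmpFilter {
    // Matches any single code point above the Basic Multilingual Plane.
    // Java's regex engine matches by code point, so a surrogate pair
    // in the input is treated as one character by this class.
    private static final Pattern NON_BMP = Pattern.compile("[^\\u0000-\\uFFFF]");

    // The Unicode replacement character, built from its numeric value so
    // that no escape handling in the source file or a shell is involved.
    private static final String REPLACEMENT = String.valueOf((char) 0xFFFD);

    // Returns the input with every non-BMP code point replaced.
    static String replaceNonBmp(String input) {
        return NON_BMP.matcher(input).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        // Code point 0x1F600 lies outside the BMP (two UTF-16 units).
        String emoji = new String(Character.toChars(0x1F600));
        String cleaned = replaceNonBmp("a" + emoji + "b");
        System.out.println(cleaned.length()); // prints 3
    }
}
```

If the pattern behaves as expected here but the Hive query still emits the wrong replacement, that would point at the string-literal escaping in the query rather than at the regexp itself.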

dean

On Fri, Mar 1, 2013 at 10:52 AM, Tom Hall <[EMAIL PROTECTED]> wrote:

> I would like to remove unicode chars that are outside the Basic
> Multilingual Plane [1]
>
> I thought
> select regexp_replace(some_column,"[^\\u0000-\\uffff]","\ufffd") from
> my_table
> would work, but while the regexp does work, the replacement string does
> not. (I can paste in the literal �, which you may or may not be able to
> see here, but it somehow did not feel right.)
>
> I saw Dean's previous post on using octals [2], but I think \ufffd is
> outside the allowable range.
>
> Cheers,
> Tom
>
>
> [1]
> http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane
> [2]
> http://grokbase.com/t/hive/dev/131a4n562y/unicode-character-as-delimiter
>

--
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330