Escaping utf8mb4 characters

I want to remove utf8mb4 characters from string β€œπŸŽ”πŸŽ”MessageπŸŽ”πŸŽ”β€ and insert into column which supports utf8/utf8mb3 column. I want the output as β€œMessage”. Also want to remove any other utf8mb4 chars in future. I tried SELECT REGEXP_REPLACE(β€˜πŸŽ”πŸŽ”MessageπŸŽ”πŸŽ”β€™, β€˜/[\u1F300-\u1F600]/’, β€˜β€™); this didnt remove the characters. I am getting U+1F394 currently in my string. Is there a way to do it?

It turns out that if you cast utfmb4 strings to utf8mb3, they will be truncated at the first non-utf8mb3 character.

You can use this to write a function to check if a character is an extended utf8mb4 character (one that is not utf8mb3), as follows:

delimiter //
create or replace function is_utf8mb4_extended_char(c varchar(1) collate utf8mb4_bin) returns bool as
begin
     return (length(c :> text character set utf8 collate utf8_bin) = 0);
end //
delimiter ;

-- so this
select is_utf8mb4_extended_char('X'), is_utf8mb4_extended_char('😜');

-- outputs 0, 1

You can then write another function that loops through all characters in a utf8mb4 string to create a version of that string with all the utf8mb4 extended characters removed, by concatenating the true utf8mb3 character together. If you get one working, please post it here for others to use :slight_smile:

It might be a good option to look at WASM.