Question marks - "?" in UTF-8 to ASCII conversion
Environment
- Red Hat Enterprise Linux (RHEL) - all versions
- iconv
Issue
Converting some UTF-8 characters to ASCII results in ? characters, for example:
$ iconv -f UTF-8 -t ASCII//TRANSLIT <<< 'I❤️ASCII ЯRавсде áčďéěíňóřšťú'
I?ASCII ?R????? acdeeinorstu
Resolution
Create an additional set of custom rules for problematic characters:
$ echo 'I❤️ASCII ЯRавсде áčďéěíňóřšťú' | sed 'y/авсд/avsd/; s/❤️/(heart)/g; s/Я/JA/g; s/е/je/g; ' | iconv -f UTF-8 -t ASCII//TRANSLIT
I(heart)ASCII JARavsdje acdeeinorstu
For clarity and easier maintenance, put the rules in a conversion.sed file:
# transliteration 1:1
y/авсд/avsd/
# transliteration 1:n
s/❤️/(heart)/g
s/Я/JA/g
s/е/je/g
# ...
# ...
# ...
Example of use:
$ echo 'I❤️ASCII ЯRавсде áčďéěíňóřšťú' | sed -f conversion.sed | iconv -f UTF-8 -t ASCII//TRANSLIT
I(heart)ASCII JARavsdje acdeeinorstu
Note: The conversion may not be fully automatic, because the rules in the conversion.sed file depend on the user, who has to write them for the characters that occur in the input.
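Because the rules have to be extended by hand, it can help to find the input lines that still produce question marks. The following is a minimal sketch, not part of the original solution: the function names count_qmarks and find_unconverted are hypothetical, and it assumes a rules file such as the conversion.sed shown above. It prints every input line for which iconv emitted more ? characters than were present after the sed step.

```shell
#!/bin/sh
# Sketch only: count_qmarks and find_unconverted are illustrative names,
# not part of iconv or sed.

# Count the literal "?" characters read from standard input.
count_qmarks() {
    tr -cd '?' | wc -c | tr -d ' '
}

# Read lines from standard input and print those for which iconv
# introduced new "?" characters, i.e. lines that still need a rule.
# $1: path to the sed rules file (assumed default: conversion.sed)
find_unconverted() {
    rules=${1:-conversion.sed}
    while IFS= read -r line || [ -n "$line" ]; do
        # Apply the custom rules first, then transliterate the rest.
        prepared=$(printf '%s\n' "$line" | sed -f "$rules")
        converted=$(printf '%s\n' "$prepared" | iconv -f UTF-8 -t ASCII//TRANSLIT)
        before=$(printf '%s' "$prepared" | count_qmarks)
        after=$(printf '%s' "$converted" | count_qmarks)
        if [ "$after" -gt "$before" ]; then
            printf '%s\n' "$line"
        fi
    done
}
```

Example of use, where input.txt is a placeholder for your own data:
$ find_unconverted conversion.sed < input.txt
Each line that is printed contains at least one character that needs a new rule in conversion.sed.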
Root Cause
Some characters cannot be transliterated unambiguously, so iconv replaces them with ?. See the iconv man page:
If the string `//TRANSLIT` is appended to to-encoding, characters being converted are transliterated when needed and possible. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similar looking characters. Characters that are outside of the target character set and cannot be transliterated are replaced with a question mark (?) in the output.
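The substitution described above applies only when //TRANSLIT is used. As a rough illustration, the sketch below relies on a strict conversion instead: without //TRANSLIT, iconv as shipped with glibc reports an error at the first character that has no ASCII representation, rather than printing ?. The helper name is_plain_ascii is hypothetical, and it is intended for single-line strings only.

```shell
# Sketch only: is_plain_ascii is an illustrative name, not an iconv feature.
# Succeeds if the argument is unchanged by a strict UTF-8 to ASCII
# conversion, which means it needs no transliteration at all.
is_plain_ascii() {
    # A strict conversion fails (or alters the text) on non-ASCII input.
    out=$(printf '%s' "$1" | iconv -f UTF-8 -t ASCII 2>/dev/null) || return 1
    [ "$out" = "$1" ]
}
```

Example of use:
$ is_plain_ascii 'some text' && echo 'no transliteration needed'
A string that fails this check contains characters that will be either approximated or replaced with ? once //TRANSLIT is applied.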
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.