Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
453462 | 694863 | 2006 | 10-page PDF | Free download |

The ISO 10646 Universal Character Set (UCS) covers the symbols of most of the world's written languages. There are various UCS transformation formats (UTF), but UTF-8 is the most important one because it is compatible with both software systems and communication systems that assume 8-bit characters. First, three properties that a UTF-8-like transformation format should satisfy are defined, so that the main characteristics of UTF-8 are preserved. Then, a derived 5-byte sequence with 31 free bits is used to construct a UTF-8-like transformation format that is capable of resolving the dummy byte sequences locally. After that, we show that if the last-byte patterns of the 3-byte and 4-byte sequences in the UTF-8-like transformation format are replaced with the byte pattern 1xxxxxxx, two more free bits can be gained for the 3-byte and 4-byte sequences. The final version of the derived UTF-8-like transformation format, UTF-8M, is proved to require the minimal average storage for encoding a UCS-4 character, 16.3% less than UTF-8 requires.
Journal: Computer Standards & Interfaces - Volume 28, Issue 6, September 2006, Pages 650–659
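
To make the notion of "free bits" in the abstract concrete, the following minimal Python sketch counts the payload (`x`) bits in the byte patterns of the original 1-byte to 6-byte UTF-8 for 31-bit UCS-4, and shows the arithmetic effect of replacing a trailing `10xxxxxx` with `1xxxxxxx`. It illustrates standard UTF-8 only, not the paper's UTF-8M, whose byte patterns are not given in the abstract. The names `UTF8_PATTERNS`, `free_bits`, `relax_last_byte`, and `encode_ucs4` are hypothetical and introduced here for illustration.

```python
# Byte patterns of the original UTF-8 covering 31-bit UCS-4 (1 to 6 bytes).
# This is standard UTF-8, NOT the UTF-8M format derived in the paper.
UTF8_PATTERNS = {
    1: ["0xxxxxxx"],
    2: ["110xxxxx", "10xxxxxx"],
    3: ["1110xxxx", "10xxxxxx", "10xxxxxx"],
    4: ["11110xxx", "10xxxxxx", "10xxxxxx", "10xxxxxx"],
    5: ["111110xx", "10xxxxxx", "10xxxxxx", "10xxxxxx", "10xxxxxx"],
    6: ["1111110x", "10xxxxxx", "10xxxxxx", "10xxxxxx", "10xxxxxx", "10xxxxxx"],
}


def free_bits(patterns):
    """Count the payload ('x') bits available in a byte sequence."""
    return sum(p.count("x") for p in patterns)


def relax_last_byte(patterns):
    """Illustrative assumption: swap the last byte pattern for 1xxxxxxx."""
    return patterns[:-1] + ["1xxxxxxx"]


def encode_ucs4(cp):
    """Encode a UCS-4 code point with the original (up to 6-byte) UTF-8."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit UCS-4 range")
    if cp < 0x80:
        return bytes([cp])
    # Pick the shortest sequence whose free bits can hold the code point.
    for length in range(2, 7):
        if cp < (1 << free_bits(UTF8_PATTERNS[length])):
            break
    out = []
    for _ in range(length - 1):
        out.append(0x80 | (cp & 0x3F))  # continuation byte 10xxxxxx
        cp >>= 6
    lead = (0xFF << (8 - length)) & 0xFF  # 110..., 1110..., and so on
    out.append(lead | cp)
    return bytes(reversed(out))


if __name__ == "__main__":
    for n, pats in UTF8_PATTERNS.items():
        print(n, "byte(s):", free_bits(pats), "free bits")
    for n in (3, 4):
        print(n, "bytes, relaxed last byte:",
              free_bits(relax_last_byte(UTF8_PATTERNS[n])), "free bits")
```

Under this counting, each relaxed sequence gains exactly one bit, which is consistent with the abstract's figure of two additional free bits across the 3-byte and 4-byte sequences. Whether such a substitution preserves the defining properties of a UTF-8-like format is what the paper itself addresses.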