What is Unicode?

SMS defines a 160-character limit on the length of a single text message — but what’s a character? Here’s a simplified explanation; we’re not going too deep into the details, because they involve finer points than most people need to know, but a few Google searches can lead you down some interesting rabbit holes.

There are two common ways of sending alphabetic characters over SMS. The GSM-7 character encoding standard was developed by GSM Corp. in the 1980s for messaging with pagers and later adopted for SMS. It defines the most commonly used letters and symbols in English and many Western European languages in seven bits each for use on GSM networks. Seven bits allow 128 values; one of them is reserved as an escape code, which is why you’ll see GSM-7 described as supporting either 127 or 128 characters. The original standard called for a message payload of 128 8-bit octets, or bytes. The payload was later lengthened to 140 bytes, or 1,120 bits, which is exactly enough for 160 7-bit GSM-7 characters.
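The 140-byte arithmetic is easy to check. The short Python sketch below is only an illustration; the function name chars_per_payload is made up for this example and does not come from any SMS specification.

```python
def chars_per_payload(payload_bytes: int, bits_per_char: int) -> int:
    """Return how many whole characters of a given bit width fit in a payload."""
    total_bits = payload_bytes * 8           # each byte (octet) is 8 bits
    return total_bits // bits_per_char       # partial characters don't count


# 140 bytes = 1,120 bits, and 1,120 / 7 = 160 GSM-7 characters.
print(chars_per_payload(140, 7))
```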

Of course, 128 characters aren’t enough to represent all of the characters in use in languages around the world. To do that, a working group called Unicode created the Universal Coded Character Set (UCS-2), a standard developed in 1987 that defines each character in two 8-bit bytes. Two bytes give 2^16 possible values, which works out to 65,536 characters. UCS-2 was adopted as part of the ISO/IEC 10646 standard in 1990, and was informally called Unicode.

UCS-2 offered more characters than GSM-7, but it was just a stopgap. In 1991 the Unicode working group that developed UCS-2 founded Unicode, Inc., as a nonprofit organization whose mission was to “enable people around the world to use computers in any language, by providing freely-available specifications and data to form the foundation for software internationalization.” The consortium issued a better encoding standard, UTF-8, in 1993. (UTF stands for Unicode Transformation Format.) UCS-2 is now obsolete as an independent standard, and UTF-8 has become the most common encoding standard for characters on electronic devices. 

Unicode assigns a unique code, called a code point, to each character. Code points are written as U+ followed by a hexadecimal number, usually padded to at least four digits. For example, U+0041 is the code point for “A.”
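If you want to see code points for yourself, a few lines of Python will do it. This is a minimal sketch; the helper name code_point_label is our own invention, while ord() and chr() are Python built-ins that convert between a character and its code point number.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"    # hexadecimal, padded to at least four digits


for ch in ["A", "\u20ac", "\U0001F600"]:    # A, euro sign, grinning-face emoji
    print(ch, code_point_label(ch))

# Going the other way: code point number back to a character.
print(chr(0x0041))
```

Note that the emoji’s code point needs five hexadecimal digits, so it can’t fit in the 16 bits that UCS-2 allows.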

UTF-8 is efficient in its use of bits. It uses variable-length encoding, which means that if a character can be represented with a single byte, that’s all the space UTF-8 will use. If a character needs more, UTF-8 will use as many bytes as necessary, up to a maximum of four.
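To make the variable-length idea concrete, here is a small Python sketch that counts the bytes UTF-8 needs for a few sample characters. The sample characters and the helper name utf8_byte_count are our own choices for illustration; str.encode is part of Python’s standard library.

```python
def utf8_byte_count(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))


# A (U+0041), e-acute (U+00E9), euro sign (U+20AC), grinning face (U+1F600)
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X} takes {utf8_byte_count(ch)} byte(s) in UTF-8")
```

The four samples take one, two, three, and four bytes respectively.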

UTF-8 was followed by UTF-16 in 1996. The current Unicode standard defines a possible 1,114,112 code points, grouped into 17 planes of 65,536 code points each. The range that UCS-2 could address is the first of those planes, formally called the Basic Multilingual Plane, so it lives on.
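The plane arithmetic can be sketched in Python as well. The constant and function names here (PLANE_SIZE, PLANE_COUNT, plane_of) are hypothetical names chosen for this example.

```python
PLANE_SIZE = 0x10000     # 65,536 code points per plane
PLANE_COUNT = 17         # planes 0 through 16


def plane_of(ch: str) -> int:
    """Return the number of the plane that a character's code point falls in."""
    return ord(ch) // PLANE_SIZE


print(PLANE_SIZE * PLANE_COUNT)      # total possible code points
print(plane_of("A"))                 # plane 0 is the Basic Multilingual Plane
print(plane_of("\U0001F600"))        # this emoji sits in a higher plane
```

Anything outside plane 0 is a character that UCS-2, with its fixed 16 bits, has no room for.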

We’re not going to dive any deeper into the technology and history of Unicode, other than to say version 15.0, the current version as of September 2022, defines 149,186 characters, including not only the characters for every language represented electronically but also a wide assortment of emoji. Every new version adds more emoji.

But let’s bring our characters back into text messaging. In the early 1990s, the telecom industry was developing the Short Message Service (SMS). At the time, UCS-2 was the best alternative to GSM-7, so it was incorporated into SMS standards, and there it remains today.

Why do you need to care about any of this technical underpinning? In a word, cost. UCS-2 takes 16 bits to encode each character instead of the seven bits used by the GSM character set, so when a message includes even one character that GSM-7 can’t represent, the whole message is sent as UCS-2 and can have a maximum of 70 characters (140 bytes divided by 2 bytes per character). Just as with GSM characters, you can send messages longer than the limit, but SMS has to chop them into multiple segments, send them separately, and concatenate them back together at the receiving end. The more segments you send, the more costs you incur.
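To get a rough feel for how a message will be billed, you can estimate its segment count. The Python sketch below is a simplified estimate, not a carrier-accurate calculator, and it rests on assumptions that go beyond this article: that each part of a concatenated message gives up 6 bytes to a header, leaving room for 153 GSM-7 or 67 UCS-2 characters per segment; that the GSM-7 extension characters cost two 7-bit slots each; and that the character tables written out below match your carrier’s. The function name sms_segments is our own, and the estimate ignores the rule that a two-slot character can’t be split across segments.

```python
import math

# GSM-7 basic character set, written out by hand (escape code omitted).
GSM7_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extension characters: each one is sent as escape + code, so it costs 2 slots.
GSM7_EXTENDED = set("^{}\\[~]|\u20ac\f")


def sms_segments(text: str) -> tuple[str, int]:
    """Estimate the encoding and the number of segments for a message."""
    if all(ch in GSM7_BASIC or ch in GSM7_EXTENDED for ch in text):
        encoding = "GSM-7"
        units = sum(2 if ch in GSM7_EXTENDED else 1 for ch in text)
        single_limit, multi_limit = 160, 153
    else:
        encoding = "UCS-2"
        # Count 16-bit units; characters outside plane 0 take two of them.
        units = len(text.encode("utf-16-be")) // 2
        single_limit, multi_limit = 70, 67
    if units <= single_limit:
        return encoding, 1
    return encoding, math.ceil(units / multi_limit)


print(sms_segments("Your order has shipped."))
print(sms_segments("Your order has shipped \U0001F600"))
```

A single emoji is enough to switch the whole message to UCS-2, which is why an otherwise plain message can suddenly cost two or three segments instead of one.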