JavaScript, Unicode and strings

Or: what are those weird symbols in 'content'?

February 2023, last updated March 2023

What is that "Unicode" thing?

We can think of Unicode as a database that matches any symbol you can think of to a unique number (code point) and to a unique canonical name. This way you can refer to the symbol without using the symbol.

Unicode code points are normally written like this:

A capital "U" and a plus sign, followed by at least four hexadecimal digits.

An example would be: U+0041 for symbol "A".
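As a rough illustration of that notation, the sketch below turns a symbol into its "U+XXXX" form. The helper name toUnicodeNotation is made up for this example; codePointAt, toString and padStart are standard string and number methods.

```javascript
// Format the code point of a symbol in "U+XXXX" notation.
// `toUnicodeNotation` is a hypothetical helper, not a built-in.
function toUnicodeNotation(symbol) {
  const codePoint = symbol.codePointAt(0); // numeric code point of the first symbol
  const hex = codePoint.toString(16).toUpperCase();
  return "U+" + hex.padStart(4, "0"); // pad to at least four hex digits
}

console.log(toUnicodeNotation("A")); // U+0041
console.log(toUnicodeNotation("€")); // U+20AC
```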

There are about 1.1 million possible code points (1,114,112 to be exact). These are divided into 17 planes of 65,536 code points each.

The first plane is the "Basic Multilingual Plane" (BMP) and contains all the commonly used symbols (code points U+0000 to U+FFFF).

All the rest are known as "astral planes" or "supplementary planes".
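Since every plane holds 65,536 (0x10000) code points, the plane a symbol lives in can be worked out with a single division. This is only a sketch, and planeOf is a name invented for the example:

```javascript
// Return the index of the plane (0 to 16) that contains a code point.
// `planeOf` is a hypothetical helper, not a built-in.
function planeOf(codePoint) {
  return Math.floor(codePoint / 0x10000); // 0x10000 code points per plane
}

console.log(planeOf(0x0041));   // 0, the BMP
console.log(planeOf(0x1F61B));  // 1, an astral plane
console.log(planeOf(0x10FFFF)); // 16, the last plane
console.log(17 * 0x10000);      // 1114112 code points in total
```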

And how does Unicode apply to JS strings, then?

JavaScript uses "escape sequences" to refer to symbols by their number. You have probably seen this syntax before:

\x44 // only two hex digits, so it covers just 256 symbols. The biggest number is \xFF

// or this other syntax

\u0042 // exactly four hex digits, so it can reference 65,536 code points (all included in the BMP)
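A quick way to convince yourself is to compare the escapes with the plain symbols inside string literals. A minimal sketch:

```javascript
// Hexadecimal escape: exactly two hex digits, from \x00 to \xFF.
console.log("\x44" === "D"); // true

// Unicode escape: exactly four hex digits, from \u0000 to \uFFFF.
console.log("\u0042" === "B"); // true

// Both notations can name the same symbol.
console.log("\x41" === "\u0041"); // true
```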

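The astral symbols need this other syntax, which takes any number of hex digits between the braces:

\u{23}\u{79}\u{25} // BMP symbols work too: this is "#y%"

\u{1F61B} // an astral symbol; we can reference all Unicode symbols this way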

* Well, that is not 100% true, as this is ES6 syntax.

For ES5 JavaScript engines we need something called "surrogate pairs": two \uXXXX escapes that together encode one astral symbol 😛
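To make the idea less mysterious, here is a sketch of how a surrogate pair can be computed from an astral code point, following the UTF-16 arithmetic. The helper name toSurrogatePair is invented for the example, and the code itself uses ES6 syntax purely for illustration.

```javascript
// Split an astral code point (above U+FFFF) into its two surrogate halves.
// `toSurrogatePair` is a hypothetical helper, not a built-in.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F61B);
console.log(high.toString(16), low.toString(16)); // d83d de1b

// The ES5-friendly pair and the ES6 escape produce the same string.
console.log("\uD83D\uDE1B" === "\u{1F61B}"); // true
console.log("\u{1F61B}".length); // 2, because length counts 16-bit units
```

This also explains why an astral symbol reports a length of 2: the string stores it as two 16-bit units.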
