🔰Unicode
🚧 under construction -> tidy this page
Every Unicode character is assigned a code point.
Code points are divided into 17 planes of 65,536 code points each.
One or more code points can be combined into a single grapheme cluster.
A character encoding transforms code points into code units.
JavaScript strings are sequences of UTF-16 code units.
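To make the terms above concrete, here is a minimal sketch that counts code units, code points, and grapheme clusters for two sample strings. The sample characters and the helper name planeOf are illustrative assumptions, and the grapheme count relies on Intl.Segmenter, which older runtimes may not provide.

```javascript
// Assumed sample: U+1F600 GRINNING FACE, a character outside the Basic Multilingual Plane.
const face = "\u{1F600}";

// .length counts UTF-16 code units, so this single character reports 2.
const unitCount = face.length;

// The string iterator walks code points, so spreading yields 1 element.
const pointCount = [...face].length;

// codePointAt(0) decodes the surrogate pair; charCodeAt reads one code unit at a time.
const codePoint = face.codePointAt(0);    // 0x1F600
const highSurrogate = face.charCodeAt(0); // 0xD83D
const lowSurrogate = face.charCodeAt(1);  // 0xDE00

// Hypothetical helper: each plane holds 0x10000 code points,
// so the plane index is the code point shifted right by 16 bits.
const planeOf = (cp) => cp >> 16;

// Assumed sample: "e" followed by U+0301 COMBINING ACUTE ACCENT.
// Two code points that render as one grapheme cluster.
const accented = "e\u0301";

// Intl.Segmenter groups code points into grapheme clusters (guarded for older runtimes).
const clusterCount =
  typeof Intl !== "undefined" && typeof Intl.Segmenter === "function"
    ? [...new Intl.Segmenter("en", { granularity: "grapheme" }).segment(accented)].length
    : undefined;

console.log(unitCount, pointCount, codePoint.toString(16)); // 2 1 1f600
console.log(planeOf(0x41), planeOf(codePoint), planeOf(0x10ffff)); // 0 1 16
console.log(accented.length, [...accented].length, clusterCount);
```

The last line shows why code points are still not "characters": the accented letter has two code units and two code points, but a segmenter reports one grapheme cluster.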
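By default, regular expressions work on code units, not actual characters❗
Characters that are composed of two code units (e.g. most emojis) are then treated as two separate items, so ".", character classes and quantifiers apply to only half of the character.
Enable the "/u" flag to make regular expressions work on code points.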
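The difference the "/u" flag makes can be seen with a single emoji. This is a small sketch with an assumed sample character; the patterns are built with the RegExp constructor only so that the same source string can be compiled with and without the flag.

```javascript
// Assumed sample: U+1F600, stored as two UTF-16 code units.
const face = "\u{1F600}";

// "." matches one code unit without /u, and one code point with /u.
const dotWithoutU = /^.$/.test(face); // false
const dotWithU = /^.$/u.test(face);   // true

// In a character class, the emoji is seen as two separate surrogates without /u.
const classWithoutU = new RegExp("^[" + face + "]$").test(face);   // false
const classWithU = new RegExp("^[" + face + "]$", "u").test(face); // true

// A quantifier applies only to the trailing surrogate without /u.
const twice = face + face;
const quantWithoutU = new RegExp("^" + face + "{2}$").test(twice);   // false
const quantWithU = new RegExp("^" + face + "{2}$", "u").test(twice); // true

console.log(dotWithoutU, dotWithU);
console.log(classWithoutU, classWithU);
console.log(quantWithoutU, quantWithU);
```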
str.slice2() - returns a substring (supports Unicode).
String extensions - str.codeUnits, str.codePoints ...
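As a rough illustration of what a Unicode-aware slice has to do, the sketch below slices by code points instead of code units. The function name sliceByCodePoints is hypothetical; it is not a built-in and is not claimed to be how str.slice2() is implemented.

```javascript
// Hypothetical helper: slice a string by code point positions.
function sliceByCodePoints(str, start = 0, end = undefined) {
  // Array.from uses the string iterator, which yields one element per code point.
  return Array.from(str).slice(start, end).join("");
}

// Assumed sample: "a", U+1F600, "b" (4 code units, 3 code points).
const text = "a\u{1F600}b";

// The built-in slice counts code units and cuts the emoji in half.
const byUnits = text.slice(1, 2); // lone high surrogate "\uD83D"

// The sketch counts code points and keeps the emoji whole.
const byPoints = sliceByCodePoints(text, 1, 2); // "\u{1F600}"

console.log(byUnits.length, byPoints.length); // 1 2
```

This still splits grapheme clusters that consist of several code points; handling those would need a segmenter rather than the string iterator.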
Unicode Utilities:
Character Properties - list all properties of a character
UnicodeSet - list all characters that have a given property
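\p{Script=Han}, \p{sc=Han} - Han characters (漢字).
\p{Letter}, \p{L} - a letter in any language.
\p{Number}, \p{N} - a numeric character of any kind (not only decimal digits).
\p{Po} - punctuation (other); "," belongs to this category ⭐️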
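The property escapes listed above can be tried out directly. This is a minimal sketch with assumed sample characters; note that \p{...} is only recognized when the "/u" flag is set.

```javascript
// Long and short forms of the script property are equivalent.
const hanLong = /^\p{Script=Han}$/u.test("漢"); // true
const hanShort = /^\p{sc=Han}$/u.test("漢");    // true
const latinIsHan = /^\p{sc=Han}$/u.test("a");   // false

// \p{L} matches letters from any script; digits, spaces and punctuation are skipped.
const letters = "aЯ中 1!".match(/\p{L}/gu); // ["a", "Я", "中"]

// \p{N} covers decimal digits and other numeric characters such as Roman numerals.
const asciiDigit = /^\p{N}$/u.test("5");    // true
const romanNumeral = /^\p{N}$/u.test("Ⅷ"); // true (U+2167, a letter number)

// \p{Po} matches "other" punctuation such as "," and "!", but not brackets or dashes.
const commaIsPo = /^\p{Po}$/u.test(",");     // true
const fullwidthCommaIsPo = /^\p{Po}$/u.test("，"); // true (U+FF0C)
const parenIsPo = /^\p{Po}$/u.test("(");     // false (open punctuation, Ps)
const hyphenIsPo = /^\p{Po}$/u.test("-");    // false (dash punctuation, Pd)

console.log(hanLong, hanShort, latinIsHan);
console.log(letters);
console.log(asciiDigit, romanNumeral);
console.log(commaIsPo, fullwidthCommaIsPo, parenIsPo, hyphenIsPo);
```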
codepen ⟩ regex: Unicode
Main categories and subcategories:
Letter (L): lowercase Ll, modifier Lm, titlecase Lt, uppercase Lu, other Lo.
Number (N): decimal digit Nd, letter number Nl, other No.
Punctuation (P): connector Pc, dash Pd, initial quote Pi, final quote Pf, open Ps, close Pe, other Po.
Mark (M) (accents etc.): spacing combining Mc, enclosing Me, non-spacing Mn.
Symbol (S): currency Sc, modifier Sk, math Sm, other So.
Separator (Z): line Zl, paragraph Zp, space Zs.
Other (C): control Cc, format Cf, not assigned Cn, private use Co, surrogate Cs.
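The two-letter subcategory codes can be used in \p{...} just like the main categories. The sketch below looks up the subcategory of a single character by trying each code in turn; the helper name generalCategory and the sample characters are assumptions for illustration, and the approach is meant for exploration rather than speed.

```javascript
// Subcategory codes from the list above.
const SUBCATEGORIES = [
  "Ll", "Lm", "Lt", "Lu", "Lo",
  "Nd", "Nl", "No",
  "Pc", "Pd", "Pi", "Pf", "Ps", "Pe", "Po",
  "Mc", "Me", "Mn",
  "Sc", "Sk", "Sm", "So",
  "Zl", "Zp", "Zs",
  "Cc", "Cf", "Cn", "Co", "Cs",
];

// Hypothetical helper: return the subcategory code of a single code point,
// found by testing one \p{..} pattern per code (requires the "u" flag).
function generalCategory(ch) {
  return SUBCATEGORIES.find((gc) => new RegExp(`^\\p{${gc}}$`, "u").test(ch));
}

console.log(generalCategory("A"));      // "Lu" uppercase letter
console.log(generalCategory("a"));      // "Ll" lowercase letter
console.log(generalCategory("中"));     // "Lo" other letter
console.log(generalCategory("5"));      // "Nd" decimal digit
console.log(generalCategory("_"));      // "Pc" connector punctuation
console.log(generalCategory("("));      // "Ps" open punctuation
console.log(generalCategory(","));      // "Po" other punctuation
console.log(generalCategory("\u0301")); // "Mn" non-spacing mark
console.log(generalCategory("$"));      // "Sc" currency symbol
console.log(generalCategory("+"));      // "Sm" math symbol
console.log(generalCategory(" "));      // "Zs" space separator
console.log(generalCategory("\n"));     // "Cc" control
```

A main category such as \p{L} matches whenever any of its subcategories does, so /\p{L}/u accepts every character for which this helper returns a code starting with "L".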