🔰Unicode
🚧 under construction -> tidy this page
Every Unicode character is assigned a code point (a number from U+0000 to U+10FFFF).
Code points are divided into 17 planes of 65,536 code points each.
One or more code points can combine into a single grapheme cluster (what a reader perceives as one character).
A character encoding transforms code points into code units.
JavaScript strings are exposed as sequences of UTF-16 code units.
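The terms above can be checked in any JavaScript console. The following is a minimal sketch, assuming a runtime with ES2015 string methods; the helper name `describe` is made up for illustration and is not part of this page's code.

```javascript
// Inspect one character: its code point, plane, and UTF-16 code units.
// `describe` is a hypothetical helper name used only in this sketch.
function describe(ch) {
  const codePoint = ch.codePointAt(0);
  // Each of the 17 planes holds 0x10000 (65,536) code points.
  const plane = Math.floor(codePoint / 0x10000);
  const codeUnits = [];
  for (let i = 0; i < ch.length; i++) codeUnits.push(ch.charCodeAt(i));
  return { codePoint, plane, codeUnits };
}

const a = describe("A");
// { codePoint: 65, plane: 0, codeUnits: [ 65 ] }
const apple = describe("\u{1F34E}"); // 🍎
// { codePoint: 127822, plane: 1, codeUnits: [ 55356, 57166 ] }

// A grapheme cluster can span several code points:
// "e" followed by U+0301 (combining acute accent) renders as one "é".
const eAcute = "e\u0301";
const codePointCount = [...eAcute].length; // 2 code points
const codeUnitCount = eAcute.length;       // 2 code units

console.log(a, apple, codePointCount, codeUnitCount);
```

Note that `length` counts code units, and string iteration (`[...str]`) counts code points; neither counts grapheme clusters.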
By default, regular expressions work on code units, not on actual characters❗
Characters that are encoded as two code units (e.g. emojis) are treated as two separate characters, so quantifiers and "." apply to only one half of them.
Enable the "/u" flag to make regular expressions work on code points.
str.slice2() - returns a substring (supports Unicode).
String extension - str.codeUnits, str.codePoints ...
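The implementations of `slice2`, `codeUnits` and `codePoints` are not shown on this page. As a rough sketch of how such helpers might work, the standalone functions below (`toCodeUnits`, `toCodePoints`, `sliceByCodePoints` are hypothetical names) rely on the fact that string iteration walks by code point rather than by code unit.

```javascript
// Hypothetical helpers; the page's own slice2 / codeUnits / codePoints may differ.

// Array of UTF-16 code units (numbers).
function toCodeUnits(str) {
  return Array.from({ length: str.length }, (_, i) => str.charCodeAt(i));
}

// Array of code points (numbers). Array.from(str) iterates by code point.
function toCodePoints(str) {
  return Array.from(str, ch => ch.codePointAt(0));
}

// Like str.slice(), but start/end count code points instead of code units.
function sliceByCodePoints(str, start, end) {
  return Array.from(str).slice(start, end).join("");
}

const s = "a\u{1F34E}b"; // "a🍎b"
console.log(toCodeUnits(s));             // [ 97, 55356, 57166, 98 ]
console.log(toCodePoints(s));            // [ 97, 127822, 98 ]
console.log(s.slice(1, 2) === "\uD83C"); // true: slice() cut the emoji in half
console.log(sliceByCodePoints(s, 1, 2)); // 🍎
```

This sketch works per code point, so a grapheme cluster made of several code points (for example a letter plus a combining accent) can still be split.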
Unicode Utilities:
Character Properties - list all properties by a character
UnicodeSet - list all characters by a property
const {log} = console;
const convert = require('./Converter.js');
[
convert.stringToCodeUnits("🍎"), // [ 55356=a, 57166=b ]
/🍎{3}/.test("🍎🍎🍎"), // false❗️
// assume 🍎 = ab (2 code units)
// 🍎{3} = ab{3} = abbb ≠ ababab = 🍎🍎🍎
convert.stringToCodeUnits("🌹"), // [ 55356=a, 57145=c ]
/<.>/.test("<🌹>"), // false❗️
// assume 🌹 = ac (2 code units)
// <.> != <ac>
/<.>/u.test("<🌹>"), // true ⭐️
// ✅ enable /u flag
// search for Han characters (漢字)
`Hello Привет 你好`.match(/\p{sc=Han}/gu), // [ '你', '好' ]
// Script
/\p{Script=Greek}/u.test("α"), // → true
/\p{Script=Arabic}/u.test("α"), // → false
// Alphabetic
/\p{Alphabetic}/u.test("α"), // → true
/\p{Alphabetic}/u.test("!"), // → false
/\p{Alphabetic}/u.test("漢"), // → true
].forEach(x => log(x));

const {log} = console;
/*
- Alpha (Alphabetic) : letters
- M (Mark) : accents
- Nd (Decimal_Number) : digits
- Pc (Connector_Punctuation): underscore '_' and similar characters,
- Po (Other_Punctuation) : other punctuation, e.g. '!', '?', ',' and '.'
- Join_C (Join_Control) : (200c, 200d) used in ligatures, e.g. in Arabic.
*/
// multi-language "word" character (like "\w", but in a Unicode sense)
let char = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]/gu;
let Po = /[\p{Po}]/gu;
let P = /[\p{P}]/gu;
let str = `Hi、你好,(+US$12.34-)。`;
[
str.match(char), // ["H","i","你","好","U","S","1","2","3","4"]
str.match(Po), // ["、",",",".","。"] ⭐️
str.match(P), // ["、",",","(",".","-",")","。"]
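].forEach(x => log(x));

- \p{Script=Han}, \p{sc=Han} : Han characters (漢字)
- \p{Letter}, \p{L} : a letter in any language
- \p{Number}, \p{N} : a numeric character (digits and other numbers)
- \p{Po} : punctuation (other); the comma “,” belongs to this category ⭐️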
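The property escapes listed above can be combined to pull different kinds of characters out of mixed-language text. This is a small sketch, assuming a runtime that supports `\p{…}` with the `u` flag (ES2018); the sample string is made up for illustration.

```javascript
const text = "Hello Привет 你好, 123!";

// Runs of letters (plus combining marks) in any script.
const words = text.match(/[\p{L}\p{M}]+/gu);
// [ 'Hello', 'Привет', '你好' ]

// Runs of numeric characters.
const numbers = text.match(/\p{N}+/gu);
// [ '123' ]

// Characters selected by script.
const han = text.match(/\p{Script=Han}/gu);
// [ '你', '好' ]
const cyrillic = text.match(/\p{sc=Cyrillic}+/gu);
// [ 'Привет' ]

// "Other punctuation": here the comma and the exclamation mark.
const po = text.match(/\p{Po}/gu);
// [ ',', '!' ]

console.log(words, numbers, han, cyrillic, po);
```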
codepen ⟩ regex: Unicode
main categories and subcategories
- Letter (L): lowercase (Ll), modifier (Lm), titlecase (Lt), uppercase (Lu), other (Lo)
- Number (N): decimal digit (Nd), letter number (Nl), other (No)
- Punctuation (P): connector (Pc), dash (Pd), initial quote (Pi), final quote (Pf), open (Ps), close (Pe), other (Po)
- Mark (M, accents etc.): spacing combining (Mc), enclosing (Me), non-spacing (Mn)
- Symbol (S): currency (Sc), modifier (Sk), math (Sm), other (So)
- Separator (Z): line (Zl), paragraph (Zp), space (Zs)
- Other (C): control (Cc), format (Cf), not assigned (Cn), private use (Co), surrogate (Cs)
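Since every code point belongs to exactly one subcategory, the two-letter names above can be tried one by one to find out which category a character is in. The sketch below does this with dynamically built regular expressions; `generalCategory` and `SUBCATEGORIES` are hypothetical names, and the approach is meant for exploration rather than speed.

```javascript
// The two-letter General_Category values listed above.
const SUBCATEGORIES = [
  "Ll", "Lm", "Lt", "Lu", "Lo",
  "Nd", "Nl", "No",
  "Pc", "Pd", "Pi", "Pf", "Ps", "Pe", "Po",
  "Mc", "Me", "Mn",
  "Sc", "Sk", "Sm", "So",
  "Zl", "Zp", "Zs",
  "Cc", "Cf", "Cn", "Co", "Cs",
];

// Returns the subcategory of the first code point in `ch`.
// `generalCategory` is a hypothetical helper name for this sketch.
function generalCategory(ch) {
  const first = String.fromCodePoint(ch.codePointAt(0));
  return SUBCATEGORIES.find(name =>
    new RegExp(`^\\p{${name}}$`, "u").test(first));
}

console.log(generalCategory("a"));  // Ll
console.log(generalCategory("A"));  // Lu
console.log(generalCategory("漢")); // Lo
console.log(generalCategory("5"));  // Nd
console.log(generalCategory("_"));  // Pc
console.log(generalCategory("("));  // Ps
console.log(generalCategory("$"));  // Sc
console.log(generalCategory(" "));  // Zs
```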