grapheme cluster
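the characters a reader actually sees on screen or paper (user-perceived characters, i.e. grapheme clusters) may each be composed of one or more code points.
the '/u' flag makes a regex match by code point, so it still doesn't treat a grapheme cluster as one character:
/<.>/u.test("<🏳️‍🌈>"), // false (the flag is 4 code points, not 1)
/<....>/u.test("<🏳️‍🌈>"), // true (one '.' per code point)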
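A minimal sketch of the same idea outside of regex: one visible symbol can have different lengths depending on the unit counted. `countUnits` is a hypothetical helper name used only for this illustration.

```javascript
// Hypothetical helper: count one string in two different units.
function countUnits(s) {
  return {
    codeUnits: s.length,       // .length counts UTF-16 code units
    codePoints: [...s].length, // string iteration walks code points
  };
}

const decomposed = "e\u0301";                 // "e" + COMBINING ACUTE ACCENT
const composed = decomposed.normalize("NFC"); // single code point U+00E9

console.log(countUnits(decomposed));  // { codeUnits: 2, codePoints: 2 }
console.log(countUnits(composed));    // { codeUnits: 1, codePoints: 1 }
console.log(decomposed === composed); // false, although both render as "é"

// the rainbow flag from the examples above, written with escapes
console.log(countUnits("\u{1F3F3}\uFE0F\u200D\u{1F308}")); // { codeUnits: 6, codePoints: 4 }
```

Neither count equals the number of symbols a reader sees (one in every case here), which is why '.' with the '/u' flag fails on the flag.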
when working with text, it's best to split at the boundaries of grapheme clusters, not at the boundaries of Unicode characters (code points).
• see: Grapheme Clusters; Atoms of Text
• when talking about the actual rendered image, the term glyph is used.
• unless you have very specific requirements or are able to query the font, use an API that segments strings into grapheme clusters wherever you need to deal with the notion of "character", for example grapheme-splitter (GitHub).
• few languages treat the grapheme cluster as the default unit of a string; Swift and Raku (formerly Perl 6) are the commonly cited examples.
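The advice above can be sketched with the built-in Intl.Segmenter, swapped in here for grapheme-splitter (whose API is not shown on this page). This assumes a runtime that ships Intl.Segmenter, such as a recent Node.js or browser; `splitGraphemes` is a hypothetical helper name.

```javascript
// Split a string into grapheme clusters with the built-in Intl.Segmenter.
// `splitGraphemes` is an illustrative name, not something from the original notes.
function splitGraphemes(text, locale = undefined) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  // segment() returns an iterable of { segment, index, input } records
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const flag = "\u{1F3F3}\uFE0F\u200D\u{1F308}"; // rainbow flag: 4 code points
const text = `<${flag}>`;

console.log(splitGraphemes(text).length); // 3  ("<", the flag, ">")
console.log([...text].length);            // 6  code points
console.log(text.length);                 // 8  code units
```

Splitting this way keeps the flag in one piece, so operations such as truncating or reversing text don't cut a symbol in half.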
replit: /u flag, require, String extension
// import (the extension provides the codeUnits / codePoints getters used below)
const _String = require('./ext/String_ext.js'); // String extension
[
  '🍎'.codeUnits, // [ 55356, 57166 ] // surrogate pair (2 code units)
  '🍎'.codePoints, // [ 127822 ] // 1 code point
  '🏳️‍🌈'.codeUnits, // [ 55356, 57331, 65039, 8205, 55356, 57096 ] // 6 code units
  '🏳️‍🌈'.codePoints, // [ 127987, 65039, 8205, 127752 ] // 4 code points
  // --------------------------------------------
  // regex works on "code units" by default
  // --------------------------------------------
  /🍎{3}/.test("🍎🍎🍎"), // false
  // let 🍎 = ab (where a = 55356, b = 57166)
  // then /🍎{3}/ = /ab{3}/ = /abbb/
  // which is not /🍎🍎🍎/ = /ababab/
  /<.>/.test("<🍎>"), // false
  // <🍎> = <ab> (two code units), which does not match <.>
  // --------------------
  // enable the /u flag
  // --------------------
  /<.>/u.test("<🍎>"), // true
  // the '/u' flag doesn't do well with grapheme clusters:
  /<.>/u.test("<🏳️‍🌈>"), // false (the flag is 4 code points)
  /<....>/u.test("<🏳️‍🌈>"), // true (one '.' per code point)
  // match Han characters (漢字)
  `Hello Привет 你好`.match(/\p{sc=Han}/gu), // [ '你', '好' ]
  // Script
  /\p{Script=Greek}/u.test("α"), // true
  /\p{Script=Arabic}/u.test("α"), // false
  // Alphabetic
  /\p{Alphabetic}/u.test("α"), // true
  /\p{Alphabetic}/u.test("!"), // false
  /\p{Alphabetic}/u.test("漢"), // true
].forEach(x => console.log(x));
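The listing relies on './ext/String_ext.js', which is not shown here. As a rough guess at what such an extension might contain, the sketch below defines `codeUnits` and `codePoints` getters on String.prototype; it is a hypothetical stand-in, not the author's actual file.

```javascript
// Hypothetical stand-in for './ext/String_ext.js' (the real file is not shown).
Object.defineProperties(String.prototype, {
  codeUnits: {
    configurable: true,
    get() {
      const s = String(this); // unwrap the boxed receiver to a primitive string
      // one entry per UTF-16 code unit
      return Array.from({ length: s.length }, (_, i) => s.charCodeAt(i));
    },
  },
  codePoints: {
    configurable: true,
    get() {
      // Array.from iterates a string by code point, not by code unit
      return Array.from(String(this), (ch) => ch.codePointAt(0));
    },
  },
});

console.log("\u{1F34E}".codeUnits);  // [ 55356, 57166 ]
console.log("\u{1F34E}".codePoints); // [ 127822 ]
```

Extending a built-in prototype keeps call sites short, as in the listing, but it can clash with other code; plain functions such as `codePoints(str)` would be the more conservative choice.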