# Unicode


{% hint style="success" %}

* every <mark style="color:purple;">**Unicode**</mark> <mark style="color:yellow;">**character**</mark> is assigned a [**code point**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/code-point).
* [**code points**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/code-point) are divided into <mark style="color:yellow;">**17**</mark> [**code planes**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/code-plane).
* one or more [**code points**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/code-point) can be <mark style="color:yellow;">**combined**</mark> into a single [**grapheme cluster**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/grapheme-cluster).
* [**character encoding**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/encode) transforms [**code points**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/code-point) into [**code units**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/encode/utf16/code-unit).
* <mark style="color:yellow;">**JavaScript**</mark> strings are exposed as sequences of [<mark style="color:yellow;">**UTF-16**</mark>](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/encode/utf16) code units (engines may store them differently internally).
  {% endhint %}
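
The points above can be sketched in a few lines of JavaScript. This is a minimal illustration, not part of the original notes: `toSurrogatePair` is a hypothetical helper written here for clarity, and the grapheme count assumes a runtime that provides `Intl.Segmenter`.

```javascript
// Sketch: code point → plane → UTF-16 code units → grapheme clusters.
const apple = "🍎";                        // U+1F34E, outside the Basic Multilingual Plane

// Each plane holds 0x10000 code points, so the plane is the code point shifted right by 16 bits.
const codePoint = apple.codePointAt(0);    // 0x1F34E
const plane = codePoint >> 16;             // 1 (planes are numbered 0..16)

// Hypothetical helper: UTF-16 encoding of a code point above U+FFFF as a surrogate pair.
function toSurrogatePair(cp) {
    const offset = cp - 0x10000;
    const high = 0xD800 + (offset >> 10);    // top 10 bits
    const low  = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
    return [high, low];
}

const pair = toSurrogatePair(codePoint);   // [55356, 57166], the two code units of "🍎"

// "e" followed by a combining acute accent: two code points, one grapheme cluster.
const accented = "e\u0301";
const codeUnitCount  = accented.length;        // counts code units
const codePointCount = [...accented].length;   // string iteration yields code points
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemeCount = [...segmenter.segment(accented)].length;

console.log({ codePoint: codePoint.toString(16), plane, pair });
console.log({ codeUnitCount, codePointCount, graphemeCount });
```

Note that `apple.length` is 2 even though the string shows a single character, because `length` counts code units rather than code points.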

{% tabs %}
{% tab title="🗺️" %} <img src="https://2527454625-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MfvEFZnSBhKT6fJmus0%2Fuploads%2FeAPkVT4OexgWTMyevBZY%2Fcode.point.svg?alt=media&#x26;token=c612329d-0a48-4cfc-9949-764991d69905" alt="Unicode" class="gitbook-drawing">
{% endtab %}

{% tab title="🧨" %}
{% hint style="danger" %}

* By default, [**regular expressions**](https://lochiwei.gitbook.io/web/js/val/builtin/regex) work on [**code units**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/encode/utf16/code-unit), <mark style="color:red;">**not**</mark> <mark style="color:yellow;">**actual characters**</mark>❗
* Characters that are composed of **two code units** (e.g. most **emojis**) are treated as two separate units, so `.` matches only half of the character and a quantifier applies only to the second code unit.
* Enable the "<mark style="color:yellow;">`/u`</mark>" [**flag**](https://lochiwei.gitbook.io/web/js/val/builtin/regex/flag) to support <mark style="color:purple;">**Unicode**</mark> in [**regular expressions**](https://lochiwei.gitbook.io/web/js/val/builtin/regex).
  {% endhint %}
  {% endtab %}

{% tab title="🔴" %}

* [code-unit](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/encode/utf16/code-unit "mention")
* [str.slice2](https://lochiwei.gitbook.io/web/js/val/prim/str/method/str.slice2 "mention") - returns a substring (supports Unicode).
  {% endtab %}

{% tab title="👥 Related" %}

* [ext](https://lochiwei.gitbook.io/web/js/val/prim/str/ext "mention") - str.codeUnits, str.codePoints ...
  {% endtab %}

{% tab title="🛠" %}

* Unicode Utilities:
  * [Character Properties](https://util.unicode.org/UnicodeJsps/character.jsp) - list all properties of a character
  * [UnicodeSet](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp) - list all characters that have a given property
  * [Property Aliases](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
  * [Emojis](https://www.unicode.org/Public/emoji/13.1/emoji-test.txt)
    {% endtab %}

{% tab title="📗" %}

* [ ] 📗 [JS.info - flag "/u" & class \p{...}](https://javascript.info/regexp-unicode)
* [x] 📗 [JS.info - multi-language "\w"](https://javascript.info/regexp-character-sets-and-ranges#example-multi-language-w)
* [x] 📗 [Eloquent JS - International Characters](https://eloquentjavascript.net/09_regexp.html#h_+y54//b0l+)
* [ ] ExploringJS ⟩ [Atoms of Text - code points, JS characters, grapheme clusters](https://exploringjs.com/impatient-js/ch_strings.html#atoms-of-text)
  {% endtab %}

{% tab title="📘" %}

* [JavaScript Guide](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide) ⟩ [Regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions) ⟩ [Unicode property escapes](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Cheatsheet#unicode_property_escapes)
  {% endtab %}

{% tab title="🗣" %}

* [Javascript unicode string, chinese character but no punctuation](https://stackoverflow.com/questions/21109011/javascript-unicode-string-chinese-character-but-no-punctuation)
  {% endtab %}
  {% endtabs %}

{% hint style="info" %}
• **Script** (a writing system) = Cyrillic, Greek, Arabic, **Han** (Chinese) ... 👉 [full list](https://en.wikipedia.org/wiki/Script_\(Unicode\))
{% endhint %}

{% tabs %}
{% tab title="JS" %}

```javascript
const {log} = console;
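// Inline stand-in for the local './Converter.js' helper, so the snippet runs on its own.
const convert = {
    // returns the UTF-16 code units of a string as an array of numbers
    stringToCodeUnits: (s) => Array.from({ length: s.length }, (_, i) => s.charCodeAt(i)),
};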

[
    convert.stringToCodeUnits("🍎"),    // [ 55356=a, 57166=b ]

    /🍎{3}/.test("🍎🍎🍎"),              // false❗️
    // assume 🍎 = ab (2 code units)
    // 🍎{3} = ab{3} = abbb ≠ ababab = 🍎🍎🍎

    convert.stringToCodeUnits("🌹"),    // [ 55356=a, 57145=c ]

    /<.>/.test("<🌹>"),                 // false❗️
    // assume 🌹 = ac (2 code units)
    // "." matches a single code unit, so <.> ≠ <ac>

    /<.>/u.test("<🌹>"),                // true ⭐️
    // ✅ enable /u flag
    

    // find Han characters (漢字)
    `Hello Привет 你好`.match(/\p{sc=Han}/gu),  // [ '你', '好' ]

    // Script
    /\p{Script=Greek}/u.test("α"),      // → true
    /\p{Script=Arabic}/u.test("α"),     // → false

    // Alphabetic
    /\p{Alphabetic}/u.test("α"),        // → true
    /\p{Alphabetic}/u.test("!"),        // → false
    /\p{Alphabetic}/u.test("漢"),       // → true

].forEach(x => log(x));
```

{% endtab %}

{% tab title="JS" %}

```javascript
const {log} = console;

/*
    - Alpha  (Alphabetic)           : letters
    - M      (Mark)                 : accents
    - Nd     (Decimal_Number)       : digits
    - Pc     (Connector_Punctuation): underscore '_' and similar characters
    - Po     (Other_Punctuation)    : punctuation not covered by the other P subcategories, e.g. '.', '、', '，'
    - Join_C (Join_Control)         : (200c, 200d) used in ligatures, e.g. in Arabic.
*/

// multi-language "word" character (like "\w", but in a Unicode sense)
let char = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]/gu;
let Po = /[\p{Po}]/gu;
let P = /[\p{P}]/gu;

let str = `Hi、你好，(+US$12.34-)。`;

[
    str.match(char),        // ["H","i","你","好","U","S","1","2","3","4"]
    str.match(Po),          // ["、","，",".","。"] ⭐️
    str.match(P),           // ["、","，","(",".","-",")","。"]
    
].forEach(x => log(x));
```

{% endtab %}

{% tab title="properties" %}

<table><thead><tr><th width="188.16014234875445">property</th><th width="162">alias</th><th>meaning</th></tr></thead><tbody><tr><td>\p{Script=Han}</td><td>\p{sc=Han}</td><td>Han characters (漢字)</td></tr><tr><td>\p{Letter}</td><td>\p{L}</td><td>a letter in <strong>any language</strong>.</td></tr><tr><td>\p{Number}</td><td>\p{N}</td><td>any numeric character (decimal digits are \p{Nd})</td></tr><tr><td>\p{Other_Punctuation}</td><td>\p{Po}</td><td><p>punctuation (other)</p><p>“，” belongs to this category ⭐️</p></td></tr></tbody></table>
{% endtab %}

{% tab title="📗 References" %}
💾 [international characters](https://replit.com/@pegasusroe/international-characters#index.js)

📗 [🍎 - EmojiAll](https://www.emojiall.com/zh-hant/emoji/%F0%9F%8D%8E)
{% endtab %}

{% tab title="💾 Code" %}

* codepen ⟩ [regex: Unicode](https://codepen.io/lochiwei/pen/OJjNOPo?editors=0012)
  {% endtab %}
  {% endtabs %}

## General categories and subcategories

* Letter `L`:
  * lowercase `Ll`,
  * modifier `Lm`,
  * titlecase `Lt`,
  * uppercase `Lu`,
  * other `Lo`.
* Number `N`:
  * decimal digit `Nd`,
  * letter number `Nl`,
  * other `No`.
* Punctuation `P`:
  * connector `Pc`,
  * dash `Pd`,
  * initial quote `Pi`,
  * final quote `Pf`,
  * open `Ps`,
  * close `Pe`,
  * other `Po`.
* Mark `M` (accents etc.):
  * spacing combining `Mc`,
  * enclosing `Me`,
  * non-spacing `Mn`.
* Symbol `S`:
  * currency `Sc`,
  * modifier `Sk`,
  * math `Sm`,
  * other `So`.
* Separator `Z`:
  * line `Zl`,
  * paragraph `Zp`,
  * space `Zs`.
* Other `C`:
  * control `Cc`,
  * format `Cf`,
  * not assigned `Cn`,
  * private use `Co`,
  * surrogate `Cs`.
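
As a rough sketch of how the two-letter codes above can be used, the following looks up the subcategory of a single character by trying each `\p{...}` escape in turn. The names `GENERAL_CATEGORIES` and `generalCategory` are made up for this example, and the function assumes its argument is exactly one code point.

```javascript
// All subcategory codes from the list above.
const GENERAL_CATEGORIES = [
    "Lu", "Ll", "Lt", "Lm", "Lo",
    "Nd", "Nl", "No",
    "Pc", "Pd", "Pi", "Pf", "Ps", "Pe", "Po",
    "Mc", "Me", "Mn",
    "Sc", "Sk", "Sm", "So",
    "Zl", "Zp", "Zs",
    "Cc", "Cf", "Cn", "Co", "Cs",
];

// Hypothetical helper: returns the first subcategory whose \p{..} escape matches `ch`.
// The "u" flag is required, otherwise \p{..} is not a property escape.
function generalCategory(ch) {
    return GENERAL_CATEGORIES.find(
        (gc) => new RegExp(`^\\p{${gc}}$`, "u").test(ch)
    );
}

["A", "a", "漢", "1", "_", "，", "$", "+", " ", "🍎"].forEach((ch) => {
    const sub = generalCategory(ch);
    // The first letter of a subcategory is its main category, e.g. "Lu" belongs to "L".
    console.log(JSON.stringify(ch), sub, sub[0]);
});
```

Building one regular expression per lookup is slow but keeps the sketch short. A main category such as `\p{L}` matches any of its subcategories, so `/\p{L}/u` can be used directly when the finer distinction is not needed.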
