# Unicode


{% hint style="success" %}

* every <mark style="color:purple;">**Unicode**</mark> <mark style="color:yellow;">**character**</mark> is assigned a [**code point**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/code-point).&#x20;
* [**code points**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/code-point) are divided into <mark style="color:yellow;">**17**</mark> [**code planes**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/code-plane).
* one or more [**code points**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/code-point) can be <mark style="color:yellow;">**combined**</mark> into a single [**grapheme cluster**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/grapheme-cluster).
* [**character encoding**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/encode) transforms [**code points**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/code-point) into [**code units**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/encode/utf16/code-unit).&#x20;
* <mark style="color:yellow;">**JavaScript**</mark> strings are sequences of [<mark style="color:yellow;">**UTF-16**</mark>](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/encode/utf16) code units, so `length` and indexing count code units, not characters.&#x20;
  {% endhint %}
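
The distinctions above can be sketched in a few lines. This is a minimal illustration, not part of the original notes: `toSurrogatePair` is a hypothetical helper name, and the grapheme count assumes a runtime that provides `Intl.Segmenter`.

```javascript
// 🍎 is U+1F34E, which lies outside plane 0 (the Basic Multilingual Plane).
const apple = "🍎";

// Code point: one number per Unicode character.
const codePoint = apple.codePointAt(0);   // 0x1F34E
const plane = codePoint >> 16;            // code plane index (0..16) → 1

// Code units: what UTF-16 stores; `length` counts these, not characters.
const codeUnits = apple.split("").map(c => c.charCodeAt(0));  // [55356, 57166]

// Hypothetical helper: UTF-16 encoding of a code point above U+FFFF
// into a surrogate pair (high surrogate, low surrogate).
function toSurrogatePair(cp) {
    const offset = cp - 0x10000;
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// Grapheme cluster: "e" + combining acute accent (U+0301) displays as one "é".
const eAcute = "e\u0301";
const codePointCount = [...eAcute].length;   // 2 code points
// Assumes Intl.Segmenter is available in the runtime.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemeCount = [...segmenter.segment(eAcute)].length;  // 1 grapheme cluster
```

Here `toSurrogatePair(codePoint)` yields the same two numbers as `codeUnits`, which is why `"🍎".length` is `2`.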

{% tabs %}
{% tab title="🗺️" %} <img src="https://2527454625-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MfvEFZnSBhKT6fJmus0%2Fuploads%2FeAPkVT4OexgWTMyevBZY%2Fcode.point.svg?alt=media&#x26;token=c612329d-0a48-4cfc-9949-764991d69905" alt="Unicode" class="gitbook-drawing">
{% endtab %}

{% tab title="🧨" %}
{% hint style="danger" %}

* By default, [**regular expressions**](https://lochiwei.gitbook.io/web/js/val/builtin/regex) work on [**code units**](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/encode/utf16/code-unit), <mark style="color:red;">**not**</mark>**&#x20;**<mark style="color:yellow;">**actual characters**</mark>:exclamation:
* Characters encoded as **two code units** (a surrogate pair, e.g. most **emojis**) are treated as two separate units, so `.` and quantifiers apply to only half of the character.
* Enable the "<mark style="color:yellow;">`/u`</mark>" [**flag**](https://lochiwei.gitbook.io/web/js/val/builtin/regex/flag) to make [**regular expressions**](https://lochiwei.gitbook.io/web/js/val/builtin/regex) work on <mark style="color:purple;">**Unicode**</mark> code points.
  {% endhint %}
  {% endtab %}

{% tab title="🔴" %}

* [code-unit](https://lochiwei.gitbook.io/web/js/val/prim/str/unicode/encode/utf16/code-unit "mention")
* [str.slice2](https://lochiwei.gitbook.io/web/js/val/prim/str/method/str.slice2 "mention") - returns a substring (supports Unicode).
  {% endtab %}

{% tab title="👥 Related" %}

* [ext](https://lochiwei.gitbook.io/web/js/val/prim/str/ext "mention") - str.codeUnits, str.codePoints ...
  {% endtab %}

{% tab title="🛠" %}

* Unicode Utilities:&#x20;
  * [Character Properties](https://util.unicode.org/UnicodeJsps/character.jsp) - list all properties by a character
  * [UnicodeSet](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp) - list all characters by a property
  * [Property Aliases](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
  * [Emojis](https://www.unicode.org/Public/emoji/13.1/emoji-test.txt)
    {% endtab %}

{% tab title="📗" %}

* [ ] 📗 [JS.info - flag "/u" & class \p{...}](https://javascript.info/regexp-unicode)
* [x] 📗 [JS.info - multi-language "\w"](https://javascript.info/regexp-character-sets-and-ranges#example-multi-language-w)
* [x] 📗 [Eloquent JS - International Characters](https://eloquentjavascript.net/09_regexp.html#h_+y54//b0l+)
* [ ] ExploringJS ⟩ [Atoms of Text - code points, JS characters, grapheme clusters](https://exploringjs.com/impatient-js/ch_strings.html#atoms-of-text)
  {% endtab %}

{% tab title="📘" %}

* [JavaScript Guide](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide) ⟩ [Regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions) ⟩ [Unicode property escapes](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Cheatsheet#unicode_property_escapes)
  {% endtab %}

{% tab title="🗣" %}

* [Javascript unicode string, chinese character but no punctuation](https://stackoverflow.com/questions/21109011/javascript-unicode-string-chinese-character-but-no-punctuation)
  {% endtab %}
  {% endtabs %}

{% hint style="info" %}
•  **Script** (a writing system): Cyrillic, Greek, Arabic, **Han** (Chinese), ...  👉 [full list](https://en.wikipedia.org/wiki/Script_\(Unicode\))
{% endhint %}
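
As a rough sketch of how the Script property can be used, the snippet below reports which script a character belongs to. The helper name `scriptOf` and the short candidate list are illustrative assumptions, not part of the original notes; a real tool would cover many more scripts.

```javascript
// Candidate scripts to test; this short list is an illustrative assumption.
const scripts = ["Latin", "Cyrillic", "Greek", "Arabic", "Han"];

// One /u regex per script, built from its \p{Script=...} property escape.
const scriptRegexes = scripts.map(
    name => [name, new RegExp(`\\p{Script=${name}}`, "u")]
);

// Hypothetical helper: name of the first candidate script that matches `char`,
// or undefined if none does (e.g. spaces and punctuation are Script=Common).
function scriptOf(char) {
    const hit = scriptRegexes.find(([, re]) => re.test(char));
    return hit ? hit[0] : undefined;
}

console.log(scriptOf("П"));   // Cyrillic
console.log(scriptOf("你"));  // Han
console.log(scriptOf("α"));   // Greek
```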

{% tabs %}
{% tab title="JS" %}

```javascript
const {log} = console;

// Stand-in for `stringToCodeUnits` from the author's './Converter.js' (not shown here):
// returns the UTF-16 code units of a string.
const stringToCodeUnits = str => str.split('').map(c => c.charCodeAt(0));

[
    stringToCodeUnits("🍎"),            // [ 55356, 57166 ] (call them a, b)

    /🍎{3}/.test("🍎🍎🍎"),              // false❗️
    // without /u, 🍎 = ab (2 code units)
    // 🍎{3} = ab{3} = abbb ≠ ababab = 🍎🍎🍎

    stringToCodeUnits("🌹"),            // [ 55356, 57145 ] (call them a, c)

    /<.>/.test("<🌹>"),                 // false❗️
    // without /u, 🌹 = ac (2 code units)
    // "." matches a single code unit, so <.> ≠ <ac>

    /<.>/u.test("<🌹>"),                // true ⭐️
    // ✅ with the /u flag, "." matches a whole code point

    // search for Han characters
    `Hello Привет 你好`.match(/\p{sc=Han}/gu),  // [ '你', '好' ]

    // Script
    /\p{Script=Greek}/u.test("α"),      // → true
    /\p{Script=Arabic}/u.test("α"),     // → false

    // Alphabetic
    /\p{Alphabetic}/u.test("α"),        // → true
    /\p{Alphabetic}/u.test("!"),        // → false
    /\p{Alphabetic}/u.test("漢"),       // → true

].forEach(x => log(x));
```

{% endtab %}

{% tab title="JS" %}

```javascript
const {log} = console;

/*
    - Alpha  (Alphabetic)           : letters
    - M      (Mark)                 : accents and other combining marks
    - Nd     (Decimal_Number)       : digits
    - Pc     (Connector_Punctuation): underscore '_' and similar characters
    - Po     (Other_Punctuation)    : punctuation not in another P subcategory, e.g. ',' '.' '!' '?'
    - Join_C (Join_Control)         : U+200C and U+200D, used in ligatures, e.g. in Arabic
*/

// multi-language "word" character (like "\w", but in a Unicode sense)
let char = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]/gu;
let Po = /[\p{Po}]/gu;
let P = /[\p{P}]/gu;

let str = `Hi、你好，(+US$12.34-)。`;

[
    str.match(char),        // ["H","i","你","好","U","S","1","2","3","4"]
    str.match(Po),          // ["、","，",".","。"] ⭐️
    str.match(P),           // ["、","，","(",".","-",")","。"]
    
].forEach(x => log(x));
```

{% endtab %}

{% tab title="properties" %}

<table><thead><tr><th width="188.16014234875445">property</th><th width="162">alias</th><th>meaning</th></tr></thead><tbody><tr><td>\p{Script=Han}</td><td>\p{sc=Han}</td><td>Han (Chinese) characters</td></tr><tr><td>\p{Letter}</td><td>\p{L}</td><td>a letter in <strong>any language</strong>.</td></tr><tr><td>\p{Number}</td><td>\p{N}</td><td>a numeric character (decimal digits, letter numbers, other numbers)</td></tr><tr><td>\p{Other_Punctuation}</td><td>\p{Po}</td><td><p>punctuation (other)</p><p>“，” belongs to this category ⭐️</p></td></tr></tbody></table>
{% endtab %}

{% tab title="📗 References" %}
💾 [international characters](https://replit.com/@pegasusroe/international-characters#index.js)

📗 [🍎 - EmojiAll](https://www.emojiall.com/zh-hant/emoji/%F0%9F%8D%8E)
{% endtab %}

{% tab title="💾 Code" %}

* codepen ⟩ [regex: Unicode](https://codepen.io/lochiwei/pen/OJjNOPo?editors=0012)
  {% endtab %}
  {% endtabs %}

## main categories and subcategories

* Letter `L`:
  * lowercase `Ll`,
  * modifier `Lm`,
  * titlecase `Lt`,
  * uppercase `Lu`,
  * other `Lo`.
* Number `N`:
  * decimal digit `Nd`,
  * letter number `Nl`,
  * other `No`.
* Punctuation `P`:
  * connector `Pc`,
  * dash `Pd`,
  * initial quote `Pi`,
  * final quote `Pf`,
  * open `Ps`,
  * close `Pe`,
  * other `Po`.
* Mark `M` (accents etc):
  * spacing combining `Mc`,
  * enclosing `Me`,
  * non-spacing `Mn`.
* Symbol `S`:
  * currency `Sc`,
  * modifier `Sk`,
  * math `Sm`,
  * other `So`.
* Separator `Z`:
  * line `Zl`,
  * paragraph `Zp`,
  * space `Zs`.
* Other `C`:
  * control `Cc`,
  * format `Cf`,
  * not assigned `Cn`,
  * private use `Co`,
  * surrogate `Cs`.
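
The subcategories listed above can be probed with `\p{gc=...}` property escapes. The following is a minimal sketch rather than an established utility: `generalCategory` is a hypothetical helper that simply tries each two-letter code in turn.

```javascript
// Two-letter General_Category codes from the list above.
const categories = [
    "Lu", "Ll", "Lt", "Lm", "Lo",
    "Nd", "Nl", "No",
    "Pc", "Pd", "Pi", "Pf", "Ps", "Pe", "Po",
    "Mc", "Me", "Mn",
    "Sc", "Sk", "Sm", "So",
    "Zl", "Zp", "Zs",
    "Cc", "Cf", "Co", "Cs", "Cn",
];

// Hypothetical helper: the first subcategory whose \p{gc=...} escape
// matches the whole (single code point) string.
function generalCategory(char) {
    return categories.find(
        gc => new RegExp(`^\\p{gc=${gc}}$`, "u").test(char)
    );
}

console.log(generalCategory("A"));   // Lu
console.log(generalCategory("5"));   // Nd
console.log(generalCategory("_"));   // Pc
console.log(generalCategory("，"));  // Po
console.log(generalCategory("$"));   // Sc
```

Building a regex per lookup keeps the sketch short; precompiling the expressions once would be the more efficient choice for repeated use.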

