[ JavaScript ] Extracting Initial Consonants from Hangul (the Korean alphabet)

hwahyeon

In this post, we'll explore how to extract the initial consonants from Hangul (Korean alphabet) characters using JavaScript.

Extracting initial consonants is useful in several situations. For example, it lets users search by typing only the initial consonants, and it can be used to build games such as Hangul initial-consonant quizzes.

1. Understanding Hangul's Structure

Hangul syllables are composed of an initial consonant (초성, choseong), a vowel (중성, jungseong), and an optional final consonant (종성, jongseong). There are 19 initial consonants, 21 vowels, and 27 final consonants. Counting "no final consonant" as its own case gives 28 possibilities for the final position.

For instance, the character "가" consists of the initial consonant "ㄱ" and the vowel "ㅏ". Similarly, the character "강" consists of the initial consonant "ㄱ", the vowel "ㅏ", and the final consonant "ㅇ".
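The counts above can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming the "no final consonant" case is counted as one of 28 final-consonant slots; the constant names are illustrative and not part of any library. It compares the number of possible syllable blocks with the size of the Unicode range U+AC00 to U+D7A3, which holds the precomposed Hangul syllables.

```javascript
// Illustrative constants (names are assumptions, not from the original post).
const CHOSEONG_COUNT = 19; // initial consonants
const JUNGSEONG_COUNT = 21; // vowels
const JONGSEONG_SLOTS = 28; // 27 final consonants + 1 slot for "none"

// Every combination of the three positions is one syllable block.
const totalSyllables = CHOSEONG_COUNT * JUNGSEONG_COUNT * JONGSEONG_SLOTS;

// Size of the precomposed Hangul syllable range, both ends inclusive.
const rangeSize = 0xd7a3 - 0xac00 + 1;

console.log(totalSyllables); // 11172
console.log(rangeSize); // 11172
```

The two numbers agree, which is why a character's position inside the range is enough to recover its components.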

Initial consonants include basic consonants such as ㄱ, ㄴ, ㄷ, ㄹ, ㅁ, ㅂ, ㅅ, ㅇ, ㅈ, ㅊ, ㅋ, ㅌ, ㅍ, ㅎ, and doubled consonants like ㄲ, ㄸ, ㅃ, ㅆ, ㅉ. A doubled consonant is written as a basic consonant repeated twice and represents a stronger (tense) pronunciation.

In addition to the initial consonants, Hangul also has compound consonants that are used only as final consonants: ㄳ, ㄵ, ㄶ, ㄺ, ㄻ, ㄼ, ㄽ, ㄾ, ㄿ, ㅀ, ㅄ. Each one combines two basic consonants and sits at the end of a syllable block. These compound consonants are strictly 받침 (jongseong): they fill the final-consonant slot of a syllable and never appear as an initial consonant (초성, choseong).

This distinction matters for text processing, especially when handling user input such as search queries, because users may end up entering a compound consonant where an initial consonant is expected. Applying this rule is necessary to extract and process Hangul characters accurately.
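As a rough illustration of the input problem described above, the sketch below detects compound final consonants that appear as standalone characters in a query. It assumes the standalone consonants are Hangul Compatibility Jamo characters, and the names COMPOUND_FINALS and hasCompoundFinal are hypothetical.

```javascript
// Hypothetical helper: reports whether a query contains a compound final
// consonant typed as a standalone character (assumed to be a Hangul
// Compatibility Jamo character).
const COMPOUND_FINALS = /[ㄳㄵㄶㄺㄻㄼㄽㄾㄿㅀㅄ]/;

function hasCompoundFinal(query) {
  return COMPOUND_FINALS.test(query);
}

console.log(hasCompoundFinal("ㄳ")); // true
console.log(hasCompoundFinal("ㄱㅅ")); // false
// A full syllable that contains ㅄ as its final consonant is a different
// character, so it is not flagged.
console.log(hasCompoundFinal("값")); // false
```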

2. Extracting Initial Consonants with JavaScript

Below is an example implementation of the chosung function in JavaScript, which extracts and returns the initial consonants from a given Hangul string.

function chosung(input) {
  const LIST_CHOSUNG = [ // the 19 initial consonants, in Unicode order
    "ㄱ", "ㄲ", "ㄴ", "ㄷ",
    "ㄸ", "ㄹ", "ㅁ", "ㅂ",
    "ㅃ", "ㅅ", "ㅆ", "ㅇ",
    "ㅈ", "ㅉ", "ㅊ", "ㅋ",
    "ㅌ", "ㅍ", "ㅎ",
  ];

  let res = ""; // accumulated output

  for (let i = 0; i < input.length; i++) {
    const currentChar = input.charCodeAt(i);

    if (currentChar >= 0xac00 && currentChar <= 0xd7a3) {
      const idx = Math.floor((currentChar - 0xac00) / (21 * 28));
      res += LIST_CHOSUNG[idx];
    } else {
      res += input[i]; // not a Hangul syllable: keep the character as-is
    }
  }
  return res;
}

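Line 13: const currentChar = input.charCodeAt(i);

The String.prototype.charCodeAt() method returns the UTF-16 code unit (a number from 0 to 65535) of the character at a given index. For characters in the Basic Multilingual Plane, which includes every precomposed Hangul syllable, this value is the same as the character's Unicode code point.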

For a detailed discussion on Unicode for Hangul, you can refer to this post.

It's important to understand that every character, including each Hangul syllable, has a unique numerical identifier. For the characters used here, charCodeAt() returns that identifier as a number, which console.log prints in decimal.

const text = "abc가각나";

console.log(text.charCodeAt(0)); // 97
console.log(text.charCodeAt(1)); // 98
console.log(text.charCodeAt(2)); // 99
console.log(text.charCodeAt(3)); // 44032
console.log(text.charCodeAt(4)); // 44033
console.log(text.charCodeAt(5)); // 45208

In the code above, we look at the Unicode code points for 'a', 'b', 'c', and the Korean characters '가', '각', '나' in decimal form. 'a' has a code point of 97, and 'b' follows sequentially with 98. Similarly, for Hangul, '가' is at 44032, and the next character, '각', is at 44033. Each character is assigned a unique number, and the Hangul syllables are laid out consecutively in a fixed order.
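The values above are printed in decimal, while the range checks in this post are written in hexadecimal. The following sketch prints the same kind of values in hex and contrasts charCodeAt() with codePointAt(). The helper name toHex is an assumption made for this example.

```javascript
// Format a number as an uppercase hexadecimal literal, e.g. 44032 -> "0xAC00".
const toHex = (n) => "0x" + n.toString(16).toUpperCase();

console.log(toHex("가".charCodeAt(0))); // 0xAC00
console.log(toHex("힣".charCodeAt(0))); // 0xD7A3

// For Hangul syllables, the code unit and the code point are the same number.
console.log("가".charCodeAt(0) === "가".codePointAt(0)); // true

// Outside the Basic Multilingual Plane the two differ:
// charCodeAt() returns only the first half of a surrogate pair.
const emoji = "\u{1F600}";
console.log(toHex(emoji.charCodeAt(0))); // 0xD83D
console.log(toHex(emoji.codePointAt(0))); // 0x1F600
```

Because every precomposed Hangul syllable fits in a single UTF-16 code unit, charCodeAt() is sufficient for the extraction logic in this post.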

Line 15: if (currentChar >= 0xac00 && currentChar <= 0xd7a3) {

In hexadecimal, the Unicode code points for precomposed Hangul syllables range from 0xAC00 to 0xD7A3, covering the characters from '가' to '힣'. This line checks whether the current character falls within that range. Standalone consonants such as 'ㄱ' lie outside the range and are left unchanged.
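The range check can also be pulled out into a small helper. This is only a sketch: the names HANGUL_START, HANGUL_END, and isHangulSyllable are hypothetical and do not appear in the original function.

```javascript
// First and last code points of the precomposed Hangul syllable range.
const HANGUL_START = 0xac00; // '가'
const HANGUL_END = 0xd7a3; // '힣'

// Returns true when the first character of `ch` is a precomposed syllable.
function isHangulSyllable(ch) {
  const code = ch.charCodeAt(0); // NaN for an empty string, so the check fails
  return code >= HANGUL_START && code <= HANGUL_END;
}

console.log(isHangulSyllable("가")); // true
console.log(isHangulSyllable("힣")); // true
console.log(isHangulSyllable("a")); // false
console.log(isHangulSyllable("ㄱ")); // false (a standalone consonant)
```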

Line 16: const idx = Math.floor((currentChar - 0xac00) / (21 * 28));

currentChar - 0xac00: The Hangul syllable blocks are arranged sequentially in Unicode, starting from '가' at code point 0xAC00. The variable currentChar holds the code of the character currently being processed. Subtracting 0xAC00 from currentChar gives the offset of the character from the start of the Hangul syllable block.

Division by (21 * 28): Hangul syllables are systematically constructed from 19 초성 (initial consonants), 21 중성 (vowels), and 28 종성 slots (27 final consonants plus the case of no 종성). Since each 초성 can be combined with all 21 중성, and each of those combinations with all 28 종성 slots, there are 588 (21 * 28) syllables for each 초성. Dividing the offset by 588 and rounding down with Math.floor counts how many full sets of 588 have been passed, which is the 초성 index of the current character.
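The same arithmetic extends to the other two positions. The sketch below is a hypothetical helper (decompose is not part of the original function) that returns all three indices for a syllable, assuming the 588 and 28 group sizes described above.

```javascript
// Hypothetical helper: split a precomposed syllable into its three indices.
// Returns null when the character is outside the Hangul syllable range.
function decompose(ch) {
  const offset = ch.charCodeAt(0) - 0xac00;
  if (!(offset >= 0 && offset <= 0xd7a3 - 0xac00)) return null;

  return {
    choseong: Math.floor(offset / (21 * 28)), // which group of 588
    jungseong: Math.floor((offset % (21 * 28)) / 28), // which group of 28 inside it
    jongseong: offset % 28, // position inside the group of 28 (0 = no final)
  };
}

console.log(decompose("가")); // { choseong: 0, jungseong: 0, jongseong: 0 }
console.log(decompose("각")); // { choseong: 0, jungseong: 0, jongseong: 1 }
console.log(decompose("나")); // { choseong: 2, jungseong: 0, jongseong: 0 }
console.log(decompose("a")); // null
```

For '나', the offset is 45208 - 44032 = 1176, which is exactly 2 * 588, so the 초성 index is 2 (ㄴ in the list used by chosung).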

function chosungImprovement(input) {
  const LIST_CHOSUNG = [ // the 19 initial consonants, in Unicode order
    "ㄱ", "ㄲ", "ㄴ", "ㄷ",
    "ㄸ", "ㄹ", "ㅁ", "ㅂ",
    "ㅃ", "ㅅ", "ㅆ", "ㅇ",
    "ㅈ", "ㅉ", "ㅊ", "ㅋ",
    "ㅌ", "ㅍ", "ㅎ",
  ];
  const MAP_DOUBLE_CONSONANT = { // compound final consonant -> its two parts
    ㄳ: "ㄱㅅ",
    ㄵ: "ㄴㅈ",
    ㄶ: "ㄴㅎ",
    ㄺ: "ㄹㄱ",
    ㄻ: "ㄹㅁ",
    ㄼ: "ㄹㅂ",
    ㄽ: "ㄹㅅ",
    ㄾ: "ㄹㅌ",
    ㄿ: "ㄹㅍ",
    ㅀ: "ㄹㅎ",
    ㅄ: "ㅂㅅ",
  };

  let res = ""; // accumulated output

  for (let i = 0; i < input.length; i++) {
    const currentChar = input.charCodeAt(i);

    if (currentChar >= 0xac00 && currentChar <= 0xd7a3) {
      const idx = Math.floor((currentChar - 0xac00) / (21 * 28));
      res += LIST_CHOSUNG[idx];
    } else {
      res += input[i]; // not a Hangul syllable: keep the character as-is
    }
  }

  Object.keys(MAP_DOUBLE_CONSONANT).forEach((doubleConsonant) => {
    const splitConsonants = MAP_DOUBLE_CONSONANT[doubleConsonant];
    while (res.includes(doubleConsonant)) { // replace every occurrence
      res = res.replace(doubleConsonant, splitConsonants);
    }
  });

  return res;
}

Line 36: Object.keys(MAP_DOUBLE_CONSONANT).forEach(...)

This loop keeps compound final consonants, known as 겹받침, from appearing in the result as if they were initial consonants. A 종성 can be either a single consonant (단자음) or a compound consonant (겹받침). However, users might enter a 겹받침 where a 초성 is expected, for example when typing a search query. The code handles this by decomposing each 겹받침 into its constituent single consonants (단자음), so that ㄳ becomes ㄱㅅ.
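To tie this back to the search use case mentioned at the start, here is a minimal sketch of initial-consonant matching. The helper names toInitials and matchesInitials are hypothetical, and the approach (a Map lookup per character instead of repeated replace calls) is a variation on the function above, not the original implementation.

```javascript
// The 19 initial consonants in Unicode order, indexed by choseong index.
const CHOSUNG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";

// Compound final consonants mapped to their two single consonants.
const COMPOUND = new Map([
  ["ㄳ", "ㄱㅅ"], ["ㄵ", "ㄴㅈ"], ["ㄶ", "ㄴㅎ"], ["ㄺ", "ㄹㄱ"],
  ["ㄻ", "ㄹㅁ"], ["ㄼ", "ㄹㅂ"], ["ㄽ", "ㄹㅅ"], ["ㄾ", "ㄹㅌ"],
  ["ㄿ", "ㄹㅍ"], ["ㅀ", "ㄹㅎ"], ["ㅄ", "ㅂㅅ"],
]);

// Hypothetical helper: reduce text to initial consonants in a single pass.
function toInitials(text) {
  let out = "";
  for (const ch of text) {
    const code = ch.charCodeAt(0);
    if (code >= 0xac00 && code <= 0xd7a3) {
      // Hangul syllable: keep only its initial consonant.
      out += CHOSUNG[Math.floor((code - 0xac00) / (21 * 28))];
    } else {
      // Split a standalone compound consonant; keep anything else unchanged.
      out += COMPOUND.get(ch) ?? ch;
    }
  }
  return out;
}

// Hypothetical helper: true when the query's initials occur in the candidate.
function matchesInitials(candidate, query) {
  return toInitials(candidate).includes(toInitials(query));
}

console.log(toInitials("한글")); // ㅎㄱ
console.log(matchesInitials("대한민국", "ㄷㅎ")); // true
console.log(matchesInitials("가사", "ㄳ")); // true
console.log(matchesInitials("대한민국", "ㅎㄱ")); // false
```

Normalizing both the candidate and the query with the same function means a query typed as ㄳ is compared as ㄱㅅ, which is the behavior the decomposition step is meant to provide.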