An Analysis of Josa and Eomi in Translating Korean TV Dramas Into English With Artificial Intelligence
Abstract
The goal of this study is to find out why the English subtitles of Korean TV dramas contain frequent errors. The findings are expected to shed light on ways to improve machine translation technology for agglutinative languages. As a first step, Korean-English subtitles were grammatically tagged by part of speech (POS) to find out which POS has the most frequent errors in each language. The English subtitles were analyzed and categorized into thirty-one groups by POS tagging. Then, for the Korean language, the Kokoma Korean morpheme analyzer was run to tag the Korean script by category (noun, verb, adjective, etc.), which yielded forty-five groups. This categorization included nine subsets of josa (postposition) and fourteen of eomi (ending), which are the parts of Korean most difficult to translate into English because of differences in linguistic structure. As a next step, Korean-American bilinguals scored the subtitles on a scale from most correct to least correct. The results show that the most frequent josa error among the nine groups is JX (auxiliary particle), whereas for eomi the frequent error is EPT (tense prefinal ending).
I. INTRODUCTION
Korean TV dramas have spread widely around the world over the past dozen years. As they have spread, subtitles have come to play a crucial role in helping viewers understand them properly. Subtitling in Audiovisual Translation (AVT) has received significant attention for its complicated theories and practices (Basari & Nugroho, 2017). The technical aspect of subtitling is also considered important. In particular, machine translation technology has developed rapidly in the initial stage of the Fourth Industrial Revolution, which is characterized by technological breakthroughs in areas such as artificial intelligence and the Internet of Things. Consequently, statistical approaches to the similarity between machine translation and human translation have been widely investigated. Ali (2016) mentions, “machine translation may play a pivotal role in helping language experts in their daily work in general and in aiding non-professionals to understand and create text in target languages in particular” (p. 55). Nevertheless, the errors of machine translation (MT) are still debated even though the technology has been developed over a number of years. Koh (2020) states, “MT is not only related to mathematics, computer science, and linguistics but also various syntactic and semantic levels of languages have caused its error problems. MT still lacks in recognizing the proper synonym, collocation or word meaning” (p. 158). Spoken language in particular remains a challenging area that is still under development.
The most prominent reason for MT errors is the difference between the first language and the second language. English and Korean differ greatly in cultural and linguistic characteristics. Korean L2 learners are affected by their mother tongue in both their thought processes and their production of English. Korean and English have different surface structures: Korean has SOV word order whereas English has SVO word order, and this causes interlingual errors in basic word order. These errors arise when learners incorrectly apply the basic sentence order of their L1 to the L2.
Besides being an SOV and postpositional language, Korean is an agglutinative language with rich morphological features, like Japanese, Finnish, Hungarian, Mongolian, and Turkish. Turkish, for example, is an agglutinating language that builds words by gluing morphemes together. Dincer, Karaoglan, and Kisla (2008) remark, “the number of minor part of speech categories for Turkish are much richer than any analytical language likes English” (p. 683). Agglutinating languages, which tend to be highly inflected, suffer from the sparsity problem, which makes POS tagging difficult in natural language processing (Can, Ustün, & Kurfalı, 2016). Turkish has six cases: nominative, accusative, genitive, dative, ablative, and locative. Studies on the L1 acquisition of Turkish show that all types of case marking have a high error rate except for nominative and ablative case markings (Gonulal, Spinner, & Winke, 2016). The case system is the hardest aspect of Turkish morphology for English-speaking learners.
Whereas SVO languages such as English add the semantic content of particles to nouns and verb stems, agglutinative languages attach chains of particles as complex suffixes, which produces words made of many morphemes, word-form sparsity, and variable word order. Korean function words are divided into two categories, josa and eomi, which are well defined and frequently used (Hong, Koo, & Yang, 1996). Kim, Chae, Snyder, and Kim (2014) mention, “Josa is used to define nominal cases and modify other phrases, while eomi is an ending of a verb or an adjective to define a tense, show an attitude, and connect or terminate a sentence” (p. 637). Consequently, English and Korean have different morphological and syntactic structures. The examples are as follows (Figure 1):
Kim et al. (2014) state that agglutinative languages are computationally difficult due to word-form sparsity and variable word order. Consequently, it is also difficult to translate accurately between Korean and English because of the large differences in linguistic structure between them.
In this paper, Korean-English subtitles are grammatically tagged by part of speech (POS) to find out which POS has the most frequent errors in each language. POS tagging marks each morpheme according to its meaning in context and yields a list of tagged morphemes. First, the English subtitles are parsed by a part-of-speech tagger (POS tagger), software that reads English text and assigns a part of speech, such as noun, verb, or adjective, to each word. The tool, developed at Stanford University, is called the Stanford Log-Linear Part-Of-Speech Tagger. It is used to analyze the subtitles and categorize them into thirty-one groups by part of speech. Then, the Kokoma Korean morpheme analyzer is run to tag the Korean script as noun, verb, adjective, etc., which yields forty-five groups. As a second step, the subtitles are scored by Korean-American bilinguals. The scores range from most correct (5 points) to least correct (1 point). Scores of 5 to 3 are categorized as correct translations, and scores of 2 and 1 as wrong translations. Even when a subtitle is grammatically wrong or awkward, it is given 3 points or more if it is understandable and usable in an informal situation. As a final step, the subtitles in the two lowest groups, those scored 1 or 2, are analyzed to determine which POS has the most common errors.
II. LITERATURE REVIEW
1. Korean Morphological Analysis
Korean is an agglutinative language which is like “beads on a string,” with highly productive morphology and a rich set of derivational and inflectional suffixes. Huh and Laporte (2005) note, “derivational suffixes are markers of verbalization, adjectivization, and adverbialization. They are appended by applying transducers. Inflectional suffixes comprise all other types of suffixes. A single stem can be combined with up to 5,500 different sequences of inflectional suffixes” (p. 4). All parts of speech and words are built by suffixation from a noun or verb root, and words are formed by accumulating morphemes in a regulated order. As an example, 가 + 시 + 겠 + 습니 + 까 (go + honorific + future + highest + interrogative) consists of a verb stem and four morphemes. In addition, Korean has case-marking affixes that mark three cases, including nominative and accusative. These affixes are attached to nouns and mark the case of the noun. Verbs inflect for tense, mood, and honorific level (highest, high, middle, and low). According to Davis (2007), “this rich morphology and its agglutinative nature poses a formation problem, as in order to derive the full meaning of a highly inflected form, it is necessary to segment each morpheme and derive the full meaning from these morphemes” (p. 4).
Morphological analysis is the description of the internal structure of words and parts of words, such as root words, stems, prefixes, affixes, and suffixes. Park (2011) shows that the features of Korean morphological analysis are as follows:
Ambiguity of part-of-speech
Ambiguity of segmentation of morpheme
He mentions that ‘가시는’ in Korean has several meanings. As mentioned above, Koreans have difficulty explaining these distinctions to foreigners because the same word ‘가시는’ varies depending on the context. Even with exactly the same spelling, the word can mean different things depending on the sentence and situation. For example, ‘가시 /noun + 는 /josa’ is a noun form, whereas in other sentences the word is a verb or adjective. In addition, ‘가시 /verb + 는 /eomi,’ ‘가 /verb + 시 /eomi + 는 /eomi’ and ‘갈 /verb + 시 /eomi + 는 /eomi’ can represent different verbs as well. Moreover, since the Korean language does not have clear word boundaries, translating into and from Korean is difficult. As with many other languages, ambiguity in Korean causes major problems in POS tagging and translation into English.
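The two kinds of ambiguity can be made concrete with a small sketch that enumerates the readings of ‘가시는’ given above. The lexicon, the coarse POS labels, and the attachment rule (josa after a noun, eomi after a verb or another eomi) are assumptions introduced only for illustration; they are not Park's (2011) method or the behavior of any particular analyzer.

```python
from itertools import product

# Hypothetical toy lexicon: surface piece -> possible (base form, coarse POS) readings.
LEXICON = {
    "가시": [("가시", "noun"), ("가시", "verb")],
    "가": [("가", "verb"), ("갈", "verb")],  # the stem 갈- surfaces as 가 before -시-
    "시": [("시", "eomi")],
    "는": [("는", "josa"), ("는", "eomi")],
}

# Assumed attachment rule: josa follows a noun; eomi follows a verb or another eomi.
ALLOWED_PREVIOUS = {"josa": {"noun"}, "eomi": {"verb", "eomi"}}


def splits(text):
    """Yield every way to cut text into contiguous pieces found in LEXICON."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in LEXICON:
            for tail in splits(text[end:]):
                yield [head] + tail


def is_well_formed(reading):
    """Check that the reading starts with a stem and obeys the attachment rule."""
    if reading[0][1] not in {"noun", "verb"}:
        return False
    for previous, current in zip(reading, reading[1:]):
        if previous[1] not in ALLOWED_PREVIOUS.get(current[1], set()):
            return False
    return True


def analyses(eojeol):
    """Return all segmentation and POS readings licensed by the toy lexicon."""
    results = []
    for pieces in splits(eojeol):
        for reading in product(*(LEXICON[piece] for piece in pieces)):
            if is_well_formed(reading):
                results.append(list(reading))
    return results


for reading in analyses("가시는"):
    print(" + ".join(f"{form}/{pos}" for form, pos in reading))
```

Under these assumptions the sketch yields four readings for the single spelling ‘가시는’, which is the situation a tagger has to resolve from context.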
2. POS Tagging
Words in a written text carry basic information about part of speech (POS). POS tagging marks each morpheme according to its meaning in context and yields a list of tagged morphemes. Dincer, Karaoglan, and Kisla (2008) state, “···the terms ‘tag’ and ‘tagging’ which are in fact interchangeable with ‘code’ and ‘codes’ …, and it is the first rule-based tagging program based on a large set of hand-constructed rules and a small lexicon to handle the exceptions” (p. 680). Taggers have been developed for many different languages.
In Korean, the unit separated by white space is called an eojeol (so-called word-phrase), which is usually made up of a content word and one or several function words such as josa (postposition) or eomi (ending). For example, the word 컸다:kôssta ‘was big’ has a stem, ㅋ:k- ‘big’, and two functional morphemes, -었:-ôss- (past) and -다:-ta (declarative). In contrast to functional morphemes, stems are all lexical morphemes. The surface form of a morpheme in a sentence may depend on adjacent morphemes and may differ from its base form or lexical form. For instance, the surface form of the stem ‘big’ is ㅋ:k- before the suffix -었:-ôss- (past), but its base form is 크:keu-. Changing the ending can change the verb tense, and this often requires changing the final consonant. That is, variants of the first syllable that express tense, such as 컸 or 클, are surface forms. The variations are called phonotactic, and the surface variants are called allomorphs. Morphological analysis ascertains morphemes and assigns pertinent information to them. Nguyen, Vo, Shin, Tran and Ock (2019) remark, “Korean morphological analysis is to decompose a word-phrase into morphemes with three processes as follows”:
Separating a word-phrase into morphemes;
Recovering the original form for changed phonemes;
Tagging the POS to each morpheme (Nguyen et al., 2019, p. 414)
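The three processes listed above can be illustrated with a minimal sketch for the single example 컸다 discussed earlier. It is not the algorithm of Nguyen et al. (2019) or of Kokoma: the two tiny lexicons and the one contraction rule (a stem ending in ㅡ fused with the past marker -었-) are assumptions made for illustration, and the tag names are borrowed from the KKMA tag set.

```python
import unicodedata

# Toy lexicons (hypothetical, for illustration only); tags follow the KKMA tag set.
STEMS = {"크": "VA"}      # 크- 'big', adjective stem
ENDINGS = {"다": "EFN"}   # -다, declarative final ending

VOWEL_EO = "\u1165"   # conjoining jamo for the vowel ㅓ
VOWEL_EU = "\u1173"   # conjoining jamo for the vowel ㅡ
FINAL_SS = "\u11BB"   # conjoining jamo for the final consonant ㅆ


def analyze(eojeol):
    """Return (morpheme, tag) pairs, or [] when the single toy rule does not apply."""
    if not eojeol:
        return []
    first, rest = eojeol[0], eojeol[1:]

    # Process 1: separate the word-phrase. The first syllable is decomposed
    # into its jamo to see whether a stem and the past marker are fused in it.
    jamo = unicodedata.normalize("NFD", first)
    if len(jamo) != 3 or jamo[1] != VOWEL_EO or jamo[2] != FINAL_SS:
        return []

    # Process 2: recover the original form. The stem vowel ㅡ was dropped
    # before -었-, so it is restored on the initial consonant.
    stem = unicodedata.normalize("NFC", jamo[0] + VOWEL_EU)
    if stem not in STEMS or rest not in ENDINGS:
        return []

    # Process 3: tag each morpheme.
    return [(stem, STEMS[stem]), ("었", "EPT"), (rest, ENDINGS[rest])]


print(analyze("컸다"))
```

For 컸다 the sketch returns the base-form stem 크 tagged VA, the past marker 었 tagged EPT, and the ending 다 tagged EFN. A real analyzer needs a full lexicon and many more restoration rules than this one.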
This research uses the Kokoma Korean morphological analyzer for tagging. The following is an example using Kokoma:
>>> from konlpy.tag import Kkma
>>> from konlpy.utils import pprint
>>> kkma = Kkma()
>>> # Tag every morpheme of the sentence with its Kokoma POS tag
>>> pprint(kkma.pos('한국어 분석은 재밌습니다!'))
[('한국어', 'NNG'), ('분석', 'NNG'), ('은', 'JX'), ('재밌', 'VA'), ('습니다', 'EFN'), ('!', 'SF')]
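Once a sentence is tagged, the josa and eomi morphemes can be singled out by their tags. The sketch below works on the tagged output shown above, typed in as literal data so that it runs without KoNLPy. The two tag sets are the nine josa and fourteen eomi tags listed in the Method section; the function name is hypothetical.

```python
from collections import Counter

# The nine josa tags and fourteen eomi tags of the Kokoma tag set used in this study.
JOSA_TAGS = {"JKS", "JKC", "JKG", "JKO", "JKM", "JKI", "JKQ", "JC", "JX"}
EOMI_TAGS = {
    "EPH", "EPT", "EPP",
    "EFN", "EFQ", "EFO", "EFA", "EFI", "EFR",
    "ECE", "ECS", "ECD",
    "ETN", "ETD",
}

# Tagged output of the example sentence, copied from the transcript above.
tagged = [
    ("한국어", "NNG"), ("분석", "NNG"), ("은", "JX"),
    ("재밌", "VA"), ("습니다", "EFN"), ("!", "SF"),
]


def count_function_morphemes(pairs):
    """Count josa and eomi tags separately in a list of (morpheme, tag) pairs."""
    josa = Counter(tag for _, tag in pairs if tag in JOSA_TAGS)
    eomi = Counter(tag for _, tag in pairs if tag in EOMI_TAGS)
    return josa, eomi


josa_counts, eomi_counts = count_function_morphemes(tagged)
print(dict(josa_counts), dict(eomi_counts))
```

Explicit tag sets are used instead of a test on the first letter so that the counts stay tied to the tags named in the study.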
3. Differences Between Languages
Shaffer (2015) remarks, “Many SVO languages are prepositional, while many SOV languages are postpositional … Korean is both SOV and postpositional, while English is both SVO and prepositional” (p. 221). Accordingly, errors appear as the dropping of the English preposition, since Korean has no equivalent structural constituent preceding the noun:
Korean (SOV): Gene-i hakgyo-e gass-da.
English (SVO): Gene went to school.
English Error: Gene went school. (Adapted from Shaffer, 2002, p. 221)
Errors with the Korean postpositional particle -eui (a) arise when the Korean L2 learner keeps the noun order of the Korean (a) in English and so produces the erroneous form (c). Jimseung means ‘beasts’ and wang means ‘king’.
(a) Korean (SOV): jimseung-eui wang
(b) English (SVO): king of beasts
(c) English Error: beast of king
Therefore, this study addresses the following research questions:
Which have more errors in English, content words or function words?
Which of the seven divisions (N, V, M, I, J, E, X) has the most frequent errors in Korean?
What are the common errors among josa and eomi in Korean?
Accordingly, it is hypothesized that translation between Korean and English will cause errors in the postpositional particle. The results support this hypothesis.
III. METHOD
Korean-to-English subtitles are grammatically tagged by POS in order to determine which POS has the most frequent errors in each language. The procedure is as follows.
The English subtitles are parsed by the POS tagger, which reads English text and assigns a part of speech, such as noun, verb, or adjective, to each word. The POS tagger is a natural language parser program that analyzes the grammatical structure of sentences and is called the Stanford Log-Linear Part-Of-Speech Tagger. The data were analyzed and categorized into 31 groups by part of speech: DT, RP, CD, NN, NNS, NNP, NNPS, EX, PRP, PRP$, POS, RBS, RBR, RB, JJS, JJR, JJ, MD, VB, VBP, VBZ, VBD, VBN, VBG, WDT, WP, WP$, WRB, TO, IN, and CC. This categorization is included in the Appendix. The Korean subtitles were first processed in Openkoreatext, an open-source Korean text processor that handles Korean normalization and tokenization. They were categorized into 14 groups by part of speech: nouns, verbs, adjectives, adverbs, determiners, exclamations, josa, eomi, preeomi, conjunction, modifier, verb prefix, and suffix. A more detailed and specific analyzer is then required for postpositional particles such as josa and eomi.
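The first research question contrasts content words with function words, so the 31 English tags have to be divided into those two classes. The sketch below shows one possible division. Treating the noun, verb, adjective, and adverb tags as content words follows the study's own description; placing every remaining tag (including CD, MD, TO, EX, and POS) in the function class is an assumption, since the study does not say how those tags were assigned.

```python
# The 31 English POS tags reported in this study.
PENN_TAGS = [
    "DT", "RP", "CD", "NN", "NNS", "NNP", "NNPS", "EX", "PRP", "PRP$",
    "POS", "RBS", "RBR", "RB", "JJS", "JJR", "JJ", "MD", "VB", "VBP",
    "VBZ", "VBD", "VBN", "VBG", "WDT", "WP", "WP$", "WRB", "TO", "IN", "CC",
]

# Nouns, verbs, adjectives, and adverbs count as content words.
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")


def word_class(tag):
    """Return 'content' or 'function' for an English POS tag (assumed split)."""
    return "content" if tag.startswith(CONTENT_PREFIXES) else "function"


content_tags = [tag for tag in PENN_TAGS if word_class(tag) == "content"]
function_tags = [tag for tag in PENN_TAGS if word_class(tag) == "function"]
print(len(content_tags), "content tags:", content_tags)
print(len(function_tags), "function tags:", function_tags)
```

A different treatment of borderline tags such as CD or MD would shift the balance between the two classes, so the split should be stated alongside any content-versus-function comparison.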
A Kokoma Korean morphological analyzer is used; a morphological analyzer is one of the most essential components of natural language processing systems. This is an open tool available for online use and as a downloadable application. The Kokoma Korean morphological analyzer tags the Korean script as nouns, verbs, adjectives, and so on. The tags were categorized into 45 groups by part of speech: NNG, NNP, NNB, NNM, NR, NP, VV, VA, VXV, VXA, VCP, VCN, MDN, MDT, MAG, MAC, IC, JKS, JKC, JKG, JKO, JKM, JKI, JKQ, JC, JX, EPH, EPT, EPP, EFN, EFQ, EFO, EFA, EFI, EFR, ECE, ECS, ECD, ETN, ETD, XPN, XPV, XSN, XSV, XSA, and XR. This categorization includes nine subsets in josa and 14 in eomi, which are counted in each sentence. The data reveal which ones are the most frequent and the least frequent in English and Korean. In this study, all punctuation was removed from the data (see Notes).
The subtitles are scored by Korean-American bilinguals. The scores range from “most correct” (5 points) to “least correct” (1 point), with scores of 5 to 3 indicating correct translations and scores of 2 and 1 indicating wrong translations. Even when a subtitle is grammatically wrong or awkward, it is given three points or more if it is understandable and usable in an informal situation. The subtitles in the two lowest groups, those scored 1 or 2, are analyzed to determine which POS has the most common errors.
The four participants who score the subtitles are bilinguals who teach English at a Korean university. They have lived in America for over five years and speak both Korean and English fluently.
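The scoring and error-tallying procedure described above can be sketched as follows. The sample ratings and tag sequences are invented for illustration, and the function names are hypothetical. The sketch takes one score per subtitle; how the four raters' scores were combined is not specified here, so that step is left out.

```python
from collections import Counter


def label(score):
    """Map a 1-5 rating to the two classes used in the study."""
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")
    return "correct" if score >= 3 else "wrong"


def tally_error_tags(scored_subtitles):
    """Count POS tags over the subtitles rated 1 or 2 only."""
    counts = Counter()
    for score, tags in scored_subtitles:
        if label(score) == "wrong":
            counts.update(tags)
    return counts


# Hypothetical ratings and tag sequences, for illustration only.
sample = [
    (5, ["NNG", "JX", "VA", "EFN"]),
    (2, ["NNG", "JX", "VV", "EPT", "EFN"]),
    (1, ["NP", "JKO", "VV", "EPT", "ECE"]),
]

error_counts = tally_error_tags(sample)
print(error_counts.most_common(3))
```

Only the subtitles labeled wrong contribute to the tally, which mirrors the final step in which the groups scored 1 and 2 are examined for their most common POS.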
1. Data Collection
The English subtitles and the Korean script of the Korean TV drama were first recorded in an Excel document and analyzed by the researcher. There are over ten websites in the United States running Korean TV dramas on the Internet: Hulu.com, Viki.com, Dramafever.com, mysoju.com, mykofan.com, hacienma.net, gooddrama.net, dramacrazy.com, mvibo.com, etc. Among them, the Korean-English subtitles of Hulu.com and Viki.com are examined. Hulu is an ad-supported on-demand streaming video website of TV shows and movies; it is currently offered only to users in the U.S. and is blocked by IP address for locations outside the U.S. Viki is a video streaming website based in Singapore that provides on-demand streaming video of TV shows and movies from around the world. It is also the first and fastest platform for subtitling of video, counting on a community of thousands of volunteer translators (“Viki”, n.d.).
The Korean romantic drama That Winter, the Wind Blows, a TV series starring Jo In-sung and Song Hye-kyo, was broadcast by SBS from February to March 2013. The story is about a gambler who pretends to be the lost brother of a blind heiress, but the two fall truly in love after getting to know each other. The drama has a total of 16 episodes, and the objects of this study are episodes one and eight. Episode one has 467 subtitles on each of Hulu.com and Viki.com, and episode eight has 385 on each. The total number of subtitles is 1,704, that is, 2 × (467 + 385).
2. Data Analysis
This research shows that in English, content words such as nouns, verbs, adjectives, and adverbs have even more frequent errors than function words such as articles, conjunctions, prepositions, and pronouns. The most frequent errors, in rank order, are PRP, NN, VB, RB, DT, and IN. PRP and NN are ranked first and second, respectively (see Table 1).
According to the Openkoreatext analysis, the most frequent errors in Korean occur in content words such as nouns, verbs, and adjectives, in line with the English results. What is notable in Korean is the errors in josa and eomi, the postpositional elements characteristic of an agglutinative language. Josa is a Korean noun ending while eomi is a Korean verb ending; a high error frequency for josa and eomi is therefore unavoidable, because they are morphologically dependent on stems such as nouns and verbs (see Figure 3).
At this point in the research, josa and eomi need to be investigated in more detail and more specifically. The version of Kokoma (KKMA) provided by KoNLPy has a large set of 56 tags versus 14 tags for the Openkoreatext. In this research, the KKMA Korean morphological analyzer is used for josa and eomi. Josa is subcategorized into nine groups (JKS, JKC, JKG, JKO, JKM, JKI, JKQ, JC, and JX), and the data are as follows (see Table 2):
The data show that the most frequent errors are JX, JKM, JKO, and JKG. Eomi, the verb ending, is broadly categorized into four groups: EP, EF, EC, and ET.
Eomi is subcategorized into fourteen groups (EPH, EPT, EPP, EFN, EFQ, EFO, EFA, EFI, EFR, ECE, ECS, ECD, ETN, and ETD), and the data are as follows:
The above result reveals that the most frequent error is EFN, at 3.75%, followed in order by EPT, EPH, ECE, and ETD.
IV. DISCUSSION
1. Josa (Korean Noun-Ending)
Josa, the Korean noun ending, is a particle that comes after a word or clause to indicate a certain relationship with another word in the sentence or to add a certain meaning to a word in the sentence. It comprises case particles, connective particles, and auxiliary particles. Particles are not used alone; they must be attached to the end of a noun.
The form of a particle does not change with its position. The most frequent errors involve JX, which carries meanings like those of adverbs or English prepositions and completes the meaning of the preceding word. It can be added to objects, complements, and adverbs as well as subjects, so readers and speakers should pay attention to JX.
JK (josa for case marker), a case particle affixed to the end of a noun, gives the noun a certain role in the sentence. As mentioned before, it has seven particles, and it accounts for frequent errors such as JKM, JKO, and JKG. JKM, the adverbial particle, is used in a wide range of forms such as -와/-과, -에(서), -(으)로, and -에서, and turns the word it attaches to into an adverbial.
JKO can be divided into the direct object particle and the indirect object particle. They make the preceding word the object of the sentence. The direct object particles -을/-를 are phonological allomorphs. Although they have different forms, they have the same function in the sentence, which could lead to confusion among speakers and writers.
JKG, the genitive postposition, is like an apostrophe + s (’s); it makes the preceding word an adnominal phrase of the sentence and is often called the adnominal particle. However, -의, the Korean genitive postposition, has diverse meanings depending on the surrounding words, such as ‘have by someone’, ‘written by someone’, and ‘about something’. Consequently, it could lead to wrong translation.
2. Eomi (Korean Verb Ending)
A Korean verb must be affixed with an ending to form an eojeol. Eomi, the Korean verb ending, is broadly divided according to position and function: EP is defined by position, and EF, EC, and ET by function. When the verb stem is affixed with EF or EC, it functions as a predicate. When it is combined with ET, on the other hand, it functions as a modifier.
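The grouping just described can be written down as a small lookup from an eomi tag to its group and role. The group names and roles are taken from this section; the label "connective ending" for EC and the function name are assumptions made for the sketch.

```python
# Eomi groups keyed by the first two letters of the tag.
# Each value is (group name, role of the inflected form in the sentence).
EOMI_GROUPS = {
    "EP": ("prefinal ending", "placed between the stem and the final ending"),
    "EF": ("final ending", "predicate"),
    "EC": ("connective ending", "predicate"),
    "ET": ("conversion ending", "modifier"),
}


def describe_eomi(tag):
    """Return (group, group name, role) for an eomi tag such as 'EPT' or 'ETD'."""
    group = tag[:2]
    if group not in EOMI_GROUPS:
        raise ValueError(f"{tag!r} is not an eomi tag")
    name, role = EOMI_GROUPS[group]
    return group, name, role


for tag in ["EPT", "EFN", "ECE", "ETD"]:
    print(tag, describe_eomi(tag))
```

Because every eomi tag begins with its group code, the four-way grouping can be recovered from the fourteen subcategory tags without a separate table.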
EP is a prefinal ending that attaches to verbs and is placed in the middle of the word, between the stem and the final ending. It is subcategorized into three groups: EPH, EPT, and EPP. EPH is an honorific prefinal ending, such as -시- (-si-) and -사오- (-sao-), with which the speaker honors the subject of the sentence, for example an older person. EPT is a tense ending, such as -은- (-eun-) / -는- (-neun-) (present), -었- (-ôss-) (past), and -겠- (-gess-) (future), and EPP is a polite particle such as -옵- (-ob-). The remaining eomi are categorized into three groups by function (EC, EF, and ET), which are subcategorized into eleven groups. The findings show that EPT is the most frequent ending among EPH, EPT, and EPP.
As shown in Table 8, among the three kinds of tense, most errors occur in the past tense, -었- (-ôss-), which is EPT.
EF (final ending) is attached at the end of the sentence; it completes the sentence and indicates the sentence type. EF distinguishes six sentence types: EFN, EFQ, EFO, EFA, EFI, and EFR. EFN comprises sixty declarative endings. Consequently, the most frequent error among them, by a large gap, is EFN, such as -다 (-da), -네 (-ne), -지 (-ji), and -어 (-eo).
EC connects the preceding words (word, phrase, or clause) to the following words and has the most frequent errors. It is subcategorized into ECD, ECE, and ECS. The most frequent errors occur in ECD, which has a linking role, such as -지 (-ji), -아 (-a), -니 (-ni), -는데 (-neunde), and -게 (-ge). ECD is so diverse that it is more likely to cause errors in translation.
ECE is the coordinative connective ending, which follows the stem and connects the preceding words to the following words coordinately, such as -면 (-myeon), -고 (-go), -(으)며 (-(eu)myeo), -(으)나 (-(eu)na), -지만 (-jiman), and -건 (-geon); it has the third most frequent errors. These endings list two or more facts and connect several clauses with similar or contrasting meanings.
ECS is the subsidiary linking eomi, such as -아/-어/-여 (-a/-eo/-yeo), -게 (-ge), and -지고 (-jigo), and has the fifth most frequent errors. The linking eomi -어 in Korean can be declarative or interrogative depending on the type of sentence.
ET is called the conversion ending; it is attached to the stem of a verb or adjective and makes it perform the function of another part of speech. ET is divided into ETN and ETD. ETN is attached to the verb stem and changes it into a word that functions as a noun, serving as subject, object, or adverb. ETD is attached to the verb stem and changes it into an adnominal. ETD can be grouped by tense and aspect: present-continuous, past-perfect, past-retrospect, past perfect-retrospect, and future-guessing/will/ability. This variety makes it confusing to use. The data show that the most frequent errors are in ETD.
V. CONCLUSION
The goal of this study was to find out why errors occur frequently in the subtitles of Korean TV dramas. The findings showed that content words have even more frequent errors than function words in English, as hypothesized. The most frequent errors were ranked PRP, NN, VB, RB, DT, and IN. Likewise, the most frequent errors in Korean were in content words such as nouns, verbs, and adjectives. The error analysis through POS tagging and scoring showed that the most common errors are in the Korean function words eomi and josa. To gain more specific and detailed findings, josa was subclassified into nine groups and eomi into four groups with the KKMA Korean morphological analyzer. The most frequent errors in josa are, in order, JX, JKM, JKO, and JKG among the nine groups, whereas the frequent errors in eomi are EPT, EPH, ECE, and ETD. As mentioned before, Korean has rich morphology and an agglutinative nature, which is regarded as the hardest aspect of translating Korean into English or vice versa. Morphological analysis ascertains morphemes and assigns pertinent information to them.
This study was conducted on Korean among the agglutinating languages; if it were conducted on other agglutinating languages such as Turkish and Japanese, it would be meaningful and worthwhile for informing second-language teaching methods. In addition, it is suggested that future studies build on this one. For example, after L2 learners are taught the basics of the Korean function words josa and eomi, they could be tested on translating Korean sentences into English to demonstrate their “before and after” degree of improvement in English writing. Further, because function words are so important in Korean-to-English translation, our findings are also useful for machine translation of agglutinating languages in the Fourth Industrial Revolution. Ryoo and Cho (2017) mention, “translation is a special kind of cross-cultural communication, involving two linguistically different cultures. Cross-cultural awareness means the translator’s perception of the cultural elements of the languages involved in the process of translation” (p. 96). This could have a critical effect on the quality of subtitles translated from Korean to English, as Korean culture and media have become one of the most important exports in the world.
Notes
Please see the Appendix for the complete Kokoma Korean POS tag comparison chart.