J Eng Teach Movie Media > Volume 26(4); 2025 > Article
Woo and Lee: Incidental Learning of Multi-Word Expressions Through Watching a Video in Different Captioning Modes*

Abstract

Although multi-word expressions (MWEs) are widely recognized as reliable indicators of language mastery (Dang et al., 2021; Majuddin et al., 2021; Puimège & Peters, 2019), findings from empirical research examining both receptive and productive knowledge of MWEs across on-screen caption modes remain inconsistent. The present study aimed to address this uncertainty by focusing on Korean-speaking novice English learners and their incidental acquisition of MWEs, measured through form recall and form recognition under three caption conditions: no captions, whole captions, and MWE-enhanced captions. A total of 55 undergraduates from various disciplines in South Korea participated in a three-session design with intervals of three to four weeks. The analysis included data from the 50 participants who completed the form recall and form recognition tests in all sessions. Results indicated that all groups achieved modest improvement, but there were no statistically significant differences between captioning types. Nonetheless, the MWE-enhanced caption group performed slightly better than the others, suggesting modest pedagogical promise for this type of input. A slight gap between MWE gains in form recognition and form recall highlights the instructional relevance of these two dimensions of MWE learning. Overall, more research is needed to explore how recognition and recall of MWEs develop across different proficiency levels and input modes.

I. INTRODUCTION

Multi-word expressions (MWEs), also known as multi-word sequences, have been extensively examined in pursuit of second language (L2) mastery (Dang et al., 2022a, 2022b; Fernandez & Schmitt, 2015; Majuddin et al., 2021; Saito & Liu, 2022; Shin et al., 2023; Siyanova-Chanturia & Pellicer-Sánchez, 2019). MWEs are lexical units consisting of more than one word, and are also referred to as n-grams and formulaic sequences. They can be categorized in terms of frequency, length, fixedness, abstractness, and figurativeness/literality. Common types of MWEs, such as collocations, idioms, binomials, lexical bundles, and other phrasal elements, are pervasive and occur repeatedly in languages (Siyanova-Chanturia & Pellicer-Sánchez, 2019; Siyanova-Chanturia & Sidtis, 2019). The acquisition of MWEs is considered a strong predictor of nativelike proficiency in foreign languages (Boers et al., 2014; Nation, 2013; Siyanova-Chanturia & Pellicer-Sánchez, 2019; Tavakoli & Uchihara, 2020; Vu et al., 2023; Yu et al., 2025).
Despite the recognized importance of MWEs, L2 learners often struggle to acquire them (Dang et al., 2022a; Foster et al., 2014; Nguyen & Webb, 2017). Several factors may cause this difficulty, including low frequency (Peters & Webb, 2018; Puimège et al., 2021), situational constraints in EFL settings (Dang et al., 2022a, 2022b; Peters & Webb, 2018; Pu et al., 2024; Puimège & Peters, 2020), and the figurative nature of MWEs (Majuddin et al., 2021). Because of the complexity of MWEs’ figurative meanings, many researchers have suggested explicit teaching (Boers & Lindstromberg, 2012). However, limited classroom time often prevents L2 learners from receiving sufficient exposure to MWEs (Nguyen & Webb, 2017; Puimège & Peters, 2020). Appropriate and abundant input has been proposed as essential for language mastery (Krashen, 1985). In this regard, incidental MWE learning through viewing has been explored and shown to yield some degree of MWE intake (Dang et al., 2022a, 2022b; Majuddin et al., 2021; Puimège et al., 2021; Puimège & Peters, 2019, 2020; Webb & Chang, 2022).
Previous research has examined incidental MWE learning through various input modes (Dang et al., 2022a, 2022b; Pu et al., 2024; Webb & Chang, 2022). Some studies have explored incidental gains in MWE form recognition through unimodal or multimodal input: reading-only, listening-only, reading-while-listening, watching with no captions, and watching with captions (Dang et al., 2022a, 2022b; Pu et al., 2024; Webb & Chang, 2022). Other studies have explored incidental MWE learning in form recall under different on-screen language modes (Majuddin et al., 2021; Puimège & Peters, 2020). Although previous research has revealed that learners’ proficiency can affect the gap between productive (e.g., form recall) and receptive (e.g., form recognition) knowledge (Lee, 2025), few studies have simultaneously investigated both receptive and productive aspects of incidental MWE acquisition.
To address this gap, the present study analyzed the incidental learning of MWEs by novice EFL learners in both form recall and form recognition across three different caption modes through audio-visual input. This study holds meaningful pedagogical potential for English teachers seeking to provide authentic and autonomous learning. Moreover, it can contribute to understanding the pedagogical significance of incidental MWE learning with different captioning types.

II. LITERATURE REVIEW

1. Incidental Learning of MWEs

Learning MWEs is considered essential for achieving L2 mastery (Boers et al., 2014; Nation, 2013; Siyanova-Chanturia & Pellicer-Sánchez, 2019; Tavakoli & Uchihara, 2020; Vu et al., 2023; Yu et al., 2025), yet it remains a challenge for L2 learners (Majuddin et al., 2021; Nguyen & Webb, 2017; Peters & Webb, 2018; Pu et al., 2024; Puimège et al., 2021; Puimège & Peters, 2020). In particular, insufficient exposure to the target language in EFL settings restricts learners’ opportunities to learn MWEs. Therefore, many studies have examined the effectiveness of MWE intake through various modes of input, and findings have shown positive learning outcomes (Dang et al., 2022a, 2022b; Majuddin et al., 2021; Pu et al., 2024; Puimège et al., 2021; Puimège & Peters, 2019, 2020; Teng, 2019; Webb & Chang, 2022). Studies comparing input modes such as reading and viewing have provided valuable insights into how different learning conditions influence MWE acquisition (Dang et al., 2022a, 2022b; Pu et al., 2024; Webb & Chang, 2022).
Dang et al. (2022a) compared the effects of various modes on MWE intake using unimodal input (e.g., reading, listening) and multimodal input (e.g., reading-while-listening, viewing with/without captions). The results showed that reading-only and viewing brought about higher MWE gains in form recognition than reading-while-listening and listening-only. Pu et al. (2024) obtained a similar result with three input modes in their research on incidental MWE gains. Learners across proficiency levels improved their incidental MWE knowledge, and multimodal input, such as viewing with captions, yielded greater gains in receptive and productive MWE knowledge. In addition, higher L2 proficiency was found to predict greater MWE learning gains.
In contrast, Webb and Chang (2022) reported partially different results. They compared unimodal and bimodal input and demonstrated that the reading-while-listening mode benefited the learning of MWEs more than reading-only. Their finding may suggest that the prosodic features of MWEs affected acquisition (Hallin & Sidtis, 2015; Lin, 2018), as MWEs, as chunks, may be more easily processed while viewing and reading-while-listening. Since their research did not include audio-visual input, further studies are needed to clarify how such input compares across modes and which modes are most conducive to MWE learning gains.
The efficacy of multimedia input has led to more research on incidental MWE acquisition under different on-screen language modes (Majuddin et al., 2021; Puimège & Peters, 2020; Teng, 2019). Puimège and Peters (2020) found that viewing without captions improved MWE knowledge, especially in form recall, with gains of approximately 30%, which were greater than those observed for meaning recall without captions. However, other studies have reported that captions can facilitate incidental MWE acquisition (Dang et al., 2022a; Majuddin et al., 2021; Pu et al., 2024; Teng, 2019).
Teng (2019) investigated the impact of three captioning modes (i.e., full captioning, keyword captioning, and no captioning) on multiple dimensions of MWE knowledge through video viewing. Receptive knowledge was measured through a multiple-choice test of form recognition, while productive knowledge was assessed through a form recall task requiring learners to write MWEs. Viewing with full captions produced the greatest learning gains, followed by keyword captions and no captions. However, form recall showed no significant difference among captioning types.
Majuddin et al. (2021) experimented with three different language modes (no captions, normal captions, and MWE-enhanced captions) in combination with two viewing frequencies (viewing once or twice). Their findings showed that participants who watched twice in both caption modes performed better in form recall. Under the single-viewing condition, the normal caption group outperformed the MWE-enhanced group. However, with two viewings, the MWE-enhanced captions resulted in better performance than full captions. These findings suggest that typographic enhancement may require repeated exposure to effectively draw learners’ attention and facilitate noticing.
In summary, incidental MWE acquisition has been investigated across unimodal and multimodal input. Viewing and viewing with captions have proven effective for improving form recognition (Dang et al., 2022a; Pu et al., 2024). However, findings regarding language modes on screen and their effects on form recall remain inconsistent. Few studies have simultaneously addressed incidental MWE acquisition in both form recall and recognition across different language modes. The purpose of this study is to investigate whether on-screen language modes while viewing impact incidental MWE learning in both form recall and recognition among Korean-speaking novice English learners.

2. Multi-Word Expressions

MWEs are combinations of words that commonly appear as fixed units (Siyanova-Chanturia & Pellicer-Sánchez, 2019; Webb et al., 2013). A considerable portion of MWEs in the L2 poses challenges for learners trying to communicate effectively. MWEs include various lexical types, such as collocations (e.g., “have a thing for,” “drink a toast”), verb phrases (e.g., “look up”), binomials (e.g., “ladies and gentlemen”), proverbs (e.g., “the early bird catches the worm”), and lexical bundles (e.g., “by the way”) (Siyanova-Chanturia & Sidtis, 2019).
L2 learners find it challenging to learn multi-word expressions because of incongruent vocabulary between the L1 and the L2. Not only do the figurative features of MWEs pose a challenge to L2 learners, but their low frequency also increases the learning burden (Macis et al., 2021). Therefore, this study chose target MWEs whose component words frequently occur together and that have high semantic transparency (Webb et al., 2013). Mutual information (i.e., a measure of collocational strength) was employed to ensure ecological validity. Moreover, non-target MWEs were included to encourage participant engagement and minimize potential floor effects.
The acquisition of MWEs has attracted attention from second language acquisition (SLA) scholars with regard to language processing and output (Dang et al., 2022a; Pu et al., 2024; Van Patten, 2007). Recent research has employed multimodal input to investigate the learning of multi-word expressions. Multimedia allows the presentation of nonverbal and verbal information simultaneously, potentially enhancing information processing (Pu et al., 2024). However, multimodal conditions, such as viewing with captions, can also impose additional cognitive load (Gass et al., 2019). This study investigated whether captioning types in multimedia influence the acquisition of target MWEs.

3. Vocabulary Acquisition Through Captioned Video

A plethora of studies have demonstrated the advantages of viewing for vocabulary development (Dang et al., 2022a, 2022b; Majuddin et al., 2021; Montero Pérez & Rodgers, 2019; Montero Pérez et al., 2018; Peters & Webb, 2018; Teng, 2019, 2022). Studies have examined different captioning modes, such as full and keyword captions (Dang et al., 2022a, 2022b; Gass et al., 2019; Majuddin et al., 2021; Montero Pérez et al., 2018; Peters, 2018; Pujadas, 2019; Pujadas & Muñoz, 2020; Puimège & Peters, 2020; Teng, 2019, 2022; Yu et al., 2025). The findings on the effects of different captioning modes on vocabulary learning remain inconsistent. For example, Peters (2018) indicated that intermediate-level learners could benefit more from videos without captions, while learners with limited L2 exposure are advised to use captions as an effective supplementary resource (Lindgren & Muñoz, 2012; Suarez & Gesa, 2019). Captioned videos are argued to support comprehension and facilitate vocabulary acquisition (Gass et al., 2019; Montero Pérez et al., 2013; Puimège & Peters, 2020).
Such benefits may also extend to the learning of MWEs. Majuddin et al. (2021) experimented with the three language modes and different numbers of viewings. The two caption groups with double viewing outperformed the no caption group, with MWE-enhanced captions scoring slightly, but not significantly, higher than full captions. However, a single viewing with full captions resulted in better MWE gains than a single viewing with enhanced captions. The researchers suggested that, because scenes move quickly, learners have less time to process enhanced MWEs while viewing than they would while reading. Similarly, Teng (2019) examined the learning gains of verb-noun collocations using whole and keyword captions and found that the whole caption group outperformed the keyword (MWE) caption group. However, the two studies are difficult to compare. Majuddin et al. (2021) investigated the effect of captions on MWE acquisition in form recall (productive knowledge), but Teng (2019) focused on form recognition (receptive knowledge). Limited research has investigated which language modes most effectively facilitate MWE acquisition across both receptive and productive dimensions. To address this inconsistency, the present study aims to examine the effects of three distinct on-screen language modes on the acquisition of MWEs, targeting both form recall and form recognition.

4. Research Questions

The current study explores beginners’ incidental acquisition of MWEs through watching a TV program. A few studies have examined the incidental learning of MWEs with intermediate-level learners in various language modes (Majuddin et al., 2021), but it is uncertain whether audio-visual input would influence beginners’ learning gains of MWEs. Furthermore, this study aims to examine the difference between form recognition and form recall in the acquisition of MWEs. The research questions are as follows:
1) Does viewing a video in different captioning modes affect the incidental recognition of MWEs’ form?
2) Does viewing a video in different captioning modes affect the incidental recall of MWEs’ form?

III. METHODOLOGY

1. Participants

This research included 55 Korean-speaking adult EFL learners (33 males and 22 females). Five participants who did not complete all the sessions were excluded, resulting in 50 participants (No Caption = 16, Whole Caption = 18, MWE-Enhanced Caption = 16). The participants were undergraduates majoring in various disciplines, including Marine Biotechnology, Information & Communications Engineering, and Sports Science, and their ages ranged from 19 to 26. Three intact classes were each assigned to one of the language modes: No Caption, Whole Caption, or MWE-Enhanced Caption. Before the experiment, all the participants completed the Cambridge English Language Assessment test to determine their proficiency level. Table 1 presents the mean scores and standard deviations of their proficiency test results: No Caption (M = 8.38, SD = 2.65), Whole Caption (M = 8.50, SD = 2.12), and MWE-Enhanced Caption (M = 9.88, SD = 2.63).
All participants scored between 5 and 14 out of 25 on the Cambridge English Language Assessment, corresponding to the A2 (beginner) level of the CEFR, with one participant at B1. A one-way ANOVA showed no significant difference in proficiency among the three groups, F(2, 47) = 1.85, p = .16. Prior to the experiment, participants completed pretests on MWE form recall and recognition. As shown in Table 3, all groups scored low on the productive form recall pretest, indicating limited prior MWE knowledge. A pilot study with participants of similar proficiency had indicated that this proficiency level was unsuitable for measuring productive MWE learning through incidental exposure. To address this, an MWE form recognition test was included to assess receptive knowledge. Recognition scores were higher than recall scores, suggesting moderate receptive knowledge. The Enhanced Caption group showed the highest mean score (M = 3.06, SD = 2.46) on the form recognition pretest, followed by the Whole Caption group (M = 2.61, SD = 2.03) and the No Caption group (M = 2.38, SD = 2.36) (see Table 2). The form recall pretest, however, showed a different pattern: No Caption (M = .47, SD = 1.20); Whole Caption (M = .33, SD = .59); MWE-Enhanced Caption (M = .31, SD = .60) (see Table 3). The findings indicated that while participants’ productive skills were limited, their receptive knowledge provided a suitable basis for examining incidental MWE acquisition through captioned video input.
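The homogeneity check reported above can be approximately reproduced from the summary statistics in Table 1 alone. The following is a minimal sketch, not the authors' analysis script: the function name `anova_from_summary` is hypothetical, and because the inputs are rounded means and standard deviations, the result may differ from the reported value in the second decimal place.

```python
def anova_from_summary(groups):
    """One-way ANOVA F statistic from (n, mean, sd) tuples, one per group."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Between-groups sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    # Within-groups sum of squares, recovered from each group's sample SD.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# (n, M, SD) for No Caption, Whole Caption, MWE-Enhanced Caption (Table 1).
table1 = [(16, 8.38, 2.65), (18, 8.50, 2.12), (16, 9.88, 2.63)]
f_stat, df_b, df_w = anova_from_summary(table1)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

With the Table 1 values, this yields an F of roughly 1.86 on 2 and 47 degrees of freedom, in line with the reported F(2, 47) = 1.85.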

2. Audio-Visual Input

The American sitcom series How I Met Your Mother (Fryman, 2005) was selected as the input because it provides authentic discourse and rich exposure to MWEs (Montero Pérez & Rodgers, 2019). A 22-minute episode from the first series was chosen to match participants’ proficiency, as half of them knew between 2,000 and 3,000 of the most frequent words, making the episode largely comprehensible and engaging. Lexical analysis yielded 3,168 tokens and 736 types in AntConc (Anthony, 2024) and 3,151 tokens and 719 types in the Lexical Tutor profiler; the differences between the two tools were minor. Lexical profiling showed that the 3,000 most frequent words covered 87.5% (AntConc) and 94.8% (Lexical Tutor) of the episode, while even a 1,000-word base covered 91.5%. These results suggest that the episode’s vocabulary was appropriate for participants and suitable for supporting incidental MWE learning.
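The token, type, and coverage figures above come from dedicated tools, but the underlying computation is simple. The sketch below illustrates it under stated assumptions: the function `lexical_profile`, the sample sentence, and the tiny word list are all hypothetical, and the tokenization rule is only one of many possible choices. Differences in such rules are one plausible reason why two tools can report slightly different counts for the same transcript. Real profilers may also count word families rather than individual word forms.

```python
import re
from collections import Counter


def lexical_profile(transcript, frequent_words):
    """Return (tokens, types, coverage) for a transcript.

    Coverage is the share of running words (tokens) found in frequent_words.
    """
    # Assumed tokenization: lowercase alphabetic strings, keeping simple
    # contractions such as "don't" together as one token.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", transcript.lower())
    counts = Counter(tokens)
    covered = sum(c for word, c in counts.items() if word in frequent_words)
    coverage = covered / len(tokens) if tokens else 0.0
    return len(tokens), len(counts), coverage


# Hypothetical input, not taken from the episode transcript.
sample = "Bear with me. I have a thing for her, so bear with me."
wordlist = {"with", "me", "i", "have", "a", "thing", "for", "her", "so"}
n_tokens, n_types, coverage = lexical_profile(sample, wordlist)
print(n_tokens, n_types, round(coverage, 3))
```

Here every word except "bear" is on the hypothetical list, so 11 of the 13 running words are covered.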

3. Target Items (MWEs)

Watching English programs exposes learners to a large number of MWEs. Previous studies on MWEs have focused on specific types of collocations, such as ‘adjective + noun,’ ‘verb phrases,’ and ‘prepositional phrases,’ often by modifying texts to include sufficient target items. However, such materials lack ecological and authentic value (Majuddin et al., 2021; Peters et al., 2019; Puimège & Peters, 2019). The present study aimed to investigate incidental MWE acquisition from authentic input, providing both language and socio-cultural exposure. Following Majuddin et al. (2021), the study targeted 16 MWEs primarily consisting of phrasal verbs (e.g., “plan out,” “go out”) and idioms (e.g., “a long shot”). The transcript of the input was thoroughly analyzed to identify target MWEs likely to be unfamiliar to participants. The selected items were validated using online dictionaries, the Corpus of Contemporary American English (COCA), and Lexical Tutor. All MWEs had over 100 hits in COCA and mutual information (MI) values above 3, except ‘bear with’ (MI = 0.35), which was retained for ecological validity. A total of 27 MWEs were presented to participants to maintain attention and avoid floor effects (i.e., a condition where numerous scores cluster at the lowest possible point of measurement). After a pilot test with students of similar proficiency, 16 target MWEs were finalized for the form recall and recognition tests. Item frequency in the input varied, with one MWE occurring three times, five occurring twice, and the remainder occurring once (see Table 4).
For the Whole Caption group, the captions appeared at the bottom of the screen throughout the running time. For the MWE-Enhanced Caption group, all 27 MWEs (16 target and 11 nontarget) were additionally enhanced in bold, underlined, and shown in a different color from the other expressions.
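The MI threshold used for item selection can be illustrated with a short calculation. This is a sketch of the textbook pointwise mutual information formula, the base-2 logarithm of observed over expected co-occurrence frequency. It is an assumption that this variant matches the one behind the reported values, since corpus interfaces may apply additional adjustments such as a collocation span. The function name and all counts below are hypothetical and are not taken from COCA.

```python
import math


def mutual_information(pair_freq, w1_freq, w2_freq, corpus_size):
    """Pointwise MI: log2 of observed over expected co-occurrence frequency."""
    # Expected co-occurrences if the two words were statistically independent.
    expected = w1_freq * w2_freq / corpus_size
    return math.log2(pair_freq / expected)


# Hypothetical counts for a two-word expression in a one-million-word corpus.
mi = mutual_information(pair_freq=100, w1_freq=1_000, w2_freq=2_000,
                        corpus_size=1_000_000)
passes_threshold = mi > 3  # the selection criterion described above
print(round(mi, 2), passes_threshold)
```

In this example the pair occurs 50 times more often than chance would predict, which corresponds to an MI of log2(50), comfortably above 3. A value near zero, as reported for ‘bear with,’ indicates that the words co-occur only about as often as chance would predict.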

4. Instrument

1) Prior Knowledge of Participants

In the first session, learners’ general English proficiency was assessed using the Cambridge English Language Assessment (CELA), and pretests on MWEs were administered. Previous studies have shown that proficiency is a critical factor in comprehending audio-visual input and acquiring MWEs, as learners are unlikely to focus on target items without understanding the content (Laufer, 2006; Puimège & Peters, 2020). Because prior knowledge of target MWEs influences learning outcomes, both proficiency and prior MWE knowledge were measured two weeks before the treatment to account for learner-related factors (Puimège & Peters, 2020; Pujadas & Muñoz, 2020).

2) Testing MWEs

This study assessed learners’ incidental acquisition of MWEs using two tests: form recall and form recognition. The pilot study focusing on form recall of MWEs with low-proficiency learners did not yield significant results, likely due to a floor effect. The current study therefore incorporated both form recall and recognition tests to better capture learners’ gains (Jelani & Boers, 2018), administered as pretests, immediate posttests, and delayed posttests. The tests were constructed based on Choi (2017) and Pellicer-Sánchez (2017), with items drawn from the audio-visual input to align with participants’ A2-level proficiency and maximize contextual relevance (Schmitt, 2010). The MWE recall tests used gap-filling questions with L1 translations of the sentences and initial cues (see Figure 1).
Previous research indicates that productive (recall) and receptive (recognition) vocabulary knowledge move in the same direction (Ullah et al., 2024; Webb, 2008; Yilmaz & Kavanoz, 2025; Zheng, 2009; Zhou, 2010). Form recognition has been adopted to measure incidental MWE acquisition for EFL learners (Dang et al., 2022a; Pu et al., 2024). The form recognition test used multiple-choice items with one correct answer and three distractors in a given context. Participants also indicated how well they knew the MWE they had chosen. An example item is shown in Figure 2.
The same tests were administered over three sessions: pretests, immediate posttests, and delayed posttests, with a 3- to 4-week interval. To minimize the testing effect, the items were arranged in a different order for each test. The immediate and delayed posttests were administered a month apart to measure long-term retention. Participants were not given feedback or allowed to ask questions or consult external resources between tests.

3) Questionnaire

The questionnaire, adapted from previous studies (Puimège & Peters, 2019; Wi, 2021), explored participants’ viewing habits and their perceptions of incidental learning, the content, and the language modes. A 4-point Likert scale was adopted to avoid a neutral midpoint and ensure reliability and validity (Chang, 1994). The questions were written in the L1 to elicit accurate responses, with the scale defined as 1 = disagree, 2 = somewhat disagree, 3 = somewhat agree, and 4 = agree.

4) Scoring

Scoring the MWE recall test was straightforward. Each item was given 1 point for a completely correct answer. A partial score (i.e., 0.5) was assigned when all the words were supplied but the response contained a minor mistake, such as a single-letter misspelling with similar pronunciation (e.g., “sattle down” [sic]). The maximum score for each recall test was 16 points, excluding nontarget MWEs. The MWE recognition test consisted of 16 multiple-choice questions, each scored as 1 point, for a total of 16 points. In the listening comprehension test, each of the 11 questions was scored as 1 point without partial scores, resulting in a total of 11 points.
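The recall scoring rule can be approximated in code. The sketch below is a hypothetical operationalization, not the procedure used in the study: it treats a "minor mistake" as a response with the same number of words as the target and exactly one character edit. The criterion of similar pronunciation is not captured here and would still require a human rater's judgment. The function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def score_recall(response, target):
    """1 for an exact match, 0.5 for a one-letter slip, otherwise 0."""
    resp = " ".join(response.lower().split())
    targ = " ".join(target.lower().split())
    if resp == targ:
        return 1.0
    # Assumed partial-credit rule: all words supplied, one character off.
    same_word_count = len(resp.split()) == len(targ.split())
    if same_word_count and edit_distance(resp, targ) == 1:
        return 0.5
    return 0.0


print(score_recall("sattle down", "settle down"))
```

Under this rule, the misspelling cited above would receive partial credit, while a different expression or a blank response would receive none.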

5) Procedure

Figure 3 presents an overview of the study, which consisted of three sessions: pre-treatment, treatment, and posttreatment. The main data, apart from the semi-structured interviews, were collected across these three sessions. During the first session, participants completed a proficiency test (CELA) and the MWE pretests. The second session involved the intervention, during which participants viewed the video under one of three on-screen language modes. Immediately after viewing, the MWE recall immediate posttest was administered, followed by a questionnaire and a listening comprehension test, which were included to reduce testing effects and gather additional insights. The questionnaire and listening comprehension test were not analyzed in the current study. The MWE recognition test was also conducted in this session. In the final session, participants completed both MWE tests again. Afterwards, selected participants took part in in-depth interviews.

6) Data Analysis

To check group equivalence, descriptive analyses and ANOVA were conducted. The first research question investigated the acquisition of MWE form recognition in three captioning modes. A two-way repeated-measures ANOVA was employed to analyze the MWE recognition tests, with Time as a within-subject factor and Group (caption mode) as a between-subject factor.
The second research question explored the effect of different language modes on form recall. One-way ANOVA was conducted to analyze the impact of the three captioning types on the MWE form recall tests. Time (i.e., pretest, immediate posttest, and delayed posttest) was also designated as a within-subject factor.

IV. RESULTS

1. Does Viewing a Video in Different Captioning Modes Affect the Incidental Recognition of MWEs’ Form?

MWE recognition was assessed at pretest, immediate posttest (3-week interval), and delayed posttest (one month later) to examine both learning gains and retention (see Table 5). Pretest scores were low and comparable across groups: No Caption (M = 2.37, SD = 2.36), Whole Caption (M = 2.61, SD = 2.03), and MWE-Enhanced Caption (M = 3.06, SD = 2.46). After viewing, all groups improved, with the MWE-Enhanced Caption group showing the highest immediate posttest scores (M = 5.50, SD = 3.20), followed by Whole Caption (M = 5.39, SD = 2.28) and No Caption (M = 4.19, SD = 3.31). Delayed posttest scores declined slightly but remained above pretest scores, with the MWE-Enhanced Caption group still retaining the highest mean (M = 4.44, SD = 3.33).
Figure 4 presents the estimated marginal means of MWE form recognition for each group at pretest, immediate posttest, and delayed posttest. All groups scored higher on both posttests than on the pretest. The Enhanced Caption group had the highest scores on both the immediate and the delayed posttest, followed by the Whole Caption group.
A repeated-measures ANOVA showed a significant main effect of Time, F(2, 94) = 17.23, p < .001, indicating improvement over time. The Time x Group interaction was not significant, F(4, 94) = .35, p = .85, and between-group differences were non-significant, F(2, 47) = .60, p = .55. These findings suggest that while all participants benefited from audio-visual input, the differences between caption modes were not statistically significant.
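The degrees of freedom reported above follow directly from the design of 50 participants, three groups, and three test times. The sketch below derives them as a consistency check, assuming a standard groups-by-time mixed ANOVA without a sphericity correction. The function name `mixed_anova_dfs` is hypothetical.

```python
def mixed_anova_dfs(n_participants, n_groups, n_times):
    """Degrees of freedom (effect, error) for a groups x time mixed ANOVA."""
    df_group = n_groups - 1
    # Between-subjects error: participants nested within groups.
    df_subjects_within_groups = n_participants - n_groups
    df_time = n_times - 1
    df_interaction = df_group * df_time
    # Within-subjects error, shared by the Time and Time x Group tests.
    df_within_error = df_subjects_within_groups * df_time
    return {
        "group": (df_group, df_subjects_within_groups),
        "time": (df_time, df_within_error),
        "time_x_group": (df_interaction, df_within_error),
    }


dfs = mixed_anova_dfs(n_participants=50, n_groups=3, n_times=3)
for effect, (df_effect, df_error) in dfs.items():
    print(f"{effect}: F({df_effect}, {df_error})")
```

The resulting pairs, (2, 94) for Time, (4, 94) for the interaction, and (2, 47) for Group, agree with the values reported in the text.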

2. Does Viewing a Video in Different Captioning Modes Affect the Incidental Recall of MWEs’ Form?

MWE form recall was measured using pretests, immediate posttests, and delayed posttests under different on-screen language modes. The descriptive analysis presented in Table 6 shows that all groups began with similar levels of MWE form recall knowledge at pretest. Mean scores were comparably low across groups: No Caption (M = .47, SD = 1.20), Whole Caption (M = .33, SD = .59), and MWE-Enhanced Caption (M = .31, SD = .60). Following the viewing, all groups showed slight improvement on the immediate posttest. The MWE-Enhanced Caption group showed the highest mean (M = 1.00, SD = 1.52), followed by No Caption (M = .81, SD = 1.55) and Whole Caption (M = .50, SD = .84). Delayed posttest scores indicated slight retention across captioning types, with the No Caption group (M = .81, SD = 1.83) outperforming the caption groups (Whole Caption: M = .42, SD = .62; Enhanced Caption: M = .69, SD = .99). Overall, incidental gains in MWE form recall were limited, suggesting that although Enhanced Captions may facilitate immediate learning of MWEs, retention over time appears inconsistent.
Figure 5 presents the estimated marginal means of MWE form recall for each group at pretest, immediate posttest, and delayed posttest. All groups scored low on the pretest. The Enhanced Caption group outperformed the other groups after the intervention, but the No Caption group retained the most form recall knowledge on the delayed posttest, which was unexpected. The difference between the caption and no-caption conditions was not statistically significant.
A repeated-measures ANOVA revealed a significant effect of Time on MWE recall, F(2, 94) = 5.56, p = .005, indicating that participants’ recall scores improved over time. Consistent with the MWE recognition findings, recall improved after audiovisual input, but differences between caption conditions were not statistically significant. These results indicate that varying on-screen language modes did not significantly influence MWE learning for beginner-level learners.

V. DISCUSSION AND IMPLICATION

The present study investigated incidental learning of MWEs through different on-screen language modes in Korean EFL novices. Form recognition of MWEs was examined under three captioning conditions: No Caption, Whole Caption, or MWE-Enhanced Caption. All groups showed moderate gains in immediate and delayed post-tests. Although differences were not statistically significant, both caption groups outperformed No Caption group, with the MWE-Enhanced Caption achieving the highest scores. These findings align with previous studies on the benefit of captions (Majuddin et al., 2021; Pu et al., 2024; Teng, 2019) but contrast with Dang et al. (2022a) and Puimège and Peters (2020), who found higher gains for learners without captions. Overall, results suggest that captions may facilitate MWE recognition, although empirical evidence for MWE-enhanced captions remains limited. Unlike this present study, Teng (2019) reported that whole captions were the strongest predictor of MWE learning; however, similar to the present study, no significant differences were found among no captions, keyword captions, and full captions. Although learners did not achieve good results for the targeted MWEs in form recognition tests, their improvement on untargeted items suggests that audio-visual input can promote incidental learning beyond.
For form recall, gains were modest. The MWE-Enhanced Caption group performed best in the immediate posttest, while the No Caption group outperformed the Whole Caption group. These results partially support Puimège and Peters (2020), who indicated that participants scored better results without captions, and align with Majuddin et al. (2021), who reported the benefits of enhanced captions for form recall of MWEs. Although audio-visual input has been examined in incidental MWE acquisition, there remains insufficient empirical research on the specific effects of MWE-enhanced captions. The findings may suggest that participants with low proficiency were not enable to understand or process MWEs through single watching. In addition, their lower English proficiency may affect the learning of MWEs (Pujadas & Muñoz, 2019). The survey of viewing habits presented the evidence that almost half of the participants did not recognize MWEs. Many of them suggested repeated watching would be a better solution to learn MWEs. The present study contributes to understanding differences in MWE development by exploring both receptive and productive dimensions of MWEs. Consistent with the pilot study prediction, participants in this study showed greater receptive knowledge (form recognition) than productive one (form recall). The findings suggested pedagogically that audiovisual input can support MWE acquisition and fluency development if coupled with onscreen languages and MWE test types.
Skill Acquisition Theory (DeKeyser, 2020) holds that declarative knowledge (form recognition) serves as a foundation for procedural knowledge (form recall). Learners may first develop MWE knowledge by noticing and storing MWEs. In particular, audio-visual input provides authentic contexts and combines sound, captions, and images, which can facilitate learners’ noticing of MWEs. Repeated exposure can also accelerate the progression from recognition to recall, which aligns with Majuddin et al. (2021). Finally, learners may move toward implicit, automatic MWE retrieval and production. Pedagogically, audiovisual input can effectively support MWE learning and, in turn, fluency when it is carefully paired with captioning types. L2 classroom activities and tasks may be more beneficial if they are designed around the stages of knowledge development, moving from form recognition tasks (e.g., noticing MWEs) to form recall tasks (e.g., writing MWEs). Integrating both kinds of activity may make it easier for learners to develop fluent use of MWEs.
This study has several limitations. First, the sample size was small and covered a limited range of proficiency. Second, the content topic was not selected with participants’ needs in mind; a more engaging topic might elicit greater attention and learning. Third, the length and difficulty of the audiovisual input were not systematically examined. Future research on MWE acquisition should address these limitations to provide more generalizable and valuable results.

Fig. 1. MWEs Form Recall Test
Fig. 2. MWEs Form Recognition Test
Fig. 3. The Flow of MWE Study
Fig. 4. Estimated Marginal Means of MWE Recognition
Fig. 5. Estimated Marginal Means of Measure for MWEs Form Recall
Table 1.
Proficiency Level Test (CELA, Perfect Score 25 Points)
Mode              n   M     SD    Minimum  Maximum
No Caption        16  8.38  2.65  5        14
Whole Caption     18  8.50  2.12  5        14
Enhanced Caption  16  9.88  2.63  5        14
Total             50  8.90  2.51  5        14

Note. CELA = Cambridge English Language Assessment.

Table 2.
Descriptives on MWE Recognition Pretest
Mode              n   M     SD    Minimum  Maximum
No Caption        16  2.38  2.36  0        7
Whole Caption     18  2.61  2.03  0        6
Enhanced Caption  16  3.06  2.46  0        7
Total             50  2.68  2.25  0        7

Note. The total score of form recognition is 16.

Table 3.
Descriptives on MWE Recall Pretest
Mode              n   M    SD    Minimum  Maximum
No Caption        16  .47  1.20  0        4.5
Whole Caption     18  .33  .59   0        2.0
Enhanced Caption  16  .31  .60   0        2.0
Total             50  .37  .83   0        4.5

Note. The total score of form recall is 16.

Table 4.
Target Item With Corpus Frequency & MI Score and Occurrence in the Input
No.  MWE                COCA frequency  MI           Occurrence
1    Bear with          795             0.35         1
2    Chicken out        455             4.02         3
3    Go out (went out)  41530 (18307)   4.39 (4.95)  2
4    Freak out          8939            5.44         1
5    Plan out           344             4.43         2
6    Pull over          7469            3.16         1
7    Screw up           10561           4.6          1
8    Settle down        7229            4.15         2
9    Slip out           754             4.12         1
10   A long shot        2918            3.38         1
11   All of a sudden    12095           7.72         1
12   By the way         29897           5.35         2
13   Drink a toast      133             4.37         2
15   Mark one’s words   858             2.51         1
16   Give a speech      6537            3.06         1

Note. MI = mutual information.
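The article does not state how the MI scores in Table 4 were computed. As a rough illustration only, the sketch below uses a common corpus-linguistic definition of MI: the base-2 logarithm of the observed co-occurrence frequency divided by the frequency expected by chance. The function name, the `span` parameter, and the example counts are all assumptions made for this illustration; they are not taken from COCA or from the study.

```python
import math


def mutual_information(pair_freq, freq_a, freq_b, corpus_size, span=1):
    """Return an MI score in bits: log2(observed / expected co-occurrence).

    Assumed definition (not confirmed by the article):
    expected = freq_a * freq_b * span / corpus_size
    """
    if min(pair_freq, freq_a, freq_b, corpus_size, span) <= 0:
        raise ValueError("all counts must be positive")
    expected = freq_a * freq_b * span / corpus_size
    return math.log2(pair_freq / expected)


# Hypothetical counts for illustration only; not COCA data and not Table 4 items.
example_mi = mutual_information(
    pair_freq=100, freq_a=1_000, freq_b=2_000, corpus_size=1_000_000
)
print(round(example_mi, 2))
```

Under this definition, a higher score means the two words co-occur more often than their individual frequencies alone would predict.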

Table 5.
MWEs Recognition Tests (Pre-, Immediate, and Delayed)
Time                   Group             n   M     SD
Pre- Recognition       No Caption        16  2.37  2.36
Pre- Recognition       Whole Caption     18  2.61  2.03
Pre- Recognition       Enhanced Caption  16  3.06  2.46
Pre- Recognition       Total             50  2.68  2.25
Immediate Recognition  No Caption        16  4.19  3.31
Immediate Recognition  Whole Caption     18  5.39  2.28
Immediate Recognition  Enhanced Caption  16  5.50  3.20
Immediate Recognition  Total             50  5.04  2.94
Delayed Recognition    No Caption        16  3.63  3.99
Delayed Recognition    Whole Caption     18  3.83  2.89
Delayed Recognition    Enhanced Caption  16  4.44  3.33
Delayed Recognition    Total             50  3.96  3.36
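As a descriptive aid, the raw mean gains implied by Table 5 can be computed by subtracting each group's pretest mean from its immediate and delayed posttest means. This is a minimal sketch using only the reported group means; it is not the authors' analysis (the article reports estimated marginal means and significance tests), and the names `recognition_means` and `mean_gains` are introduced here purely for illustration.

```python
# Group means copied from Table 5 (form recognition, maximum score 16).
recognition_means = {
    "No Caption": {"pre": 2.37, "immediate": 4.19, "delayed": 3.63},
    "Whole Caption": {"pre": 2.61, "immediate": 5.39, "delayed": 3.83},
    "Enhanced Caption": {"pre": 3.06, "immediate": 5.50, "delayed": 4.44},
}


def mean_gains(means):
    """Return posttest-minus-pretest differences, rounded to two decimals."""
    return {
        "immediate": round(means["immediate"] - means["pre"], 2),
        "delayed": round(means["delayed"] - means["pre"], 2),
    }


gains = {group: mean_gains(m) for group, m in recognition_means.items()}
for group, gain in gains.items():
    print(f"{group}: immediate +{gain['immediate']:.2f}, delayed +{gain['delayed']:.2f}")
```

Because these are differences of group means rather than per-participant gain scores, they say nothing about variability or statistical significance.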
Table 6.
MWEs Form Recall Tests
Time                   Group             n   M     SD
Pre- Form Recall       No Caption        16  .47   1.20
Pre- Form Recall       Whole Caption     18  .33   .59
Pre- Form Recall       Enhanced Caption  16  .31   .60
Pre- Form Recall       Total             50  .37   .83
Immediate Form Recall  No Caption        16  .81   1.55
Immediate Form Recall  Whole Caption     18  .50   .84
Immediate Form Recall  Enhanced Caption  16  1.00  1.52
Immediate Form Recall  Total             50  .76   1.31
Delayed Form Recall    No Caption        16  .81   1.83
Delayed Form Recall    Whole Caption     18  .42   .62
Delayed Form Recall    Enhanced Caption  16  .69   .99
Delayed Form Recall    Total             50  .63   1.22

REFERENCES

Anthony, L. (2024). AntConc (Version 4.3.1) [Computer Software]. Waseda University. https://www.laurenceanthony.net/software/AntConc.
Boers, F., Demecheleer, M., Coxhead, A., & Webb, S. (2014). Gauging the effects of exercises on verb-noun collocations. Language Teaching Research, 18(1), 54-74. https://doi.org/10.1177/1362168813505389.
Boers, F., & Lindstromberg, S. (2012). Experimental and intervention studies on formulaic sequences in a second language. Annual Review of Applied Linguistics, 32, 83-110. https://doi.org/10.1017/S0267190512000050.
Chang, A., Newton, J., & Webb, S. (2013). Incidental learning of collocation. Language Learning, 63(1), 91-120. https://doi.org/10.1111/j.1467-9922.2012.00729.x.
Chang, L. (1994). A psychometric evaluation of 4-point and 6-point Likert-type scales in relation to reliability and validity. Applied Psychological Measurement, 18(3), 205-215. https://doi.org/10.1177/014662169401800302.
Choi, S. (2017). Processing and learning of enhanced English collocations: An eye movement study. Language Teaching Research, 21(3), 403-426. https://doi.org/10.1177/1362168816653271.
Dang, T. N. Y., Lu, C., & Webb, S. (2022a). Incidental learning of collocations in an academic lecture through different input modes. Language Learning, 72(3), 728-764. https://doi.org/10.1111/lang.12499.
Dang, T. N. Y., Lu, C., & Webb, S. (2022b). Incidental learning of single words and collocations through viewing an academic lecture. Studies in Second Language Acquisition, 44(3), 708-736. https://doi.org/10.1017/S0272263121000474.
DeKeyser, R. (2020). Skill acquisition theory. In B. VanPatten & J. Williams (Eds.), Theories in second language acquisition (3rd ed., pp. 83-104). Lawrence Erlbaum Associates Publishers. https://scholar.google.com/citations?view_op=view_citation&hl=en&user=DShIY28AAAAJ&citation_for_view=DShIY28AAAAJ:Tyk-4Ss8FVUC.
Fernandez, B. G., & Schmitt, N. (2015). How much collocation knowledge do L2 learners have? The effects of frequency and amount of exposure. ITL - International Journal of Applied Linguistics, 166(1), 94-126. https://doi.org/10.1075/itl.166.1.03fer.
Foster, P., Bolibaugh, C., & Kotula, A. (2014). Knowledge of nativelike selections in a L2. Studies in Second Language Acquisition, 36(1), 101-132. https://doi.org/10.1017/S0272263113000624.
Fryman, P. (Director). (2005, September 19). Pilot (Season 1, Episode 10) [TV series episode]. In C. Bays & C. Thomas (Creators), How I met your mother. Thomas Productions; CBS Productions.
Gass, S., Winke, P., Isbell, D. R., & Ahn, J. (2019). How captions help people learn language: A working-memory, eye-tracking study. Language Learning & Technology, 23(2), 84-104. https://doi.org/10.64152/10125/44684.
Hallin, A. E., & Sidtis, D. V. L. (2015). A closer look at formulaic language: Prosodic characteristics of Swedish proverbs. Applied Linguistics, 38(1), 68-89. https://doi.org/10.1093/applin/amu078.
Jelani, N. A. M., & Boers, F. (2018). Examining incidental vocabulary acquisition from captioned video: Does test modality matter? ITL - International Journal of Applied Linguistics, 169(1), 169-190. https://doi.org/10.1075/itl.00011.jel.
Krashen, S. (1985). The input hypothesis: Issues and implications. Longman. https://www.scribd.com/document/284431055/The-Input-Hypothesis.
Laufer, B. (2006). Comparing focus on form and focus on forms in second-language vocabulary learning. The Canadian Modern Language Review, 63(1), 149-166. https://doi.org/10.3138/cmlr.63.1.149.
Lee, S. (2025). The relationship between receptive and productive knowledge of L2 English collocations. International Journal of Applied Linguistics, 35(1), 109-133. https://doi.org/10.1111/ijal.12605.
Lin, M. P. (2018). The prosody of formulaic sequences: A corpus and discourse approach. Bloomsbury Publishing. https://doi.org/10.5040/9781474205627.
Lindgren, E., & Muñoz, C. (2012). The influence of exposure, parents, and linguistic distance on young European learners’ foreign language comprehension. International Journal of Multilingualism, 10(1), 1-25. https://doi.org/10.1080/14790718.2012.679275.
Macis, M., Sonbul, S., & Alharbi, R. (2021). The effect of spacing on incidental and deliberate learning of L2 collocations. System, 103, Article 102649. https://doi.org/10.1016/j.system.2021.102649.
Majuddin, E., Siyanova-Chanturia, A., & Boers, F. (2021). Incidental acquisition of multiword expressions through audiovisual materials: The role of repetition and typographic enhancement. Studies in Second Language Acquisition, 43(5), 985-1008. https://doi.org/10.1017/S0272263121000036.
Montero Pérez, M., & Rodgers, M. P. H. (2019). Video and language learning. The Language Learning Journal, 47(4), 403-406. https://doi.org/10.1080/09571736.2019.1629099.
Montero Pérez, M., Van Den Noortgate, W., & Desmet, P. (2013). Captioned video for L2 listening and vocabulary learning: A meta-analysis. System, 41(3), 720-739. https://doi.org/10.1016/j.system.2013.07.013.
Montero Pérez, M., & Webb, S. (2018). Incidental vocabulary acquisition through viewing L2 television and factors that affect learning. Studies in Second Language Acquisition, 40(3), 551-577. https://doi.org/10.1017/S0272263117000407.
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.). Cambridge University Press. https://doi.org/10.2307/747835.
Nguyen, T. M. H., & Webb, S. (2017). Examining second language receptive knowledge of collocation and factors that affect learning. Language Teaching Research, 21(3), 298-320. https://doi.org/10.1177/1362168816639619.
Pellicer-Sánchez, A. (2017). Learning L2 collocations incidentally from reading. Language Teaching Research, 21(3), 381-402. https://doi.org/10.1177/1362168815618428.
Peters, E. (2018). The effect of out-of-class exposure to English language media on learners’ vocabulary knowledge. ITL - International Journal of Applied Linguistics, 169(1), 142-168. https://doi.org/10.1075/itl.00010.pet.
Peters, E., Noreillie, A.-S., Heylen, K., Bulte, B., & Desmet, P. (2019). The impact of instruction and out-of-school exposure to foreign language input on learners’ vocabulary knowledge in two languages. Language Learning, 69(3), 747-782. https://doi.org/10.1111/lang.12351.
Peters, E., & Webb, S. (2018). Incidental vocabulary acquisition through viewing L2 television and factors that affect learning. Studies in Second Language Acquisition, 40(3), 1-27. https://doi.org/10.1017/S0272263117000407.
Pu, P., Chang, D. Y., & Wang, S. (2024). Incidental learning of collocations through different multimodal input: The role of learners’ initial L2 proficiency. System, 125, Article 103416. https://doi.org/10.1016/j.system.2024.103416.
Puimège, E., Perez, M. M., & Peters, E. (2021). Promoting L2 acquisition of multiword units through textually enhanced audiovisual input: An eye-tracking study. Second Language Research, 39(2), 1-40. https://doi.org/10.1177/02676583211049741.
Puimège, E., & Peters, E. (2019). Learning L2 vocabulary from audiovisual input: An exploratory study into incidental learning of single words and formulaic sequences. The Language Learning Journal, 47(4), 424-438. https://doi.org/10.1080/09571736.2019.1638630.
Puimège, E., & Peters, E. (2020). Learning formulaic sequences through viewing L2 television and factors that affect learning. Studies in Second Language Acquisition, 42(3), 525-549. https://doi.org/10.1017/S027226311900055X.
Pujadas, G. (2019). Language learning through extensive TV viewing: A study with adolescent EFL learners. (Publication No. 146118) [Doctoral dissertation, University of Barcelona]. https://diposit.ub.edu/dspace/bitstream/2445/146118/1/GPJ_1de2.pdf.
Pujadas, G., & Muñoz, C. (2019). Extensive viewing of captioned and subtitled TV series: A study of L2 vocabulary learning by adolescents. The Language Learning Journal, 47(4), 479-496. https://doi.org/10.1080/09571736.2019.1616806.
Pujadas, G., & Muñoz, C. (2020). Examining adolescent EFL learners’ TV viewing comprehension through captions and subtitles. Studies in Second Language Acquisition, 42, 551-575. https://doi.org/10.1017/S0272263120000042.
Saito, K., & Liu, Y. (2022). Roles of collocation in L2 oral proficiency revisited: Different tasks, L1 vs. L2 raters, and cross-sectional vs. longitudinal analyses. Second Language Research, 38(3), 531-554. https://doi.org/10.1177/0267658320988055.
Schmitt, N. (2010). Researching vocabulary: A vocabulary research manual. Palgrave Macmillan. https://doi.org/10.1057/9780230293977.
Shin, D., Lee, J. H., & Choi, W. (2023). An exploratory study of young EFL learners’ aural and written receptive multiword unit knowledge. System, 114, Article 103029. https://doi.org/10.1016/j.system.2023.103029.
Siyanova-Chanturia, A., & Pellicer-Sánchez, A. (2019). Formulaic language: Setting the scene. In A. Siyanova-Chanturia & A. Pellicer-Sánchez (Eds.), Understanding formulaic language: A second language acquisition perspective (pp. 1-15). Routledge. https://doi.org/10.4324/978131526615.
Siyanova-Chanturia, A., & Sidtis, D. V. L. (2019). What online processing tells us about formulaic language. Routledge. https://www.taylorfrancis.com/chapters/edit/10.4324/9781315206615-3/online-processing-tellsus-formulaic-language-anna-siyanova-chanturia-diana-van-lancker-sidtis.
Suarez, M. M., & Gesa, F. (2019). Learning vocabulary with the support of sustained exposure to captioned video: Do proficiency and aptitude make a difference? The Language Learning Journal, 47(4), 497-517. https://doi.org/10.1080/09571736.2019.1617768.
Tavakoli, P., & Uchihara, T. (2020). To what extent are multiword sequences associated with oral fluency? Language Learning, 72(2), 506-547. https://doi.org/10.1111/lang.12384.
Teng, M. F. (2019). The effects of video caption types and advance organizers on incidental L2 collocation learning. Computers & Education, 142, Article 103655. https://doi.org/10.1016/j.compedu.2019.103655.
Teng, M. F. (2022). Incidental L2 vocabulary learning from viewing captioned videos: Effects of learner-related factors. System, 105, Article 102736. https://doi.org/10.1016/j.system.2022.102736.
Ullah, I., Kim, S., & Ibtissam, A. (2024). Measuring English receptive and productive vocabulary of Pakistani university students across frequency levels. Korean Journal of English Language and Linguistics, 24, 708-734. https://doi.org/10.15738/kjell.24..202407.708.
VanPatten, B. (2007). Input processing in adult second language acquisition. In B. VanPatten & J. Williams (Eds.), Theories in second language acquisition: An introduction (pp. 115-135). Lawrence Erlbaum Associates Publishers. https://psycnet.apa.org/record/2006-20180-007.
Vu, D. V., Noreillie, A.-S., & Peters, E. (2023). Incidental collocation learning from reading-while-listening and captioned TV viewing and predictors of learning gains. Language Teaching Research, 1-34. https://doi.org/10.1177/13621688221151048.
Webb, S. (2008). Receptive and productive vocabulary sizes of L2 learners. Studies in Second Language Acquisition, 30(1), 79-95. https://doi.org/10.1017/S0272263108080042.
Webb, S., & Chang, A. C.-S. (2022). How does mode of input affect the incidental learning of collocation? Studies in Second Language Acquisition, 44(1), 35-56. https://doi.org/10.1017/S0272263120000297.
Yilmaz, S., & Kavanoz, S. (2025). Measuring the effect of receptive and productive vocabulary size on foreign language skills. Porta Linguarum, 44, 29-45. https://doi.org/10.30827/portalin.vi44.31870.
Yu, X., Boers, F., & Tremblay, P. (2025). Learning multiword items through dictation and dictogloss: How task performance predicts learning outcomes. Language Teaching Research, 29(6), 2658-2678. https://doi.org/10.1177/13621688221117242.
Zheng, Y. (2009). Exploring Chinese EFL learners’ receptive and productive vocabulary knowledge: Implications for EFL vocabulary teaching. Journal of Asia TEFL, 6(1), 163-188. https://www.researchgate.net/publication/313026143.
Zhou, S. (2010). Comparing receptive and productive academic vocabulary knowledge of Chinese EFL learners. Asian Social Science, 6(10), 14-19. http://doi.org/10.5539/ass.v6n10p14.



Copyright © 2026 by The Society for Teaching English through Media.
