Augmented Quality Assessment (AQuA)

We are currently piloting AQuA in a small number of Bible Translation field trials. In early 2024, AQuA will be available to all interested users. If you would like to try out AQuA, please fill out this contact form to receive further instructions once the tool is ready for general use.

What is AQuA?

AQuA is a copilot tool developed by SIL International that seeks to build capacity and increase the thoroughness of translation quality assurance. It harnesses artificial intelligence (AI) techniques to assist translation teams in objectively assessing multiple facets of translation quality, enhancing and complementing existing translation work. AQuA produces an increasingly detailed suite of quality-related diagnostics, offering several augmented quality assessment methods in the areas of accuracy, naturalness, and clarity.

To this end, AQuA is creating:

  1. an early warning system for translators, which will alert them to obvious problems so they can address them earlier in the translation process;
  2. power tools for translation consultants, which will allow them to do their checking work more thoroughly and more consistently;
  3. an equalizer for administrators and strategists, which will allow them to compare and evaluate methodologies and products on an equal footing (including new AI-based translation methodologies proposed by TBTA, SIL, and Avodah).

By identifying anomalous passages in a draft, AQuA accelerates the quality assessment process by helping:

  • translators make corrections (and produce an improved draft) prior to consultant checks;
  • consultants gain an “at-a-glance” understanding of quality across a draft and quickly dig into the relevant, granular quality issues; and
  • administrators and project managers track quality across projects to guide strategic allocations of checking resources.

Our hope is that AQuA will empower translation teams in the quality checking process, helping them more easily spot obvious errors so they can focus more of their precious time and attention on the parts of a draft that really need their expertise.

AQuA as an AI Copilot

Recently, there has been an explosion of interest in AI-driven tools that are paired with human workers as “copilots.” These include copilots that help users write code, find sales leads, draft legal documents, generate design ideas, create lesson plans, and much more.

The use of the term copilot in these instances emphasizes two important points:

  1. A person is in control of the process (i.e., they are the pilot)
  2. The person's work is enhanced when it is assisted by AI-driven tooling (in terms of efficiency, thoroughness, creativity, or some other key performance indicator)

In Bible translation, there are individuals (like translators and consultants) who undergo years of specialized training to diagnose quality issues in translation drafts. Computational methods such as AQuA are increasingly able to help these translation experts locate anomalies and issues in translation drafts, in the areas of accuracy, naturalness, and clarity, that would likely not have been found otherwise. However, AQuA will still fail to identify certain edge-case quality issues in those same areas. Furthermore, AQuA is not designed to diagnose issues related to many of the other important threads of the multithreaded translation quality fabric. For example, a tool like AQuA is of no help in assessing the beauty or acceptability of a translation.

Therefore, AQuA acts as a copilot, helping stakeholders in the translation process to more efficiently and thoroughly evaluate certain qualities of a draft translation.

It is important to note that when a copilot like AQuA augments a translation team's capabilities, the focus is not primarily on speeding up the work, even though that is often a nice bonus. The real focus is on improving the quality of what the translation team is able to do in the allotted amount of time.

What does AQuA Measure?

The quality of a translation is not a singular entity; rather, it is best approached as a whole set of qualities that make up the warp and woof of a complex fabric. With AQuA, we are not attempting to come up with a singular estimate of quality, but only to assess a variety of component qualities. Even at that, we are not attempting to assess every quality, but only a selection of qualities that are within reach of AI.

With AQuA, we do not purport to assess these abstract qualities directly; we deal only with qualities that can be measured algorithmically and expressed as a number. These measurable qualities are never an exact match for one of the abstract qualities (such as accuracy, naturalness, or clarity); instead, they work as proxies for key aspects of those qualities. For example, the AQuA metrics for Word Correspondence and Semantic Similarity are both quantifiable measures that relate to the accuracy of a translation, but neither gives a full picture of it.

Methods of Assessment

A list of AQuA assessments is provided below, grouped under the abstract qualities of accuracy, naturalness, and clarity:

Accuracy

Word Correspondence

The word correspondence assessment gives insight into how closely words align between two translations. Lower scores indicate less "word-for-wordness", while higher scores indicate more "word-for-wordness". Scores are calculated for each verse.

How it works:

Word correspondence scores are a measure of AQuA’s confidence in its ability to align functionally equivalent words between two translations. Although the user may select any two translations on which to run a word correspondence assessment, the most common use case is comparing a draft translation to a reference or model translation.

The word correspondence algorithm works in several steps. First, we use a variety of computational techniques in an attempt to align corresponding words between the two texts.

We use several alignment algorithms for this step. All of these algorithms train only on the Bible texts being analyzed for the word correspondence assessment; in other words, no pretraining data is required.

Each of these algorithms yields a score between 0 and 1 for every possible word alignment pair in each verse, representing the algorithm's confidence in that alignment.

The word correspondence scores are calculated by averaging the scores produced by the algorithms above for each word pair and then, for each word, selecting its most likely alignment (i.e., the alignment pair with the highest confidence score). Verse-level scores are calculated by averaging these highest-confidence word pairs across the entire verse. We then multiply the scores by 10 to produce a final score between 0 and 10.
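
For illustration, the aggregation described above might look something like the following sketch. The per-pair confidence scores are assumed to come from the alignment algorithms; the function name, data structures, and example values are hypothetical and serve only to demonstrate the averaging, best-alignment selection, and scaling steps.

```python
from collections import defaultdict

def verse_word_correspondence(pair_scores_per_algorithm):
    """Aggregate alignment confidences into a 0-10 verse score (illustrative).

    pair_scores_per_algorithm: list of dicts, one per alignment algorithm,
    mapping (source_word_index, target_word_index) -> confidence in [0, 1].
    """
    # 1. Average the confidence each algorithm assigns to every candidate pair.
    pooled = defaultdict(list)
    for scores in pair_scores_per_algorithm:
        for pair, confidence in scores.items():
            pooled[pair].append(confidence)
    averaged = {pair: sum(vals) / len(vals) for pair, vals in pooled.items()}

    # 2. For each source word, keep only its most likely alignment.
    best_per_word = {}
    for (src, tgt), confidence in averaged.items():
        if confidence > best_per_word.get(src, 0.0):
            best_per_word[src] = confidence

    # 3. Average those best confidences across the verse and scale to 0-10.
    if not best_per_word:
        return 0.0
    return 10 * sum(best_per_word.values()) / len(best_per_word)

# Hypothetical confidences from two alignment algorithms for a short verse.
algo_a = {(0, 0): 0.9, (0, 1): 0.2, (1, 1): 0.7}
algo_b = {(0, 0): 0.8, (0, 1): 0.1, (1, 1): 0.6}
print(verse_word_correspondence([algo_a, algo_b]))  # 7.5
```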

We tokenize words by whitespace, so this assessment currently does not work for non-whitespace languages, such as Thai. This assessment also yields more accurate results when used on isolating languages, as opposed to polysynthetic languages. Future research will focus on non-whitespace, subword, and phrase-level tokenization to improve performance on a wider variety of languages.

Semantic Similarity

The semantic similarity assessment gives insight into differences in meaning between two translations. Lower scores indicate more difference in meaning, while higher scores indicate less difference in meaning. Scores are calculated for each verse.

How it works:

Although the user may select any two translations on which to run a semantic similarity assessment, the most common use case is comparing a back translation of a draft to a reference or model translation. For best results, the two translations being compared should be in languages of wider communication.

We use Google's Language Agnostic BERT Sentence Embeddings (LaBSE) model to assign a score from 0 to 10 to each verse pair. LaBSE is pretrained on a large dataset of sentence pairs in 109 languages of wider communication. You can read more about the LaBSE model here, or read the model paper here.
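
As a rough sketch of how such a verse-level score could be computed, the snippet below uses the publicly available LaBSE checkpoint from the sentence-transformers library. The mapping of cosine similarity onto a 0-10 scale is an assumption for illustration, not AQuA's exact scoring function.

```python
from sentence_transformers import SentenceTransformer, util

# Publicly available LaBSE checkpoint (assumed stand-in for AQuA's model).
model = SentenceTransformer("sentence-transformers/LaBSE")

def semantic_similarity_score(verse_a: str, verse_b: str) -> float:
    """Return an illustrative 0-10 similarity score for a verse pair."""
    embeddings = model.encode([verse_a, verse_b], normalize_embeddings=True)
    cosine = util.cos_sim(embeddings[0], embeddings[1]).item()
    # Map cosine similarity (roughly -1..1) onto a 0-10 scale.
    return round(max(0.0, cosine) * 10, 2)

back_translation = "In the beginning God created the heavens and the earth."
reference = "When God began to create the sky and the land..."
print(semantic_similarity_score(back_translation, reference))
```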

Missing and Added Words

The missing and added words assessment provides insight into words in one translation that have no likely equivalent in another translation.

How it works:

Although the user may select any two translations on which to run a missing and added words assessment, the most common use case is comparing a draft translation to a reference or model translation. We will refer to these two texts as the "project" and "comparison" texts, respectively.

To identify potential missing words, we build on the word alignments generated during the word correspondence assessment. After running the word correspondence algorithm between the two selected translations, we run it again between the comparison translation and several other high-quality reference translations, and then compare results across these assessments.

Words in the comparison text that have no likely alignment in the project text, but do have a likely alignment in each of the other reference translations, are flagged as potentially missing from the project text.

The process for identifying potential added words in the project text is similar. We once again examine the alignments between the project and comparison texts. If a word in the project text has no likely equivalent in the comparison text, it is flagged as potentially added.
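
A simplified sketch of this flagging logic appears below. It assumes that the alignment results have already been reduced to a best-alignment confidence per word; the data structures, threshold, and example words are hypothetical and only illustrate the comparison described above.

```python
def find_missing_and_added(
    comparison_to_project,      # comparison word -> best alignment confidence in project text
    comparison_to_references,   # list of dicts: comparison word -> confidence per reference translation
    project_to_comparison,      # project word -> best alignment confidence in comparison text
    threshold=0.5,              # hypothetical cutoff for a "likely" alignment
):
    """Flag potentially missing and added words for a single verse (illustrative)."""
    # Missing: comparison words with no likely match in the project text,
    # but a likely match in every other reference translation.
    missing = [
        word
        for word, conf in comparison_to_project.items()
        if conf < threshold
        and all(ref.get(word, 0.0) >= threshold for ref in comparison_to_references)
    ]

    # Added: project words with no likely match in the comparison text.
    added = [word for word, conf in project_to_comparison.items() if conf < threshold]
    return missing, added

# Hypothetical per-word confidences for a short verse.
missing, added = find_missing_and_added(
    comparison_to_project={"shepherd": 0.1, "lord": 0.9},
    comparison_to_references=[{"shepherd": 0.8, "lord": 0.9}, {"shepherd": 0.7, "lord": 0.85}],
    project_to_comparison={"kind": 0.2, "lord": 0.9},
)
print(missing)  # ['shepherd']
print(added)    # ['kind']
```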

We tokenize words by whitespace, so this assessment currently does not work for non-whitespace languages, such as Thai. This assessment also yields more accurate results when used on isolating languages, as opposed to polysynthetic languages. Future research will focus on non-whitespace, subword, and phrase-level tokenization to improve performance on a wider variety of languages.

Naturalness

Sentence and Word Length (Lix Readability)

The sentence and word length assessment uses a modified version of the Lix readability formula as a proxy for the readability of a text. Scores are calculated for each chapter.

How it works:

Lix readability is a formula based on the word and sentence length of a text. It works better than other readability metrics, such as Flesch-Kincaid, for estimating the readability of non-English text, because those other metrics encode English-specific linguistic information.

The Lix readability formula is as follows:

% of long words in the text + average number of words per sentence in the text

where a word is classified as "long" if it exceeds a certain length (more than six letters in the original formula).

In AQuA, this formula becomes:

% of words in the chapter over 7 characters long + average number of words per sentence in the chapter
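
As a worked example, the modified formula can be computed in a few lines of code. The whitespace tokenization and the over-7-characters cutoff follow the description above; splitting sentences on basic punctuation is a simplifying assumption.

```python
import re

def lix_readability(chapter_text: str) -> float:
    """Illustrative modified Lix score for one chapter, as described above."""
    words = chapter_text.split()  # whitespace tokenization
    sentences = [s for s in re.split(r"[.!?]+", chapter_text) if s.strip()]
    if not words or not sentences:
        return 0.0

    # Words longer than 7 characters count as "long" (punctuation stripped).
    long_words = [w for w in words if len(w.strip(".,;:!?\"'")) > 7]
    percent_long = 100 * len(long_words) / len(words)
    words_per_sentence = len(words) / len(sentences)
    return percent_long + words_per_sentence

print(lix_readability("In the beginning God created the heavens and the earth."))
```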

This assessment works best for languages written with an alphabetic script. We use whitespace to tokenize the text into words. This assessment does not take into account word choice or syntax, which also factor into a text's readability.

Clarity

Comprehensibility (Coming Soon)

The comprehensibility assessment gives insight into key information that may have been omitted from a translation.

How it works:

The comprehensibility assessment draws from a dataset of question-and-answer pairs that ask about expected key information in a verse. This assessment can only be run on translations in languages of wider communication (including back translations).

We provide a verse from the selected translation as context to a question-answering model and ask the model a question from the dataset that pertains to that specific verse. We then compare the model's answer to the expected answer from the dataset. If the model's answer differs significantly from the expected answer, it suggests the possibility of missing or incomplete information within the verse.
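
Since this assessment is still in development, the sketch below is only a rough illustration of the described flow. The extractive question-answering model, the answer comparison, and the example question-answer pair are all assumptions, not AQuA's actual implementation.

```python
from transformers import pipeline

# A generic extractive question-answering model (assumed stand-in).
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def check_verse(verse_text: str, question: str, expected_answer: str) -> dict:
    """Compare the model's answer from the verse against the expected answer."""
    predicted = qa(question=question, context=verse_text)["answer"]
    # Very rough comparison; a real system would use a softer similarity measure.
    matches = (
        expected_answer.lower() in predicted.lower()
        or predicted.lower() in expected_answer.lower()
    )
    return {"predicted": predicted, "expected": expected_answer, "possible_omission": not matches}

# Hypothetical dataset entry for a single verse.
result = check_verse(
    verse_text="In the beginning God created the heavens and the earth.",
    question="What did God create?",
    expected_answer="the heavens and the earth",
)
print(result)
```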