I’ve seen many articles and rave reviews claiming that ChatGPT’s translation capabilities are on par with DeepL and Google, and sometimes even surpass them. As the founder of a company (
Should I be worried about such a strong competitor?
To compare the quality of translation, we prepared test datasets for seven language pairs:
Each test dataset contains around 2500 lines and includes sentences of various themes, lengths, styles, and formatting, to eliminate text selection bias for a specific translator.
ChatGPT recently released an API version 4 for limited access. At the moment, access is only available to previously created accounts that have already paid for version 3.5.
According to reviews, the new version has made significant advancements in terms of quality compared to version 3.5. We will check that too!
For testing, we will use two metrics: BLEU and COMET.
BLEU — a widely recognized standard for testing the quality of translation. By default, we will use the Sacre Bleu version. This version is used in the MT machine translation conference and various international competitions.
In this metric, the comparison of translations is based on the number of n-grams (combinations of words) that follow one another. The goal of the metric is to find the maximum matching combinations between the translation made by a human and the one made by a machine.
The comparison begins with clusters of four words. If there are none, it searches for three-word n-grams. If further matches are not found, it can go down to one n-gram. Points are awarded for each sequence of words (tokens) that the program finds.
The drawback of the metric is that it does not account for synonyms, and if the translation accurately conveys the thought but with different words, it will show 0.
COMET — A metric designed to solve the problem of comparing synonyms, which metrics based on the symbolic comparison of two strings cannot handle. If the result of the translation is a semantically similar phrase, but described with different words, the metric will indicate comparable results.
It should be noted that its result will also depend on the diversity of the language corpus used to construct the comparison classifier. This metric is widely used as an alternative to the BLEU metric.
Prompts we used for ChatGPT translation:
You are TranslateGPT. You translate user messages from English to Italian (Finnish / French / German / Portuguese / Russian / Spanish). You are the most accurate English to X translator in the world.
Below are graphs with test results:
This language pair shows a noticeable improvement in the translation quality of ChatGPT 4 compared to version 3.5. According to the COMET metric, ChatGPT4 slightly outperforms Lingvanex.
When it comes to translating into German, the situation is the same as with French. However, Lingvanex’s lag in the COMET metric is minimal.
When it comes to translating into German, the situation is the same as with French. However, Lingvanex’s lag in the COMET metric is minimal.
Let’s compile all the differences in a table. In red font, we’ll indicate where ChatGPT falls short of Lingvanex. In green font, we’ll mark where it surpasses. The data is relevant as of July 31, 2023.
The Lingvanex translation price was calculated based on the cost of a month’s rent for a basic GPU server (150 dollars) + the monthly price for a Lingvanex language model translation (from 100 dollars), and the number of characters that can be translated in a month with this configuration.
The test results show that while ChatGPT 3.5 is mostly inferior to Lingvanex in translation quality, according to the COMET metric, ChatGPT4 often matches Lingvanex. It’s worth noting that currently, translating large volumes of text with ChatGPT4 is very expensive.
In order to perform the tests for this article and translate roughly 20,000 lines through ChatGPT4, a total of $45 was expended. The calculation of the translation cost can be confusing, as it’s difficult to estimate in advance, in tokens, how much you’ll end up paying for the translation.
At the moment, the translation speed of ChatGPT 4 is unstable, most likely depending on the current load on their servers. We had to take 3–4 second breaks between requests. Overall, the translation speed with the test dataset was about 8 words per second.
Our solution enables the translation of several thousand words per second, even on weaker servers.
In addition, we observed censorship in the translations: if a line contains obscene language, ChatGPT will not proceed to translate the entire sentence.
Therefore, ChatGPT is better used for the stylistic translation of small volumes of text without special security requirements. Furthermore, styles and themes can be altered on the go.
By carefully selecting prompts, you can achieve improved quality tailored to a specific task, but this requires going through a significant number of prompts.
Lingvanex translation solutions are better used where large volumes of translation are required at a low cost, with safety, speed, and stability.
I admit that for some language pairs, the difference in translation quality may be different, but testing all possible pairs is lengthy and costly.
In general, solutions from ChatGPT and Lingvanex are designed for different purposes and should be chosen depending on the task.
If our company’s solution suits you, we offer a free trial of our server, mobile SDK, and Cloud API. Our company also offers voice recognition solutions (audio to text).
If you want to test our solution, write to info@lingvanex.com