Anton Shestakov
Feb 7, 2025
Hello everyone! We are Aiphoria — a company developing solutions that automate routine employee work across organizations. There’s a lot of buzz around AI agents lately, with some calling them one of the major trends for 2025. We call our virtual employees “AI Pros” and develop them as turnkey solutions for various business needs — from customer support to recruitment and much more.

We have developed our own next-gen NLP platform (Aiphoria Pro Platform) that makes it easy to integrate AI Pros into almost any business, even one whose operations are voice- rather than text-centric. It adapts to any specific knowledge domain and language, features an open architecture, supports on-premise deployment, and, most importantly, our Pros can achieve set goals thanks to our flexible GOAL architecture (GOAL stands for Goal-Oriented Architecture for LLM). This means our clients’ customers are guided not through a predetermined rule-based scenario but toward a specific objective — anything from making a sale to resolving an issue in a tech-support scenario. All of this builds on advanced GenAI, especially the conversational capabilities of LLMs. We’ll cover this approach in more detail in future posts.

You’ve probably read numerous stories about how such systems work for English, so we won’t dwell on those details here. Instead, we want to dedicate this first post to a challenge we encountered in one of our cases — one that is quite representative of the current state of AI in general.

As we’ve mentioned, we help with not just text-based scenarios but also voice-centric ones — and we find the possibilities of current LLMs here especially exciting. While speech tech is nothing new in English-speaking countries, achieving the same for, say, Central Asian countries is a challenge of a different magnitude. These countries are often overlooked in terms of AI technologies, despite being developed nations with unique cultures and massive digitized businesses across various sectors, which makes them data-rich environments. Today we’ll share one such case study — a collaboration with TBC Uzbekistan, the leading digital banking ecosystem in Central Asia, where we worked together to build and implement full-fledged AI agents. Even with our team’s experience building challenging products at big companies, what we achieved together in less than a year will always have a special place in our hearts and makes us genuinely proud; we consider it a remarkable tech achievement by any measure.

Image credit: TBC Uzbekistan
The Challenge
Let’s summarize the challenges we tackled together. While we brought the technical expertise, TBC Uzbekistan brought deep banking knowledge and a clear vision of what they wanted: virtual operators to ease the workload of their call center across several scenarios — support, payment reminders, sales, and potentially much more. They played a key role in shaping these solutions, providing quick feedback and valuable insights from their digital banking operations. Given the sensitive nature of banking data, confidentiality was paramount from day one, which made our platform’s on-premise capabilities a perfect fit for this partnership.
The challenge was to make everything work for Uzbek — a language with very little existing speech-tech coverage. What does it take to localize Aiphoria Pros for a new language? We need to build speech synthesis and recognition, adapt our NLP models, and train an LLM to handle the language with decent quality. Let’s start with speech synthesis and recognition — two related but uniquely challenging tasks. But first, some geographical and cultural context is in order.
Uzbekistan is a vast and beautiful country in Central Asia bordering Tajikistan, Kyrgyzstan, and Kazakhstan. It is also the most populous country in the region, with more than 37 million people. As a former Soviet republic, it has several languages in use, often mixed together. This is especially true in border regions, where people might speak Tajik or Karakalpak — you probably haven’t heard of the latter, but it has 2 million native speakers.

Uzbekistan is a diverse country in more than one way, as evidenced by these photos from various sources, including our founder’s iPhone
Add to this various dialects — Tashkent, Surkash, Jizzakh, Namangan, and Khorezm among them — each sounding quite different. Another important fact is the country’s relatively recent transition from Cyrillic to an official Latin script, which many people still haven’t fully absorbed. On top of that, people often mix languages in day-to-day communication, sometimes switching several times within a single sentence. Put all this together and you get an extremely interesting challenge for any ML engineer building speech tech here. And let’s not forget the challenges of phone-call speech analysis common to most countries: poor call quality, calls from subway stations, traffic jams, villages with bad mobile coverage, and so on.
Readers might not know Uzbek or Russian, so let’s imagine how this would sound with the more familiar pair of English and Spanish. Picture receiving a call where a barely audible voice says something like: “Hey there, es un bank? My card no trabajo and I’m in deep desierto with connection débil.” Only it’s not English mixed with Spanish — it’s Uzbek mixed with Russian, and it’s more or less the everyday speech you can hear on the street.
The state of Uzbek Automatic Speech Recognition
Before building our own solution for Uzbek ASR, we needed to understand what already existed in the market. It turned out that several solutions already exist; however, they all have various issues that make them unfit for our purpose.
Quality for such products is usually evaluated with SER/CER (Sentence/Character Error Rate) or WER (Word Error Rate). We used the latter — a standard, widespread metric for speech recognition quality that shows the percentage of word-level errors relative to the reference text. The lower the percentage, the better.
On paper, all existing solutions show good results — for instance, “META seamless_communication” showed 17.33% WER on the FLEURS dataset. However, when faced with real raw data from TBC Uzbekistan’s call center — typical phone-call quality plus banking-specific vocabulary — the actual quality of existing solutions turned out to be around 40% WER at best. Roughly speaking, this means that in an Uzbek sentence of ten words, existing solutions correctly recognize only six. That is an unacceptably low result.
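To make the metric concrete, here is a minimal, self-contained sketch of WER as word-level edit distance (our evaluation pipeline is more involved; this is just the core idea):

```python
# A minimal sketch of WER: word-level edit distance (substitutions,
# insertions, and deletions) divided by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# Four errors in a ten-word reference -> 40% WER: only six words usable.
assert wer("a b c d e f g h i j", "a b x d e y g h z w") == 0.4
```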

Without proper ASR, voice-channel AI agents are simply not possible — recognition is the foundation of everything else. Too many recognition mistakes leave the LLM unable to grasp what the problem actually is, and nothing works properly. That’s why we had no choice but to build a new speech recognition system specifically for TBC Uzbekistan.
Building our own ASR
The foundation of any recognition system is well-labeled datasets. Fortunately, the recent trend is toward global open-source efforts, from large companies like Meta as well as startups. Sharing training data benefits everyone, because good results can’t be achieved quickly in isolation — constant retraining is needed, and each new project using open data can improve that data’s quality. This holds even for complex foundation models, as evidenced by Stanford’s research:

Artificial intelligence foundation models worldwide, 2021 to 2023, by access type (Statista)
This is also true for datasets in different languages. In the case of Uzbek, the Uzbek Voice team, in partnership with Uzbekistan’s IT Park, decided to make their impressive 1,000+ hour Uzbek speech dataset publicly available. There are other examples too, such as Mozilla Common Voice, which is much smaller, and the multi-language FLEURS dataset with just a dozen hours of Uzbek, to name a few.
Unfortunately, this is not enough to build a functioning ASR for our purposes. Big datasets are never perfect, and their transcriptions usually contain many mistakes. For instance, while analyzing one of these datasets, we came across historical Uzbek verses transcribed with obsolete terminology that proved impossible to identify within the text corpus — and that’s just one example. Most importantly, these datasets lack data from conversations with banks: all the specifics of the banking lexicon, its terms and definitions. That’s why the TBC Uzbekistan call center data mentioned above was crucial.
So we had labeled but imperfect data from public datasets, plus crucial anonymized, unlabeled data from the TBC Uzbekistan call center. Sounds like quite a task for a labeling team! To appreciate its importance, let’s first look at what data labeling essentially is.
Data labeling is another foundation of effective ASR systems. The process starts with creating high-quality audio-text pairs, where human experts carefully transcribe speech recordings, noting everything from words to emotional tones. Think of it as building a comprehensive translation dictionary between spoken and written language.
But raw transcriptions aren’t enough. The next crucial step is normalization — transforming these transcripts into a standardized format. This cleanup process ensures consistency across different speakers and accents, making it easier for AI models to learn patterns and generate accurate transcriptions. It’s like establishing a universal grammar that helps the system understand and process speech more effectively, regardless of who’s speaking.
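For illustration, a single labeled example might look something like the record below. The field names and the sample phrase are our own illustration, not TBC’s production schema:

```python
import json

# One labeled example: an audio file paired with its verbatim transcript
# plus light metadata. Fields here are illustrative, not a real schema.
record = {
    "audio_path": "calls/2024-06-01/call_0187_chunk_03.wav",
    "duration_sec": 4.7,
    "text": "assalomu alaykum kartam ishlamayapti",  # "hello, my card isn't working"
    "language": "uz",       # segments can also be "ru" or code-switched
    "emotion": "neutral",
}
print(json.dumps(record, ensure_ascii=False))
```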
Remember how we mentioned earlier that Uzbekistan switched from Cyrillic to Latin script? And that there is a mix of different languages and dialects? All of this made building a data labeling team a non-trivial task. For example, not many people have mastered the official Uzbek language (i.e., its Latin orthography). So we first had to build expertise internally in the bank and bring in linguists who could explain the nuances of grammar and dialects.

To this day there’s a mix of scripts and languages even on the signs of street shop vendors © Yandex Maps
Normalizing both existing labeled datasets and our annotators’ output was quite the adventure. We had to standardize everything from various spellings of names and places to some rather peculiar linguistic cases.
Let us share some examples. In the official Uzbek language, the letters ‘h’ and ‘x’ create an interesting confusion — the ‘x’ isn’t pronounced like in English but almost identically to ‘h’, just slightly firmer. Given the recent transition from Cyrillic to Latin alphabet, this led to multiple transcription variants. Take the word “xo’p” (equivalent to “OK” in English) — we’ve seen it written as “hop”, “ho’p”, “hup”, and “xop”. While these variations are commonly used, even by our annotators, they’re technically incorrect. The word is so prevalent in the language that sometimes almost an entire conversation can consist of “xo’p”. This would occasionally lead to our model “getting creative” and translating some similarly sounding names and surnames as “Xop Xopich” (the English equivalent would be something like “Mr. Okay Okayson”). Without standardizing these variations, we’d risk sounding illiterate.
Then there’s the very common phrase “assalomu alaykum,” which means “peace be upon you” but is basically used as “hello”. We used the correct form in the previous sentence, but there are many examples of incorrect spellings like “assalomy,” “assalamu,” “alekum,” “aleykum” and many others with different combinations — all of these need to be normalized. Again, it’s not that people are illiterate; it’s all part of a recent transition from the Cyrillic to Latin alphabet — people are not yet used to the different letters.
Another challenging case involved the ‘-lik’ suffix with numerals. Even our linguistic expert had to consult external colleagues to crack this one. We discovered that ‘-lik’ should only be used when a numeral represents something collective or characteristic of a group — like area codes for phone numbers — but not with card numbers. Getting these subtle distinctions right was crucial; otherwise, our model could misinterpret user inputs and generate incorrect responses.
In total, we had to create over 7,000 such normalization rules.

Just to name a few
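To give a feel for what such rules can look like in code, here is a heavily simplified sketch built from the examples above — the real rule set is far larger and linguistically more careful:

```python
import re

# A tiny, illustrative slice of transcript normalization: map common
# spelling variants to one canonical form. The real system has 7,000+ rules.
CANONICAL = {
    r"\b(hop|ho'p|hup|xop)\b": "xo'p",    # "OK"
    r"\bassal[ao]m[uy]?\b": "assalomu",   # greeting, first word
    r"\bal[ae][iy]?kum\b": "alaykum",     # greeting, second word
}

def normalize(text: str) -> str:
    for pattern, canonical in CANONICAL.items():
        text = re.sub(pattern, canonical, text)
    return text

print(normalize("assalamu alekum, hop hop"))
# -> "assalomu alaykum, xo'p xo'p"
```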
Another challenge was learning to identify specific vocabulary, as not all annotators understood certain banking nuances. For instance, team members didn’t know what “KATM” was, despite its frequent mention in calls — it’s simply the equivalent of a credit bureau. In such cases we often ran into an “acoustic illusion”: five different annotators, each absolutely certain they had correctly transcribed what was said, and none of them right. This is partially solved by properly selected datasets — we based ours on customer-operator conversation topics and built the training corpus around the same domain. Part of the internal dataset was collected with help from call center operators, whom we asked to record phrases containing useful vocabulary and internal bank abbreviations. In some cases, we brought in people from other departments to resolve disputed cases.
As a result, our data labeling team became so proficient that some members started receiving comments from friends about how grammatically correct their chat messages had become.


A real conversation in which a member of our team politely asks for photos and is told he talks ‘like a teacher’ — that is, he speaks very ‘properly.’
The result: sufficient well-annotated data and our own working ASR — likely the world’s best for the Uzbek language, as confirmed by subsequent production tests and by results on public datasets.
Speech Synthesis
English-speaking audiences are no strangers to good speech synthesis — from voice assistants to navigation and customer support across services. The situation is also good for other popular languages, at least most European ones. For Uzbek, however, there was no on-premise-deployable model with good enough quality in both ASR and TTS to address TBC’s use cases (bilingual, tonally diverse bots mimicking humans in phone calls).
Typically, what people in Uzbekistan hear is unit-selection synthesis, where pre-recorded phrases change depending on the situation: a number, street name, or personal name is swapped in, but the sentence structure stays fixed. It often sounds quite robotic and unconvincing. All speech synthesis once worked this way, but many languages now have synthesis that is at times barely distinguishable from human speech. Why should Uzbek be any different? So we set out to create exactly that level of quality with full-fledged neural network-based synthesis, able to synthesize any word from scratch rather than replay pre-recorded ones. This serves the bank far better, both in support and in other scenarios like payment reminders.
At its core, a modern TTS pipeline consists of three essential components. The first is the NLP component, which handles all text processing and produces “clean” input text, properly formatted without digits or special punctuation. This is a challenge in itself, especially in languages with complex declension like Russian and Uzbek — a simple “4%” is always written the same way but can sound very different depending on the grammatical case. The second component is the acoustic model, which generates mel spectrograms from the processed text; trained on many hours of audio-text pairs, it forms the heart of the synthesis process. Finally, the vocoder transforms mel spectrograms into audible waveforms, typically output as .wav files or audio streams.
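Schematically, the whole pipeline is just these three stages chained together. The sketch below uses hypothetical model interfaces, not our actual APIs:

```python
import numpy as np

# A schematic of the three-stage TTS pipeline described above.
# The three objects passed in are hypothetical placeholders.
def synthesize(raw_text: str, text_normalizer, acoustic_model, vocoder) -> np.ndarray:
    # 1) NLP front end: expand numbers, strip odd punctuation, resolve
    #    case forms ("4%" -> its fully spelled-out, correctly inflected form).
    clean_text = text_normalizer.normalize(raw_text)
    # 2) Acoustic model: clean text -> mel spectrogram (frames x mel bins).
    mel: np.ndarray = acoustic_model.infer(clean_text)
    # 3) Vocoder: mel spectrogram -> waveform samples, ready to stream
    #    or write out as a .wav file.
    waveform: np.ndarray = vocoder.infer(mel)
    return waveform
```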
None of the existing speech synthesis solutions suited us, for several reasons: we were building a bilingual, multi-tonal solution meant to run autonomously on the bank’s servers. Bilingual means the TTS must synthesize the same voice in both languages, so it can talk to everyone in the country — Uzbek and Russian speakers alike; even if a customer switches languages mid-conversation, we can adapt and answer accordingly without changing voices. And since we deal with potentially difficult situations, from support to payment reminders, the voice must sound human and offer a range of tones suited to different cases.
One of our biggest initial challenges while building TTS was the lack of specific open-source datasets for Uzbek needed to make a solution fit for banking with all its vocabulary and situations. This meant collecting and preparing vast amounts of text data from various sources, including internal anonymized (non-sensitive) banking info, financial news, and general content like sports coverage and fairy tales. A proper TTS dataset should also be representative across multiple dimensions: high-frequency words, phoneme combinations, and various other linguistic patterns. We carefully curated our dataset with all these factors in mind.
Our team also developed several other crucial components from scratch, including a custom normalizer for Uzbek that handles specific cases like ordinal numbers, converting them into a unified written-out form. We also created an accentuation (stress-placement) system for Uzbek words.
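As a toy illustration of what the number-normalization component does, here is a sketch that expands a few digit-ordinals into words. The dictionary entries and the “3-” ordinal notation are our illustrative assumptions; the real normalizer covers full number grammar:

```python
import re

# Toy sketch of number normalization for TTS: digits marked as ordinals
# are expanded to words. Only a few illustrative entries; the real
# normalizer handles full number grammar, percentages, dates, etc.
ORDINALS = {
    "1": "birinchi",   # first
    "2": "ikkinchi",   # second
    "3": "uchinchi",   # third
}

def expand_ordinals(text: str) -> str:
    # Assumed convention: a digit followed by "-" before a noun marks
    # an ordinal (e.g. "3-" for "3rd").
    return re.sub(r"\b([123])-", lambda m: ORDINALS[m.group(1)] + " ", text)

print(expand_ordinals("3-kartangiz"))  # -> "uchinchi kartangiz"
```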
The other challenge: to train a TTS model, we needed many hours of Uzbek and Russian speech recorded by the same person — and, as before, not just everyday speech; it had to include specific banking terms and other financial vocabulary. Thankfully, having already collected enough text for a well-balanced dataset, we found a wonderful voice actor — an extremely professional young woman who speaks both Uzbek and Russian excellently.

Work in progress
As we’ve mentioned before, just one voice tone simply doesn’t cut it when dealing with the variety of situations in banking. A friendly reminder about an upcoming payment and an urgent notice require completely different approaches. The tone of voice (TOV) can make or break these interactions, potentially influencing the customer’s response and the conversation’s outcome. That’s why we’ve developed three distinct voice tones:
Friendly: for support and routine interactions
Neutral: for standard communications
Strict: for urgent matters
Behind the scenes, this meant our voice talent had to record extensive amounts of text in these different emotional states, in two languages. We didn’t just wing it — the heads of TBC Uzbekistan’s respective departments carefully supervised the recording process to ensure each tone matched their communication needs and didn’t bleed into the others.
From an ML perspective, our approach was methodical (a compressed sketch follows the list):
1. First, we developed a base model using our complete voice dataset
2. Then, we fine-tuned specific emotional variations using carefully curated subsets
3. Finally, we excluded inappropriate tones based on the intended use case
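Here is that workflow compressed into code; the class, datasets, and hyperparameters are stand-ins, not our production setup:

```python
from copy import deepcopy

# Stand-in stub for an acoustic model; the real training loop is omitted.
class AcousticModel:
    def fit(self, dataset, epochs=1, lr=1e-4):
        ...  # training would happen here
        return self

    def clone(self):
        return deepcopy(self)  # start fine-tuning from the base weights

def build_tone_models(full_dataset, tone_subsets, allowed_tones):
    # 1) Base model trained on the complete single-speaker bilingual dataset.
    base = AcousticModel().fit(full_dataset, epochs=200)
    # 2) One fine-tuned variant per tone, each on its curated subset;
    #    a small learning rate keeps the base voice intact.
    tone_models = {tone: base.clone().fit(subset, epochs=20, lr=1e-5)
                   for tone, subset in tone_subsets.items()}
    # 3) Expose only the tones appropriate for this deployment's use case.
    return {t: m for t, m in tone_models.items() if t in allowed_tones}

models = build_tone_models(
    full_dataset=["<all recordings>"],
    tone_subsets={"friendly": ["..."], "neutral": ["..."], "strict": ["..."]},
    allowed_tones={"friendly", "neutral"},
)
print(sorted(models))  # -> ['friendly', 'neutral']
```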
We added an extra layer of sophistication by incorporating tongue twisters into our training data. These traditional sayings and proverbs contain words with unique phoneme combinations that rarely appear in everyday speech. This proved to be a clever insurance policy — if our model ever encounters these unusual letter combinations in the wild, it already knows how to pronounce them correctly.
The results? A versatile TTS system that speaks the language of banking in all the right tones, capable of handling even the most unusual pronunciations.
A few examples of Uzbek and Russian synthesis of the phrase “AI will make our lives better”:
Working with a professional voice actor in a studio setting brought its own set of interesting challenges. Ironically, her high level of professionalism sometimes worked against us — we occasionally had to ask her to speak more naturally rather than with her perfect professional diction. We also had to monitor her voice to ensure it didn’t strain and that intonation didn’t drift with fatigue.
Additionally, any recording process for speech synthesis is teamwork between the sound engineer and the voice talent. In our case, the sound engineer not only monitored recording quality but also checked the text’s correctness. But nothing was quite that easy — not in our case! Going back to the script problem: it turned out the engineer couldn’t read Cyrillic but understood Latin script, while the voice talent read Cyrillic perfectly but couldn’t read Latin at all. So we had to prepare two versions of every text — one in Cyrillic and one in Latin — so that everyone could follow along and catch potential errors. The result was 20 clean hours of recorded speech without pauses, which meant more than 100 hours in the studio.
Just like department heads supervised our voice recordings, we needed people who truly understood the nuances of customer communication to evaluate our resulting TTS models. External assessors wouldn’t cut it — they simply wouldn’t know how support agents actually talk to customers.
Our solution? We built our own assessment system from scratch, recruiting call center veterans who live and breathe customer service. These pros rated each model iteration on a 1–5 scale, focusing on natural flow and appropriate tone of voice. As a side note on new professions created by AI: in our opinion, some of these people would make perfect voice coaches and, if they wish, could work in this booming industry instead of their old support jobs.
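Aggregating those 1–5 ratings amounts to computing a mean opinion score (MOS) per model iteration; a minimal sketch with made-up numbers:

```python
from statistics import mean, stdev

# Minimal sketch of MOS-style aggregation over assessor ratings (1-5 scale).
# The scores below are invented purely for illustration.
ratings = {
    "model_v1": [3, 4, 3, 2, 4, 3],
    "model_v2": [4, 5, 4, 4, 3, 5],
}

for model, scores in ratings.items():
    print(f"{model}: MOS = {mean(scores):.2f} ± {stdev(scores):.2f} (n={len(scores)})")
```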
While we’ve achieved significant success with the acoustic model, we’re currently working on fine-tuning an open-source vocoder to better match our speaker’s characteristics. The substantial dataset we’ve created for a single-speaker bilingual system provides an excellent foundation for future improvements.
***
After half a year of dedicated collaboration, our joint efforts with TBC Uzbekistan yielded something remarkable — together we brought to life what is likely the most advanced speech tech solution for a country of almost 40 million people. The solution was shaped not just by our technical expertise but also by TBC’s deep understanding of their customers’ needs and their exceptional readiness to innovate and implement new features. Their ability to quickly test, validate, and roll out new capabilities helped us achieve in months what typically takes years.

From a business perspective, the partnership has already proven transformative: as of early 2025, AI-powered agents handled over 40% of reminder calls for early-stage delinquency loans, delivering efficiency gains while maintaining a human-like conversational experience that most customers couldn’t distinguish from real people. This implementation of AI has significantly boosted operational efficiency — roughly 10 times more efficient than human operators. Encouraged by these results and maintaining their characteristic momentum, TBC is actively expanding this technology across other business functions, continuously working with us to identify new opportunities and use cases.

What started as a solution for Uzbekistan revealed a much bigger opportunity.
We discovered that many regions face similar challenges — places where global speech recognition solutions fall short of local needs. Even Spanish-speaking countries, despite having access to major providers like Google, struggle with regional variation; Argentine Spanish, for instance, differs significantly from the standard Spanish these solutions support.

We believe our methodology has become a product in itself, enabling organizations to build locally optimized speech technology that truly understands their customers. We offer proven expertise in helping companies develop custom solutions on their own data. That foundation then unlocks the full potential of our platform, letting companies deploy AI agents across many parts of the business.

Stay tuned for the equally interesting next part of our story — building an LLM tailor-made for banking and making it work in tandem with our ASR and TTS.
Technical Team