Anton Shestakov
Jun 2, 2025
At Aiphoria, we develop AI agents designed to automate employee work across diverse organizational environments. Our Aiphoria Pro Platform is a next-gen NLP solution that easily integrates AI into any business process — text or voice-based. It adapts to any knowledge domain or language with open architecture and on-premise deployment options. Our GOAL framework (Goal-Oriented Architecture for LLM) ensures an AI employee guides customers toward specific objectives rather than through rigid rule-based scenarios, whether selling products or solving technical issues.
TBC Uzbekistan, the dominant financial player not only in Uzbekistan but across Central Asia, set out on an ambitious vision: developing proprietary speech technology in both Uzbek and Russian, arguably the boldest AI project in the entire Central Asian region. To realize this vision, TBC Uzbekistan assembled a powerful internal team with specialized expertise and strategically partnered with our company, whose experience and technological capabilities would help bring these plans to life.
The joint project underscored a critical challenge in AI integration: the limitations of standardized, existing AI solutions in complex, diverse markets. Our work began with developing advanced Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) technologies for the low-resource Uzbek language, which we’ve covered extensively in our previous post. However, it’s not enough to just speak or hear Uzbek — you need to teach your system to understand what people are saying and respond appropriately.
Thankfully, Large Language Models (LLMs) can help, but the challenge persists — Uzbek is still a low-resource language, meaning there are no ready-to-use solutions. This is especially true in banking, our target domain. Banking requires LLMs not just to speak Uzbek but to understand specialized vocabulary. Customers won’t discuss weather or news with a banking AI — they’ll ask about loans, debts, and credits in all their complexity. Banking also requires any LLM solution to work on-premise due to high security and data privacy concerns. Additionally, banking interactions demand speed. Most communications happen by phone, with customers frequently interrupting, mishearing, mixing languages and dialects. For our LLM solution to work effectively in voice applications, it needs to be exceptionally quick.
Once again, we face an interesting technological challenge: creating a tailored AI solution for a culturally nuanced market — this time, leveraging LLMs in true partnership with TBC Uzbekistan’s expertise. The AI team at TBC Uzbekistan collaborated closely with us, sharing their insights on local banking practices and customer communication patterns that significantly enhanced our solution’s effectiveness. Their technical team was also instrumental in the implementation phase, ensuring seamless integration with existing banking systems and security protocols — a critical aspect that can make or break AI adoption in financial institutions.

Image credit: TBC Uzbekistan
So let's dive deep into how we managed all of this.
What we had to keep in mind
One obvious option would be training a model from scratch. However, that would require massive datasets: billions of words of text in all required languages, drawn from sources like Wikipedia. On top of that, the computational resources needed for training would be significant, to say the least. For context, training a model like GPT-3 is believed to cost millions of dollars. It wouldn't make much commercial sense to spend that kind of money on this task. Thankfully, there is always a smarter way.
So, instead of starting from scratch, we decided to fine-tune existing open-source LLMs. As we've already explained in our previous post, Uzbekistan is a country where several languages and dialects are spoken. This means that we needed a fully bilingual stack: LLMs that could understand and generate text in both Uzbek and Russian. Additionally, we wanted to deploy the system on-premise because of the data security concerns inherent to banking. Our target was models that would also be:
– small enough to run well on our client's existing, cost-efficient hardware, without pouring additional millions into infrastructure upgrades
– fast enough to deliver quick responses in the voice modality during phone calls
– powerful enough to handle the complexities of bilingual text generation
All of this makes our target model about 7–9 billion parameters large. Although there is no universal definition of a "small" versus a "large" model, small models usually stay below 1 billion parameters, while very large models like GPT-4 can exceed 1 trillion. By that yardstick, our target model is relatively small.

An example of how LLMs can be categorized
Choosing the Right Model
Nearly all open-source models are released in two versions: pretrained (-pt) and instruction-tuned (-it). Pretrained models have a single capability — predicting the next word based on previous context. They understand that after “the cat sits on the” comes “mat” rather than “mats,” or that after “heavy” comes “rain” not “raining.” Essentially, pt-models learn language structure.
However, most users need models to follow specific instructions: “answer a question,” “tell a joke,” or “choose from the given options.” Pretrained models can’t follow these instructions, so they undergo additional training on instruction datasets to create instruction-tuned (-it) models. In other words, instruction-tuned models evolve from pretrained ones.
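To make the distinction concrete, here's a minimal sketch, assuming the Hugging Face transformers library, of how the two variants are used differently (the model names are illustrative): a -pt model simply continues text, while an -it model expects a chat-formatted prompt.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# A pretrained (-pt) model only continues the text it is given.
pt_tok = AutoTokenizer.from_pretrained("google/gemma-2-9b")
pt_model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b", device_map="auto")
inputs = pt_tok("The cat sits on the", return_tensors="pt").to(pt_model.device)
out = pt_model.generate(**inputs, max_new_tokens=5)
print(pt_tok.decode(out[0], skip_special_tokens=True))  # likely continues with "mat ..."

# An instruction-tuned (-it) model expects a chat-formatted prompt instead.
it_tok = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
it_model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it", device_map="auto")
chat = [{"role": "user", "content": "Tell a short joke about cats."}]
prompt = it_tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = it_tok(prompt, return_tensors="pt").to(it_model.device)
out = it_model.generate(**inputs, max_new_tokens=50)
print(it_tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```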
Our goal was to create an instruction-following assistant capable of responding in two languages. This presented several strategic choices: should we start with a -pt or -it model? Should we use multiphase training (completion and instruction + alignment) or single-phase (instruction only)? Should we implement full fine-tuning (retraining the entire model or specific layers) or parameter-efficient fine-tuning (like LoRA)? Should we separately train the embedding layer or leave it unchanged?
Different combinations of answers to these questions lead to dramatically different training strategies. For example, starting with a -pt model would require extensive training on our target languages followed by comprehensive instruction training (since it hasn’t previously learned to follow instructions). This approach might deliver superior language understanding, but would demand thousands to tens of thousands of high-quality “instruction-response” pairs. These instructions would need to be diverse, including not only domain-specific banking instructions but also, for instance, coding tasks.
Safety considerations also factored into our decision-making. Instruction-tuned models are typically already trained to safely respond to provocations and inappropriate content requests. Training instructions from scratch with a -pt model would require teaching safety protocols from square one — demanding even more resources.
Alternatively, using an -it model presents different challenges. Training it on completion tasks might cause it to forget some of its instruction-following capabilities. Skipping completion training and proceeding directly to instruction fine-tuning in our target languages risks the model never fully learning the language's structure.
As for choosing between full fine-tuning and parameter-efficient fine-tuning (PEFT), fully fine-tuning specific layers or the entire model requires more time and data. While the exact figures depend on model size and the number of parameters being retrained, the differences can be substantial. Using PEFT methods like LoRA, however, can limit the model's ability to adapt to new languages, especially if the base model has had limited exposure to the target language or related languages from the same family.
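To illustrate the PEFT side of this trade-off, here's a minimal LoRA sketch using the peft library; the hyperparameters are illustrative, not the values we ultimately used. Instead of updating all of the model's weights, LoRA injects small trainable low-rank matrices into the attention projections, leaving the base weights frozen.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it")

# Freeze the base weights and train small low-rank adapters instead.
lora_config = LoraConfig(
    r=16,                   # rank of the adapter matrices (illustrative)
    lora_alpha=32,          # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints something like: trainable params ~0.1% of the ~9B total
```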
Each decision represented a trade-off between performance, resource utilization, and development timeline. For a banking application, we couldn’t afford to compromise on accuracy, but we also needed to be practical about computational requirements and development time. The key question became: how could we achieve the best language capabilities and instruction-following behavior with reasonable computational resources and within a practical timeframe?
Experimental Approach
Our journey began with extensive testing of several model families including Meta’s LLaMa-3.x, Google’s Gemma-2, and Alibaba’s Qwen. We evaluated models ranging from 8B to 72B parameters across four languages. Uzbek and Russian were our target languages, while English established our performance benchmarks. We also included Turkish as a high-resource relative of Uzbek to provide context for Turkic language performance.

Relative performance of models on Turkish-translated datasets. The metric used is a weighted sum of metrics across each model. Normalization is performed relative to the best-performing model.
Among models under 10B parameters, Gemma-2-9B demonstrated superior performance, particularly outperforming comparable models like LLaMa-3-8B and LLaMa-3.1-8B in Turkish and Uzbek. This clear advantage made Gemma-2-9B our foundation for further training.
The path to optimal performance wasn’t straightforward. We conducted multiple experiments to determine the best training strategy for our specific needs. We tested vocabulary handling with and without expansion and compared full fine-tuning against more efficient LoRA and QLoRA adapters. Our experiments also explored different training phases, contrasting two-phase approaches (completion followed by instruction) against single-phase instruction-only training.
Additionally, we investigated the impact of instruction diversity by comparing broad instruction sets against domain-specific banking instructions. As we’ve explained before, base model selection was another critical factor, so we evaluated pretrained versions against instruction-tuned variants to determine our starting point.
After extensive testing, we developed a surprisingly effective approach. We started with Gemma-2-9B-IT (the instruction-tuned version) and implemented a two-phase training process. The first phase focused on completion training using mixed Uzbek and Russian content, establishing language fluency. The second phase introduced instruction fine-tuning specifically for domain-specific banking scenarios.
Throughout both phases, we employed Parameter-Efficient Fine-Tuning (PEFT) with QLoRA, which significantly reduced computational requirements while maintaining performance.
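For reference, a QLoRA setup along these lines typically looks like the following sketch with transformers, bitsandbytes, and peft; the configuration values are illustrative rather than our production settings. The base model is loaded in 4-bit precision, and LoRA adapters are trained on top of the frozen quantized weights.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the frozen base model in 4-bit precision...
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute dtype for matmuls
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it", quantization_config=bnb_config, device_map="auto"
)

# ...then train LoRA adapters on top of the quantized weights.
model = prepare_model_for_kbit_training(model)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
```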
Completion Training: Learning a Low-Resource Language
For the completion phase, gathering appropriate data presented unique challenges for each language. Russian, being a high-resource language, offered abundant datasets. Uzbek, however, required more creative approaches. Models trained on purely translated texts tend to adopt the stilted patterns of translation tools rather than natural language flow, so that wasn’t an option.
We combined multiple data sources to create a comprehensive training corpus: publicly available Uzbek datasets from Hugging Face, open English and Russian datasets that we translated to Uzbek, and proprietary data from our banking partner, which included anonymized real conversations from support scenarios. These client-provided datasets were essential for making future interactions sound as natural and professional as possible: real data during pre-training helps the model learn domain-specific vocabulary and conversational style. It also saved us a lot of time; without it, teaching the model to respond smoothly would have taken far more work.
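As a sketch of how such a mixed corpus can be assembled with the Hugging Face datasets library (the dataset names and mixing weights below are hypothetical placeholders, not our actual sources):

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical components of the completion corpus; names are placeholders.
uzbek_public = load_dataset("some-org/uzbek-text-corpus", split="train")
translated_uz = load_dataset("some-org/en-ru-translated-to-uz", split="train")
bank_dialogs = load_dataset("json", data_files="anonymized_support_dialogs.jsonl", split="train")

# Interleave with explicit sampling probabilities so no source dominates.
corpus = interleave_datasets(
    [uzbek_public, translated_uz, bank_dialogs],
    probabilities=[0.5, 0.3, 0.2],  # illustrative mixing weights
    seed=42,
)
```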
Instruction Fine-Tuning: Teaching Command Following
The second phase presented even greater challenges. Despite Gemma being top-tier, neither it nor other instruction-tuned models of the necessary caliber can sufficiently handle domain-specific instructions in Uzbek. This meant we needed to create comprehensive instruction-response datasets in two languages, both Uzbek and Russian. Our instruction datasets combined two approaches:
General instructions (like “generate a question based on a statement”, “write a short story”, “rewrite this sentence to make it more formal”), which we primarily translated from existing resources
Domain-specific banking instructions, which we both generated synthetically using larger open models and created with human banking experts from TBC Uzbekistan; we also adapted these instructions to the expected popular use cases of our solution, such as support scenarios, where RAG (Retrieval-Augmented Generation) would be used and instructions therefore had to be optimized for it (a formatting sketch follows below)
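Here's a minimal sketch of how instruction-response pairs can be rendered with the base model's own chat template, so the fine-tuning data matches the turn format the -it model already expects; the Uzbek pair below is purely illustrative.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

# A purely illustrative Uzbek banking instruction-response pair
# ("Briefly explain the loan terms to the customer.").
pair = {
    "instruction": "Mijozga kredit shartlarini qisqacha tushuntiring.",
    "response": "Albatta! Kredit muddati 12 oy, yillik stavka ...",
}

# Render the pair with the model's own chat template so the fine-tuning
# data matches the turn format the -it model already expects.
text = tok.apply_chat_template(
    [
        {"role": "user", "content": pair["instruction"]},
        {"role": "assistant", "content": pair["response"]},
    ],
    tokenize=False,
)
print(text)  # <start_of_turn>user ... <start_of_turn>model ...
```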

Distribution of automatic translation quality scores for datasets translated from English to Uzbek
Translating instructions proved unexpectedly difficult. Many standard LLM training instructions contain language-specific elements that lose their meaning when translated. For example, disambiguation tasks stop making sense after translation because the homonymy they rely on simply disappears in the target language.
To avoid this, rather than relying solely on automated translation, we engaged native Uzbek speakers to review our translated instructions. This human oversight revealed that approximately 70% of translated instructions maintained their original meaning in Uzbek, with correct grammar and syntax, which proved that the underlying idea of automated translation was viable. For the problematic 30%, our language experts helped reformulate instructions to preserve their training value while making them culturally and linguistically appropriate wherever possible. However, not everything could be fixed without losing the intended meaning, so some instructions had to be dropped, which was in itself a valuable contribution from the human experts.
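As a hypothetical illustration of this kind of automatic quality scoring (the chart above shows the score distribution, but the exact pipeline isn't detailed here), one lightweight pre-filter scores round-trip translations with chrF via the sacrebleu library and flags low-scoring items for human review:

```python
from sacrebleu.metrics import CHRF

chrf = CHRF()

def roundtrip_score(original_en: str, backtranslated_en: str) -> float:
    """chrF between the original English text and its EN->UZ->EN round trip."""
    return chrf.sentence_score(backtranslated_en, [original_en]).score

# Hypothetical usage; translate() is a placeholder for any MT system.
# uz_text = translate(original_en, src="en", tgt="uz")
# back_en = translate(uz_text, src="uz", tgt="en")
# if roundtrip_score(original_en, back_en) < 40.0:  # illustrative threshold
#     queue_for_human_review(original_en, uz_text)
```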
Now to the synthetic domain-specific instructions, another important part of our fine-tuning process, which came with caveats of its own. Models trained primarily on synthetic data perform well with similar synthetic inputs but struggle with authentic human interactions. Early live testing confirmed this gap: the model performed admirably in simulated conversations but faltered in real-world scenarios.
The thing is, synthetic data often looks too polished and artificial, far from how people actually communicate. To prevent our model from sounding unnatural, we again leveraged human expertise: we introduced actual customer dialogues from our banking partner, carefully anonymized and reviewed by both banking and language experts, not just in pre-training but also in the fine-tuning process. Banking professionals ensured the accuracy of financial information, while language specialists maintained conversational naturalness. It's worth noting that even without real data, we could have fixed this "unnaturalness" through post-training alignment, a process of fine-tuning AI models to make their outputs more natural and human-like, essentially teaching the model to communicate more authentically by understanding nuanced conversational contexts. However, that would have required more time and money.
Are we on the right track?
To understand whether our training process was heading in the right direction, we evaluated multiple aspects of our model: instruction comprehension, response generation, factual accuracy, grammar, adherence to Uzbek language, context handling, and safety compliance.
We started with industry-standard English datasets (MMLU, Truthful-QA, Halu-Eval), removing irrelevant domains like medicine, law, and history. We categorized data based on input length (short questions vs. long instructions with context) and output length requirements (from single words to hundreds of words). We removed language-specific samples, translated the remainder into Uzbek, and applied special techniques to reduce translation noise.
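As an illustration, filtering and bucketing of this kind can be done with the datasets library; the excluded-subject list below is illustrative, not our exact configuration.

```python
from datasets import load_dataset

# MMLU rows carry a `subject` field; drop domains irrelevant to banking.
EXCLUDED_SUBJECTS = {
    "clinical_knowledge", "college_medicine", "professional_law",
    "high_school_us_history", "world_religions",
}

mmlu = load_dataset("cais/mmlu", "all", split="test")
filtered = mmlu.filter(lambda row: row["subject"] not in EXCLUDED_SUBJECTS)

# Bucket by input length: short questions vs. long instructions with context.
short_inputs = filtered.filter(lambda row: len(row["question"].split()) < 50)
long_inputs = filtered.filter(lambda row: len(row["question"].split()) >= 50)
```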
Testing the model presented significant challenges. We needed to evaluate numerous aspects simultaneously, and complete test sets contained thousands or tens of thousands of examples. Running full evaluations could consume an entire day of processing time. To address this, we implemented stratified random sampling from the different test sets, creating a manageable evaluation corpus of approximately 2,000 samples (excluding safety evaluations).
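A minimal sketch of such stratified sampling with pandas, assuming a hypothetical "source" column marking which test set each example came from:

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, n_total: int, strata_col: str,
                      seed: int = 42) -> pd.DataFrame:
    """Draw roughly n_total rows, keeping each stratum's share proportional."""
    frac = n_total / len(df)
    return (
        df.groupby(strata_col, group_keys=False)
          .apply(lambda g: g.sample(frac=frac, random_state=seed))
    )

# Hypothetical usage: 'source' marks which test set each example came from.
# eval_corpus = stratified_sample(all_test_sets, n_total=2000, strata_col="source")
```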
To validate our approach, we assessed the correlation between our aggregated metrics and established Elo scores from LM-Arena using known English-language models. Finding a statistically significant (p-value < 0.01) positive correlation confirmed our methodology was sound. Our working assumption was that if this metric approach proved effective for English models, it would, with appropriate adjustments, work similarly for our Uzbek language models.
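The validation itself reduces to a rank-correlation test; here's a sketch with scipy, using made-up numbers in place of our actual metric values and the published Elo scores:

```python
from scipy.stats import spearmanr

# Made-up numbers: our aggregated metric vs. published LM-Arena Elo scores
# for a set of known English-language models.
our_metric = [0.55, 0.62, 0.68, 0.71, 0.77, 0.83, 0.90]
arena_elo = [1050, 1120, 1180, 1150, 1210, 1250, 1287]

rho, p_value = spearmanr(our_metric, arena_elo)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")
# We required a statistically significant positive correlation (p < 0.01).
```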
Field Test: A Bilingual Banking Assistant in Telegram
Practice makes perfect, and training is nothing without feedback collection, so even after extensive fine-tuning we needed somewhere to test it all on real users before any full-scale product shipping. In other words, we needed to get a functional prototype into testing as quickly as possible to establish a managed development cycle (deployment → feedback collection → refinement → redeployment) with short iterations. Without an early prototype, we risked making insufficient improvements.

Expected relative increase in prompt injection response relevance. Comparison between a fine-tuned model, a fine-tuned + alignment model, and a reference model (GPT-4o).
Telegram proved to be an ideal platform for rapid chatbot prototyping. Its asynchronous API, support for integrating various models and scenarios, availability across web, desktop, iOS/Android platforms, and most importantly its widespread popularity in Uzbekistan, made it the clear choice for our needs.
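A minimal sketch of such a prototype with the python-telegram-bot library (v20+ async API); the model call is a placeholder for the on-premise LLM endpoint, not our production integration.

```python
import asyncio

from telegram import Update
from telegram.ext import Application, ContextTypes, MessageHandler, filters


async def generate_reply(text: str) -> str:
    # Placeholder: in a real prototype this calls the on-premise LLM endpoint.
    await asyncio.sleep(0)  # stand-in for the async model request
    return f"(model reply goes here for: {text})"


async def handle_message(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    reply = await generate_reply(update.message.text)
    await update.message.reply_text(reply)


def main() -> None:
    app = Application.builder().token("YOUR_BOT_TOKEN").build()
    # Forward every plain-text message (excluding commands) to the model.
    app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, handle_message))
    app.run_polling()  # long polling keeps the prototype trivial to deploy


if __name__ == "__main__":
    main()
```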
After weeks of fine-tuning and testing, we successfully deployed our bilingual LLM as a Telegram bot for our banking client. The bot effectively handled inquiries in both Uzbek and Russian, guiding users (bank employees) through banking processes like loan applications and account management, and is now ready for implementation in other banking products.
A significant advantage of this approach was the ability to involve native Uzbek speakers from the client’s side in testing, despite our development team lacking native speakers. This collaboration quickly revealed some initial issues — for instance, early versions of the LLM sometimes confused numerals. We identified and fixed these problems promptly thanks to the prototype.
One of the most interesting outcomes was the model’s ability to handle language-switching between Uzbek written in Cyrillic and Latin scripts. For example, a user might type in Uzbek using Cyrillic, and the bot would respond in Uzbek using Latin script, with the conversation continuing seamlessly. This unexpected capability demonstrated the model’s adaptability and enhanced its practical value for users.
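As a side note, a tiny heuristic like the hypothetical helper below is handy for monitoring this behavior in logs; the model itself handled the switching natively, with no such logic:

```python
import re

CYRILLIC = re.compile(r"[\u0400-\u04FF]")
# Latin-script Uzbek also uses the modifier letters in oʻ and gʻ (U+02BB/U+02BC).
LATIN = re.compile(r"[A-Za-z\u02BB\u02BC]")

def detect_script(text: str) -> str:
    """Rough heuristic: label a message as mostly Cyrillic or mostly Latin."""
    cyr = len(CYRILLIC.findall(text))
    lat = len(LATIN.findall(text))
    return "cyrillic" if cyr > lat else "latin"

print(detect_script("Кредит olish uchun nima kerak?"))  # -> "latin" (mixed-script input)
```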
Lessons Learned
1. The Challenge of Low-Resource Languages
Uzbek is a low-resource language, and adapting LLMs for such languages requires creativity. We had to rely on a mix of automated translation, native texts, and synthetic data to achieve desired results.
2. The Importance of Real-World Data
Synthetic data is a great starting point, but real-world interactions are crucial for fine-tuning. Without them, the model may struggle to handle the nuances of human conversation.
3. Nothing beats real
Whenever possible, quickly deploy to a safe but real environment (like a Telegram bot) to gather as much feedback as possible; nothing beats real user experience.
4. The simpler the better
The simplest approaches, which involved training the fewest parameters possible, worked best for us. We found that the more parameters we tried to fine-tune, the more pronounced the negative effects of noise in the data became. This reinforced the idea that, especially in narrow-domain tasks, less can often be more — keeping the model focused and minimizing unnecessary adjustments led to better performance and more reliable results.
…and just in case you’re a business person reading this blog here’s one more bonus lesson learned:
5. Regional Leadership Demands Technological Sovereignty
TBC Uzbekistan’s approach offers a valuable insight for businesses across emerging markets: true market leadership requires technological sovereignty. By investing in proprietary AI solutions tailored to local languages and cultural contexts rather than relying on generic global solutions, TBC established competitive advantages that cannot be easily replicated. The commitment to building internal AI expertise while strategically leveraging external partnerships demonstrates how companies can maintain independence while accelerating innovation.
Business implementations and what’s next
Together with TBC Uzbekistan we implemented trained LLMs in Sales Pro — a “digital employee” that communicates with potential customers and offers them bank products, such as loans. This implementation fully leveraged both our advanced language models and the GOAL framework as the controlling architecture.
When selling, the bot’s task isn’t just to convey information to the customer, but to engage them, discuss their needs, articulate how a loan can address those needs, handle objections, and gently guide the customer toward completing an application. Thanks to TBC Uzbekistan’s substantial investment in AI capabilities and the collaborative effort with Aiphoria providing methodology and platform support, this ambitious region-wide project achieved exceptional results.
As a result, TBC Uzbekistan opened a new sales channel through a Telegram bot. After just one month of operation, it was already showing impressive results: conversion rates to banking product applications (such as loans) were one-third higher, and the acquired users were twice as engaged compared to other acquisition channels, including those requiring human involvement (like telemarketing). The difference is that no staff is required, and the automated product is just as good, or even better, at convincing people to buy.
The next goal in TBC Uzbekistan’s AI roadmap is cold calling. This traditionally requires significant employee time and energy, particularly when reaching out to completely unfamiliar prospects. The “digital salesperson” complements human efforts by handling initial outreach without experiencing fatigue from rejections, maintaining consistent communication quality throughout all interactions. This approach allows for successful customer conversions even with challenging audiences while reducing costs and freeing human staff to focus on more complex, relationship-building aspects of sales that benefit from a personal touch.
***
In the next story we'll cover how we made LLM answers as fast as possible, which is crucial for phone calls. There's a story to be told about adapting both the LLM and the ASR/TTS parts of our system to achieve this. Stay tuned!
Written by: Aiphoria NLP team (Ivan Pastukhov, Nikita Pustovoytov, Anton Voronov, Denis Kirianov)
Technical Team