The SKT consortium has successfully passed the first phase evaluation of the “Sovereign AI Foundation Model Project” and has commenced Phase 2 development. Phase 2 focuses on evolving the large-scale AI model “A.X K1” into an omni-modal system by integrating multi-modal capabilities such as image and voice processing in stages.
Professor Kim Gun-hee’s research team at Seoul National University has been a consistent collaborator with SKT on this multi-modal research. In this contribution, Professor Kim explores the evolutionary trajectory and significance of A.X K1 from both multi-modal and omni-modal perspectives.

On the 15th, the SKT consortium successfully passed the Phase 1 evaluation of the government's "Sovereign AI Foundation Model Project" and secured its spot in Phase 2. During the first-stage presentation held on December 30th of last year, the SKT elite team garnered significant attention by unveiling "A.X K1," Korea's first hyperscale AI model with 500 billion parameters, lauded for its exceptional reasoning performance and multilingual comprehension.
Securing such a powerful large-scale model is a critical milestone in AI development as it enables the rapid and robust development of specialized small-to-medium models across various domains through knowledge distillation.
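To make the distillation idea concrete, here is a minimal sketch, assuming PyTorch and toy tensor shapes rather than SKT's actual training setup, of how a smaller "student" model can be trained to match the softened output distribution of a frozen large "teacher" model.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target (teacher) loss with ordinary cross-entropy on labels."""
    # Soften both distributions with temperature T; KL divergence pulls the
    # student's predictions toward the teacher's. T and alpha are illustrative.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy shapes: a batch of 4 examples over a 10-class slice of the vocabulary.
teacher_logits = torch.randn(4, 10)          # produced by the frozen large model
student_logits = torch.randn(4, 10, requires_grad=True)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```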
From Large Language Model to Omni-modal AI
In less than five years since their debut, Large Language Models (LLMs) have become an indispensable technology woven into daily life. For everyone from industry professionals and the general public to children, using language models like ChatGPT or Gemini has become second nature. These models have expanded into multi-modal systems that jointly understand various data formats such as text, images, and video, and have recently evolved into omni-modal models capable of perceiving audio as well.
As the Latin prefix "omni-" (from "omnis," meaning "all") suggests, an omni-modal model refers to an AI that understands all data formats. The term gained mainstream traction in May 2024 when OpenAI named its new model GPT-4o ("o" for "omni"). While a literal interpretation would imply understanding every type of information, the industry typically uses "multi-modal" for models that handle text and visual data such as images and video, while "omni-modal" encompasses audio as well.
Audio has recently received significant attention in academia and industry as a more intuitive and rapid means of communicating with AI compared to text. However, integrating audio involves technical challenges that go far beyond simply adding a new input format.
First, text-based dialogue is a turn-based, sequential exchange, while a voice conversation is simultaneous and bidirectional. In a text environment, the conversation is structured so that the user enters a prompt, the AI responds, and the user continues only after reading the response. In a voice dialogue, either party can interrupt while the other is speaking and can give immediate feedback with short reactions such as "um," "no," or "right." These short reactions are called "backchannels," and the conversation partner adjusts the flow and content of the conversation in real time accordingly.
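As a rough illustration of what this duplex behavior demands of a system, the following toy simulation (plain Python threads standing in for real audio streams; the backchannel list and timings are invented for the example) streams a reply chunk by chunk while a listener decides whether an incoming user utterance is a backchannel to talk over or an interruption that should end the turn.

```python
import threading
import queue
import time

# Simulated full-duplex loop: the assistant keeps speaking through short
# backchannels ("uh-huh") but stops and yields the turn on a real interruption.
BACKCHANNELS = {"um", "uh-huh", "right", "yeah"}
user_audio = queue.Queue()

def listener(stop_speaking: threading.Event):
    while True:
        utterance = user_audio.get()
        if utterance is None:                 # end of session
            return
        if utterance.lower() in BACKCHANNELS:
            print(f"[user backchannel] {utterance} -> keep talking")
        else:
            print(f"[user interrupts] {utterance} -> stop and yield the turn")
            stop_speaking.set()

def speak(chunks):
    stop = threading.Event()
    t = threading.Thread(target=listener, args=(stop,))
    t.start()
    for chunk in chunks:
        if stop.is_set():                     # user took the turn mid-response
            break
        print(f"[assistant] {chunk}")
        time.sleep(0.2)                       # stands in for audio playback
    user_audio.put(None)
    t.join()

# Simulate a user who backchannels once, then interrupts.
threading.Timer(0.3, user_audio.put, args=("uh-huh",)).start()
threading.Timer(0.7, user_audio.put, args=("wait, what about tomorrow?",)).start()
speak(["Today is sunny,", "with a high of 12 degrees,", "and light wind,", "so it's a good day for a walk."])
```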
Second, AI models must generate dialogue in an audio-friendly manner. ChatGPT, for example, often answers a question by breaking the explanation into bulleted items, but in a voice dialogue an overly long response quickly loses the user's attention. The challenge is therefore to generate responses as concisely as possible while still conveying the core of the answer.
Finally, in a voice dialogue, the AI model must appropriately reflect user instructions that are specific to audio. For instance, it may have to follow requests regarding tone, such as speaking with emotion, speaking as if singing, or adopting a child-like persona. Even for the same content, the model must weigh these different modes of expression to generate a voice response suited to the situation.
Models that start from an existing language model and are specialized for voice dialogue are sometimes distinguished as "spoken language models." Early versions relied on a cascade approach: a speech recognizer converting audio to text was placed in front of a text-in, text-out language model, and a speech synthesizer converting text back to audio was attached at the back end. However, this method required converting audio to text and then regenerating audio, which introduced latency and limited how natural the conversation flow could be. Furthermore, information inherent to the audio (breathing, emotion, volume, speaking rate, and so on) was lost when the audio was converted to text.
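A minimal sketch of that cascade, with hypothetical stand-in functions rather than any particular recognizer, language model, or synthesizer, makes the structural issue visible: each stage adds its own latency, and anything the transcript cannot carry never reaches the language model.

```python
from dataclasses import dataclass
import time

@dataclass
class CascadeSpokenAgent:
    asr: callable      # audio bytes -> text transcript
    llm: callable      # text prompt  -> text response
    tts: callable      # text response -> audio bytes

    def respond(self, user_audio: bytes) -> bytes:
        t0 = time.perf_counter()
        transcript = self.asr(user_audio)       # stage 1: speech-to-text
        reply_text = self.llm(transcript)       # stage 2: text-only reasoning
        reply_audio = self.tts(reply_text)      # stage 3: text-to-speech
        print(f"end-to-end latency: {time.perf_counter() - t0:.3f}s")
        return reply_audio

# Toy stand-ins so the sketch runs without any real models.
agent = CascadeSpokenAgent(
    asr=lambda audio: "what's the weather like today",
    llm=lambda text: "It should be sunny with a light breeze.",
    tts=lambda text: text.encode("utf-8"),
)
print(agent.respond(b"\x00\x01\x02"))
```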
To overcome these limitations, the technology is evolving so that a single integrated language model can process audio information directly. One example is the OmniVinci model released as open source by NVIDIA. It uses a language model as its backbone and proposes several techniques for aligning information from different modalities, such as audio, text, and images, within a common semantic space. Consequently, recent omni-modal models are built by placing a powerful pre-trained language model at the core and fine-tuning it with diverse multi-modal data including audio, which makes obtaining a high-performance language model the key factor in the success of omni-modal development.
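The following schematic sketch shows the general recipe in PyTorch; it is an illustrative assumption, not OmniVinci's actual architecture. Audio features from a pretrained encoder are projected into the language model's token-embedding space and processed by the same backbone as the text tokens.

```python
import torch
import torch.nn as nn

class OmniModalSketch(nn.Module):
    """Schematic: project audio features into the LLM's token-embedding space
    so audio and text flow through one backbone (shapes are illustrative)."""
    def __init__(self, d_audio=128, d_model=512, vocab=32000):
        super().__init__()
        self.token_emb = nn.Embedding(vocab, d_model)            # text tokens
        self.audio_proj = nn.Linear(d_audio, d_model)            # audio adapter
        self.backbone = nn.TransformerEncoder(                   # stand-in for the LLM
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, audio_feats, text_ids):
        # audio_feats: (B, T_audio, d_audio) from a pretrained audio encoder
        # text_ids:    (B, T_text) token ids of the text prompt
        audio_tokens = self.audio_proj(audio_feats)              # into shared space
        text_tokens = self.token_emb(text_ids)
        fused = torch.cat([audio_tokens, text_tokens], dim=1)    # one sequence
        hidden = self.backbone(fused)
        return self.lm_head(hidden)                              # next-token logits

model = OmniModalSketch()
logits = model(torch.randn(2, 50, 128), torch.randint(0, 32000, (2, 16)))
print(logits.shape)  # torch.Size([2, 66, 32000])
```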
An Omni-modal Strategy Linking Language, Audio, and the Field
The SKT consortium's hyperscale AI model "A.X K1" will continue its evolution into an omni-modal model, becoming a core foundation for building an "AI for Everyone" service. SKT can apply the model to the A. ("A-dot") service, which has over ten million subscribers, to support real-time voice dialogue across everyday services such as TMAP, B tv, and A. call summaries. It is also expected to serve as a core technology for Krafton's game AI, enabling a new play experience in which multiple users carry out a shared mission through voice dialogue inside the game. In addition, 42dot is expected to enhance its mobility AI on top of the omni-modal model, improving the driving experience for both drivers and passengers.
For sovereign AI to succeed, Korea needs to be able to fully utilize the core data over which the nation holds sovereignty. Most national, public, and industrial data are unstructured and come in a variety of formats; with omni-modal models, such data can be learned from and put into operation directly, without depending on external platforms. Furthermore, omni-modal models can connect entire industries through a single model and, in the long term, will evolve into actionable AI that reaches into the physical world. In other words, securing a successful omni-modal model can become the foundation for strengthening physical-infrastructure sovereignty alongside digital sovereignty.
Much is anticipated for the future of South Korea's sovereign AI, led by A.X K1 and its successful evolution into an omni-modal model.
