KAIST

NEWS

AI+model

KAIST researcher Se Jin Park develops 'SpeechSSM,' opening up possibilities for a 24-hour AI voice assistant.
<(From left) Prof. Yong Man Ro and Ph.D. candidate Se Jin Park>
Se Jin Park, a researcher from Professor Yong Man Ro's team at KAIST, has announced 'SpeechSSM,' a spoken language model capable of generating long-duration speech that sounds natural and remains consistent. An efficient processing technique based on linear sequence modeling overcomes the limitations of existing spoken language models, enabling high-quality speech generation without time constraints. It is expected to be widely used in podcasts, audiobooks, and voice assistants thanks to its ability to generate natural, long-duration speech as humans do.
Recently, Spoken Language Models (SLMs) have been spotlighted as a next-generation technology that surpasses the limitations of text-based language models by learning human speech without text, allowing them to understand and generate both linguistic and non-linguistic information. However, existing models showed significant limitations in generating the long-duration content required for podcasts, audiobooks, and voice assistants. Now, a KAIST researcher has succeeded in overcoming these limitations by developing 'SpeechSSM,' which enables consistent and natural speech generation without time constraints.
KAIST (President Kwang Hyung Lee) announced on July 3 that Ph.D. candidate Se Jin Park from Professor Yong Man Ro's research team in the School of Electrical Engineering has developed 'SpeechSSM,' a spoken language model capable of generating long-duration speech. This research is set to be presented as an oral paper at ICML (International Conference on Machine Learning) 2025, one of the top machine learning conferences, a distinction given to approximately 1% of all submitted papers. This not only demonstrates her outstanding research ability but also once again showcases KAIST's world-leading AI research capabilities.
A major advantage of Spoken Language Models (SLMs) is their ability to process speech directly, without intermediate text conversion. This lets them leverage the unique acoustic characteristics of human speakers and generate high-quality speech rapidly, even in large-scale models. However, existing models struggled to maintain semantic and speaker consistency over long-duration speech: capturing very detailed information requires breaking speech down into fine fragments, which increases the 'speech token resolution' and memory consumption.
To solve this problem, Se Jin Park developed 'SpeechSSM,' a spoken language model using a Hybrid State-Space Model, designed to efficiently process and generate long speech sequences. This model employs a 'hybrid structure' that alternately places 'attention layers' focusing on recent information and 'recurrent layers' that remember the overall narrative flow (long-term context). This allows the story to flow smoothly without losing coherence even when generating speech for a long time. Furthermore, memory usage and computational load do not increase sharply with input length, enabling stable and efficient training and the generation of long-duration speech.
SpeechSSM effectively processes unbounded speech sequences by dividing speech data into short, fixed units (windows), processing each unit independently, and then combining them to create long speech. Additionally, in the speech generation phase, it uses a 'Non-Autoregressive' audio synthesis model (SoundStorm), which rapidly generates multiple parts at once instead of slowly creating one character or one word at a time, enabling the fast generation of high-quality speech.
While existing work typically evaluated speech models on short samples of about 10 seconds, Se Jin Park created new evaluation tasks for generating up to 16 minutes of speech, based on her self-built benchmark dataset, 'LibriSpeech-Long.'
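To make the 'hybrid structure' described above more concrete, the following is a minimal toy sketch in Python, not the SpeechSSM implementation. It assumes scalar features, a simple exponential-decay recurrence standing in for a state-space layer, and dot-product attention restricted to a short sliding window. The names `recurrent_layer`, `local_attention_layer`, and `hybrid_stack`, and the parameters `decay` and `window`, are hypothetical and chosen only for illustration.

```python
import math
from typing import List


def recurrent_layer(xs: List[float], decay: float = 0.9) -> List[float]:
    """Carry one fixed-size state across the whole sequence (long-term context)."""
    state = 0.0
    out = []
    for x in xs:
        # The state is a single number, so memory does not grow with length.
        state = decay * state + (1.0 - decay) * x
        out.append(state)
    return out


def local_attention_layer(xs: List[float], window: int = 4) -> List[float]:
    """Softmax attention over only the most recent `window` positions."""
    if window <= 0:
        raise ValueError("window must be positive")
    out = []
    for t, query in enumerate(xs):
        keys = xs[max(0, t - window + 1): t + 1]
        scores = [query * k for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(sum(w * k for w, k in zip(weights, keys)) / total)
    return out


def hybrid_stack(xs: List[float], num_blocks: int = 2,
                 window: int = 4, decay: float = 0.9) -> List[float]:
    """Alternate recurrent and local-attention layers, block after block."""
    hidden = list(xs)
    for _ in range(num_blocks):
        hidden = recurrent_layer(hidden, decay)
        hidden = local_attention_layer(hidden, window)
    return hidden


if __name__ == "__main__":
    sequence = [0.1 * i for i in range(20)]
    print(hybrid_stack(sequence, num_blocks=3))
```

In this sketch, each position costs at most `window` attention comparisons plus one state update, so per-step work and memory stay bounded however long the sequence becomes. That is the property the article attributes to the hybrid design, shown here only in simplified form.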
In contrast to PPL (perplexity), an existing speech model evaluation metric that only indicates grammatical correctness, she proposed new evaluation metrics such as 'SC-L (semantic coherence over time)' to assess content coherence over time and 'N-MOS-T (naturalness mean opinion score over time)' to evaluate naturalness over time, enabling more effective and precise evaluation.
These new evaluations confirmed that speech generated by SpeechSSM consistently featured the specific individuals mentioned in the initial prompt, and that new characters and events unfolded naturally and in a contextually consistent way, even over long-duration generation. This contrasts sharply with existing models, which tended to lose their topic and fall into repetition during long-duration generation.
Ph.D. candidate Se Jin Park explained, "Existing spoken language models had limitations in long-duration generation, so our goal was to develop a spoken language model capable of generating long-duration speech for actual human use." She added, "This research achievement is expected to greatly contribute to various types of voice content creation and voice AI fields like voice assistants, by maintaining consistent content in long contexts and responding more efficiently and quickly in real time than existing methods."
This research, with Se Jin Park as first author, was conducted in collaboration with Google DeepMind and is scheduled for oral presentation at ICML (International Conference on Machine Learning) 2025 on July 16.
Paper Title: Long-Form Speech Generation with Spoken Language Models
DOI: 10.48550/arXiv.2412.18603
Ph.D. candidate Se Jin Park has demonstrated outstanding research capabilities as a member of Professor Yong Man Ro's MLLM (multimodal large language model) research team, through her work integrating vision, speech, and language. Her achievements include a spotlight paper presentation at CVPR (Computer Vision and Pattern Recognition) 2024 and an Outstanding Paper Award at ACL (Association for Computational Linguistics) 2024.
For more information, you can refer to the publication and accompanying demo: SpeechSSM Publications.
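The idea of scoring coherence "over time," as the SC-L metric mentioned in this article does, can be illustrated with a small Python sketch. This is not the paper's definition of SC-L. It assumes the generated speech has already been transcribed to text, and it uses bag-of-words cosine similarity as a stand-in for a learned text embedding. The function names and the `segment_words` parameter are hypothetical.

```python
import math
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def coherence_over_time(prompt: str, transcript: str,
                        segment_words: int) -> List[float]:
    """Score each consecutive transcript segment against the prompt."""
    if segment_words <= 0:
        raise ValueError("segment_words must be positive")
    words = transcript.split()
    prompt_vec = bag_of_words(prompt)
    scores = []
    for start in range(0, len(words), segment_words):
        segment = " ".join(words[start:start + segment_words])
        scores.append(cosine_similarity(prompt_vec, bag_of_words(segment)))
    return scores


if __name__ == "__main__":
    prompt = "alice met bob"
    transcript = "alice met bob again then rain fell hard"
    # One score per 4-word segment; a falling curve suggests topic drift.
    print(coherence_over_time(prompt, transcript, segment_words=4))
```

Reading the scores as a curve over segments, rather than as a single average, is what lets such a measure reveal a model that starts on topic and then drifts, which is the failure mode the article describes for earlier models.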
2025.07.04 View 407
King Saud University and KAIST Discuss Strategic AI Partnership
<From left> President Abdulla Al-Salman (King Saud University), President Kwang Hyung Lee (KAIST)
KAIST (President Kwang Hyung Lee) and King Saud University (President Abdulla Al-Salman) held a meeting on July 3 at the KAIST campus in Seoul and agreed to pursue strategic cooperation in AI and digital platform development.
The global AI landscape is increasingly polarized between closed models developed by the U.S. and China's nationally focused technology ecosystems. In this context, many neutral countries have consistently called for an alternative third model that promotes both technological diversity and open access. President Lee has previously advocated for a "Tripartite Platform Strategy" (三分之計), proposing an international collaboration framework based on open-source principles to break free from binary digital power structures and foster cooperative coexistence.
This KAIST-KSU collaboration represents a step toward developing a new, inclusive AI model. The collaboration aims to establish an innovative multilateral framework, especially across the MENA region, Japan, Korea, and Southeast Asia, by building an open-source-based AI alliance.
Both institutions bring complementary strengths to the table. Saudi Arabia possesses large-scale capital and digital infrastructure, while Korea leads in core AI and semiconductor technologies, applied research, and talent cultivation. Together, the two nations aim to establish a sustainable collaboration model that creates a virtuous cycle of investment, technology, and talent. This initiative is expected to contribute to the development of an open AI platform and promote diversity in the global AI ecosystem.
During the meeting, the two sides discussed key areas of future cooperation, including:
· Joint development of open-source AI technologies and digital platforms
· Launch of a KAIST-KSU dual graduate degree program
· Expansion of exchange programs for students, faculty, and researchers
· Collaborative research in basic science and STEM disciplines
In particular, the two institutions discussed establishing a joint AI research center to co-develop open AI models and explore practical industrial applications. The goal is to broaden access to AI technology and create an inclusive innovation environment for more countries and institutions.
President Abdulla Al-Salman stated, "Under Saudi Vision 2030, we are driving innovation in science and technology through new leadership, openness, and strategic investment. This partnership with KAIST will serve as a critical foundation for building a competitive AI ecosystem in the Middle East."
President Kwang Hyung Lee emphasized, "By combining Saudi Arabia's leadership, market, and investment capacity with KAIST's technological innovation and the rich talent pools from both countries, we will significantly contribute to diversifying the global AI ecosystem."
Both leaders further noted, "Through joint research leading to an independent AI model, our two institutions could establish a new axis beyond the existing US-China digital order, realizing a 'Tripartite AI Strategy' that will propel us into global markets extending far beyond the MENA and ASEAN regions."
KAIST and KSU plan to formalize this agreement by signing an MOU in the near future, followed by concrete actions such as launching the joint research institute and global talent development programs.
This collaboration was initiated under the Korea Foundation's Distinguished Guests Invitation Program, overseen by the Ministry of Foreign Affairs, and is expected to grow into a long-term strategic partnership with continued support from KF.
About King Saud University (KSU)
Founded in 1957, KSU is Saudi Arabia's first and leading national university. As a top research-oriented institution in the Middle East, it has achieved international recognition in fields such as AI, energy, and biotechnology. It plays a central role in nurturing talent and driving innovation aligned with Saudi Arabia's Vision 2030, and is expanding global partnerships to further strengthen its research capabilities.
About the Korea Foundation (KF)
Established in 1991 under the Ministry of Foreign Affairs, the Korea Foundation is a public diplomacy institution dedicated to strengthening international understanding and friendship with Korea. KF plays a key role in expanding Korea's soft power through academic and cultural exchange, people-to-people networks, and global Korean studies programs. Its Distinguished Guests Invitation Program fosters strategic partnerships with global leaders in government, academia, and industry.
2025.07.04 View 293

KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea T.042-350-2114 F.042-350-2210(2220)