Recently, Spoken Language Models (SLMs) have been spotlighted as a next-generation technology that surpasses the limitations of text-based language models by learning human speech without text, enabling them to understand and generate both linguistic and non-linguistic information. However, existing models showed significant limitations in generating the long-duration content required for podcasts, audiobooks, and voice assistants. Now, a KAIST researcher has overcome these limitations by developing 'SpeechSSM,' which enables consistent and natural speech generation without time constraints.
KAIST (President Kwang Hyung Lee) announced on the 3rd of July that Ph.D. candidate Se Jin Park from Professor Yong Man Ro's research team in the School of Electrical Engineering has developed 'SpeechSSM,' a spoken language model capable of generating long-duration speech.
This research is set to be presented as an oral paper at ICML (International Conference on Machine Learning) 2025, one of the top machine learning conferences, placing it among approximately the top 1% of all submitted papers. This not only attests to outstanding research ability but also serves as an opportunity to once again demonstrate KAIST's world-leading AI research capabilities.
A major advantage of Spoken Language Models (SLMs) is their ability to directly process speech without intermediate text conversion, leveraging the unique acoustic characteristics of human speakers, allowing for the rapid generation of high-quality speech even in large-scale models.
However, existing models struggled to maintain semantic and speaker consistency over long-duration speech: capturing very fine acoustic detail requires breaking speech into fine token fragments, which inflates the 'speech token resolution' and drives up memory consumption as sequences grow longer.
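A rough back-of-the-envelope calculation illustrates why fine-grained speech tokens strain conventional attention-based models. The token rate below (25 tokens per second) is an illustrative assumption, not a figure from the research:

```python
# Illustrative arithmetic only: why long speech at a fine token
# resolution is expensive for full self-attention, which scores
# every pair of positions, while a recurrent/state-space layer
# carries a fixed-size state regardless of sequence length.

def num_tokens(seconds: float, tokens_per_sec: float = 25.0) -> int:
    """Hypothetical token count for a speech clip at an assumed token rate."""
    return int(seconds * tokens_per_sec)

def attention_pairs(n: int) -> int:
    """Token pairs a causal self-attention layer must score: n*(n+1)/2."""
    return n * (n + 1) // 2

short = num_tokens(10)       # ~10-second clip -> 250 tokens
long = num_tokens(16 * 60)   # 16-minute clip -> 24,000 tokens

# Going from 10 seconds to 16 minutes multiplies attention work
# by roughly four orders of magnitude, while a recurrent layer's
# per-step cost stays constant.
ratio = attention_pairs(long) / attention_pairs(short)
```

Under these toy assumptions, generating 16 minutes instead of 10 seconds multiplies the pairwise attention work by a factor of over 9,000, which is the kind of growth the hybrid design aims to avoid.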
To solve this problem, Se Jin Park developed 'SpeechSSM,' a spoken language model using a Hybrid State-Space Model, designed to efficiently process and generate long speech sequences.
This model employs a 'hybrid structure' that alternately places 'attention layers' focusing on recent information and 'recurrent layers' that remember the overall narrative flow (long-term context). This allows the story to flow smoothly without losing coherence even when generating speech for a long time. Furthermore, memory usage and computational load do not increase sharply with input length, enabling stable and efficient learning and the generation of long-duration speech.
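Conceptually, the alternation described above can be sketched in a few lines of Python. This is a toy illustration of the idea only, with made-up 1-D "features" and layer counts, not the actual SpeechSSM architecture:

```python
# Toy sketch of a hybrid stack: layers attending over a short recent
# window alternate with recurrent layers that fold the whole history
# into a fixed-size state. All numbers here are illustrative.

def local_attention_layer(seq, window=3):
    """Each position averages its last `window` inputs (recent context)."""
    out = []
    for i in range(len(seq)):
        chunk = seq[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def recurrent_layer(seq, decay=0.9):
    """A fixed-size running state summarizes everything seen so far."""
    state, out = 0.0, []
    for x in seq:
        state = decay * state + (1 - decay) * x
        out.append(state)
    return out

def hybrid_stack(seq, depth=4):
    """Alternate local-attention and recurrent layers, as in a hybrid model."""
    for layer_idx in range(depth):
        if layer_idx % 2 == 0:
            seq = local_attention_layer(seq)
        else:
            seq = recurrent_layer(seq)
    return seq
```

The key property the sketch captures is that the recurrent layers' memory does not grow with sequence length, so cost per generated token stays flat no matter how long the speech runs.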
Additionally, in the speech generation phase, it uses a 'Non-Autoregressive' audio synthesis model (SoundStorm), which rapidly generates multiple parts at once instead of slowly creating one character or one word at a time, enabling the fast generation of high-quality speech.
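The non-autoregressive idea can be illustrated with a small toy decoder. This is not SoundStorm itself; the predictor below is a random stand-in, and only the commit-the-most-confident-positions-in-parallel loop reflects the general technique:

```python
# Toy sketch of non-autoregressive (parallel iterative) decoding:
# all masked positions are predicted at once, and over a few rounds
# the most confident predictions are committed, instead of emitting
# one token at a time. The "model" here is a deterministic stand-in.
import random

MASK = None

def toy_predict(tokens):
    """Stand-in predictor: a (token, confidence) proposal per masked slot."""
    rng = random.Random(0)
    return {i: (rng.randrange(100), rng.random())
            for i, t in enumerate(tokens) if t is MASK}

def parallel_decode(length=16, rounds=4):
    tokens = [MASK] * length
    for _ in range(rounds):
        proposals = toy_predict(tokens)
        if not proposals:
            break
        # Commit the top half of remaining positions by confidence.
        keep = max(1, len(proposals) // 2)
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:keep]:
            tokens[i] = proposals[i][0]
    # Fill any leftover masked positions in one final parallel pass.
    for i, (tok, _) in toy_predict(tokens).items():
        tokens[i] = tok
    return tokens
```

A few parallel rounds replace hundreds of sequential steps, which is why this style of decoding can generate long audio quickly.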
While existing models were typically evaluated on short speech of about 10 seconds, Se Jin Park created new evaluation tasks for speech generation based on the team's self-built benchmark dataset, 'LibriSpeech-Long,' which supports evaluating speech of up to 16 minutes.
In contrast to PPL (Perplexity), an existing speech-model evaluation metric that mainly reflects grammatical correctness, she proposed new evaluation metrics such as 'SC-L (semantic coherence over time)' to assess whether content stays coherent over time, and 'N-MOS-T (naturalness mean opinion score over time)' to evaluate naturalness over time, enabling more effective and precise evaluation.
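The idea behind a coherence-over-time metric can be sketched simply. The actual SC-L metric is based on learned semantic representations; the word-overlap (Jaccard) similarity below is only a crude stand-in used to show the windowed structure of such a score:

```python
# Hedged sketch of the *idea* behind coherence-over-time evaluation:
# score how close each successive window of a long transcript stays
# to the initial prompt. Jaccard word overlap is a stand-in for the
# semantic embeddings a real metric like SC-L would use.

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two texts (embedding stand-in)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def coherence_over_time(prompt: str, transcript: str, window_words: int = 20):
    """Similarity of each consecutive transcript window to the prompt."""
    words = transcript.split()
    windows = [" ".join(words[i:i + window_words])
               for i in range(0, len(words), window_words)]
    return [jaccard(prompt, w) for w in windows]
```

A model that drifts off-topic would show these per-window scores decaying as generation proceeds, whereas a coherent model keeps them roughly flat.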
Through these new evaluations, it was confirmed that speech generated by the SpeechSSM spoken language model consistently featured specific individuals mentioned in the initial prompt, and new characters and events unfolded naturally and contextually consistently, despite long-duration generation. This contrasts sharply with existing models, which tended to easily lose their topic and exhibit repetition during long-duration generation.
Ph.D. candidate Se Jin Park explained, "Existing spoken language models had limitations in long-duration generation, so our goal was to develop a spoken language model capable of generating long-duration speech for actual human use." She added, "This research achievement is expected to greatly contribute to various types of voice content creation and voice AI fields like voice assistants, by maintaining consistent content in long contexts and responding more efficiently and quickly in real time than existing methods."
This research, with Se Jin Park as the first author, was conducted in collaboration with Google DeepMind and is scheduled to be presented as an oral presentation at ICML (International Conference on Machine Learning) 2025 on July 16th.
Ph.D. candidate Se Jin Park has demonstrated outstanding research capabilities as a member of Professor Yong Man Ro's MLLM (multimodal large language model) research team, through her work integrating vision, speech, and language. Her achievements include a spotlight paper presentation at 2024 CVPR (Computer Vision and Pattern Recognition) and an Outstanding Paper Award at 2024 ACL (Association for Computational Linguistics).
For more information, you can refer to the publication and accompanying demo: SpeechSSM Publications.