The latest generative AI models, such as OpenAI's ChatGPT-4 and Google's Gemini 2.5, require not only high memory bandwidth but also large memory capacity. This is why companies that operate generative AI clouds, such as Microsoft and Google, purchase hundreds of thousands of NVIDIA GPUs. To address the core challenges of building such high-performance AI infrastructure, Korean researchers have developed an NPU (Neural Processing Unit)* core technology that improves the inference performance of generative AI models by more than 60% on average while consuming approximately 44% less power than the latest GPUs.
*NPU (Neural Processing Unit): An AI-specific semiconductor chip designed to rapidly process artificial neural networks.
On the 4th, Professor Jongse Park's research team from the KAIST School of Computing, in collaboration with HyperAccel Inc. (a startup founded by Professor Joo-Young Kim of the School of Electrical Engineering), announced that it has developed a high-performance, low-power NPU core technology specialized for generative AI clouds such as ChatGPT.
The technology proposed by the research team has been accepted by the '2025 International Symposium on Computer Architecture (ISCA 2025)', a top-tier international conference in the field of computer architecture.
The key objective of this research is to improve the performance of large-scale generative AI services by making the inference process more lightweight, while minimizing accuracy loss and resolving memory bottlenecks. The research is highly regarded for its integrated design of AI semiconductors and AI system software, both of which are key components of AI infrastructure.
While existing GPU-based AI infrastructure requires multiple GPU devices to meet high bandwidth and capacity demands, this technology uses KV cache quantization* to build AI infrastructure of the same level with fewer NPU devices. Because the KV cache accounts for most of the memory usage, quantizing it significantly reduces the cost of building generative AI clouds.
*KV Cache (Key-Value Cache) Quantization: Refers to reducing the size of the data held in the KV cache, a type of temporary storage space used to improve performance when running generative AI models (e.g., converting a 16-bit number to a 4-bit number reduces the data size to 1/4 of the original).
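To make the size arithmetic in the footnote concrete, the following is a minimal sketch of plain uniform 4-bit quantization in Python. It is not the hybrid quantization algorithm proposed in the paper; the function names, the sample values, and the choice of one scale and minimum per group are assumptions made purely for illustration.

```python
# Illustrative sketch only: simple uniform 4-bit quantization of a small
# group of values. This is NOT the algorithm from the paper; all names
# here are hypothetical.

def quantize_4bit(values):
    """Map floats to integer codes in 0..15, returning (codes, scale, minimum)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def dequantize_4bit(codes, scale, lo):
    """Recover approximate floats from 4-bit codes."""
    return [c * scale + lo for c in codes]

def pack_nibbles(codes):
    """Store two 4-bit codes in each byte."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))

# Hypothetical slice of cached key values.
keys = [0.12, -0.53, 0.98, 0.07, -0.91, 0.44, 0.30, -0.25]

codes, scale, lo = quantize_4bit(keys)
packed = pack_nibbles(codes)
restored = dequantize_4bit(codes, scale, lo)

fp16_bytes = len(keys) * 2   # 16 bits per value
int4_bytes = len(packed)     # 4 bits per value
max_err = max(abs(a - b) for a, b in zip(keys, restored))

print(f"16-bit: {fp16_bytes} bytes, 4-bit: {int4_bytes} bytes")
print(f"largest reconstruction error: {max_err:.4f}")
```

In this sketch the packed codes occupy one quarter of the 16-bit storage. A real scheme also has to store the scale and minimum for each group, so the overall saving is somewhat smaller, and the reconstruction error is the price paid for the reduced footprint.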
The research team designed the technology so that it integrates with the memory interface without changing the operational logic of existing NPU architectures. The hardware architecture not only implements the proposed quantization algorithm but also adopts page-level memory management* to make efficient use of limited memory bandwidth and capacity, and introduces a new encoding technique optimized for the quantized KV cache.
*Page-level memory management technique: Virtualizes memory addresses, as the CPU does, to allow consistent access within the NPU.
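The idea of page-level management can be sketched in software as a page table that maps each request's logical token positions to fixed-size physical pages. The Python below is only a conceptual model under assumed names (`PagedKVCache`, `page_size`, and so on are hypothetical); it does not describe the hardware design in the paper.

```python
# Illustrative sketch only: a software model of page-level KV cache
# management. Class and parameter names are hypothetical.

class PagedKVCache:
    def __init__(self, num_pages, page_size):
        self.page_size = page_size             # tokens stored per page
        self.free_pages = list(range(num_pages))
        self.page_tables = {}                  # request id -> physical page ids
        self.lengths = {}                      # request id -> tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one more token; return (physical_page, offset)."""
        table = self.page_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.page_size == 0:       # no page yet, or last page is full
            if not self.free_pages:
                raise MemoryError("no free KV cache pages")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def translate(self, request_id, token_index):
        """Translate a logical token index to (physical_page, offset)."""
        if not 0 <= token_index < self.lengths.get(request_id, 0):
            raise IndexError("token index out of range")
        table = self.page_tables[request_id]
        return table[token_index // self.page_size], token_index % self.page_size

    def release(self, request_id):
        """Return all pages of a finished request to the free list."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

# Two requests grow in an interleaved fashion and share one page pool.
cache = PagedKVCache(num_pages=4, page_size=2)
cache.append_token("a")
cache.append_token("b")
cache.append_token("a")
slot = cache.append_token("a")                 # third token of "a" needs a new page

pages_a = set(cache.page_tables["a"])
pages_b = set(cache.page_tables["b"])
lookup = cache.translate("a", 2)

cache.release("a")
free_after_release = len(cache.free_pages)
```

Because each request sees a contiguous range of token indices while its pages can sit anywhere in physical memory, capacity is allocated in small units as sequences grow and is reclaimed as soon as a request finishes.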
Furthermore, when an AI cloud is built on NPUs, which offer better cost and power efficiency than the latest GPUs, their high-performance, low-power characteristics are expected to significantly reduce operating costs.
Professor Jongse Park stated, "This research, through joint work with HyperAccel Inc., found a solution in generative AI inference lightweighting algorithms and succeeded in developing a core NPU technology that can solve the 'memory problem.' Through this technology, we implemented an NPU with over 60% improved performance compared to the latest GPUs by combining quantization techniques that reduce memory requirements while maintaining inference accuracy, and hardware designs optimized for this."
He further emphasized, "This technology has demonstrated the possibility of implementing high-performance, low-power infrastructure specialized for generative AI, and is expected to play a key role not only in AI cloud data centers but also in the AI transformation (AX) environment represented by dynamic, executable AI such as 'Agentic AI'."
This research was presented by Ph.D. student Minsu Kim and Dr. Seongmin Hong from HyperAccel Inc. as co-first authors at the '2025 International Symposium on Computer Architecture (ISCA)' held in Tokyo, Japan, from June 21 to June 25. ISCA, a globally renowned academic conference, received 570 paper submissions this year, with only 127 papers accepted (an acceptance rate of 22.7%).
※Paper Title: Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
※DOI: https://doi.org/10.1145/3695053.3731019
Meanwhile, this research was supported by the National Research Foundation of Korea's Excellent Young Researcher Program, the Institute for Information & Communications Technology Planning & Evaluation (IITP), and the AI Semiconductor Graduate School Support Project.