Liquid AI Unveils LFM2-Audio-1.5B: A Breakthrough in Audio-Language Models

Liquid AI has released LFM2-Audio-1.5B, an audio-language foundation model that understands and generates both speech and text end to end. The model targets low-latency, real-time applications on resource-constrained devices, extending the LFM2 family with audio capabilities while keeping a compact footprint.

What’s New in LFM2-Audio?

LFM2-Audio extends the 1.2B-parameter LFM2 language backbone to treat audio and text as first-class sequence tokens. It disentangles input and output audio representations: incoming audio is encoded as continuous embeddings projected directly from raw waveform chunks (approximately 80 ms each), while generated audio is produced as discrete codec tokens. This sidesteps discretization artifacts on the input side while keeping training and generation autoregressive across both modalities.
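
To make the input path concrete, here is a minimal sketch of splitting a waveform into ~80 ms chunks and projecting each one to a continuous embedding. The 16 kHz sample rate, embedding width, and linear projection are illustrative assumptions, not the model's actual front end (which is a FastConformer encoder):

    import torch

    SAMPLE_RATE = 16_000    # assumed input rate, for illustration only
    CHUNK_MS = 80           # ~80 ms waveform chunks, per the announcement
    CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000   # 1,280 samples per chunk
    EMBED_DIM = 512         # illustrative embedding width

    # Stand-in projection; the real model uses a FastConformer encoder here.
    project = torch.nn.Linear(CHUNK_SAMPLES, EMBED_DIM)

    def waveform_to_embeddings(wave: torch.Tensor) -> torch.Tensor:
        """Split a mono waveform into ~80 ms chunks and embed each continuously."""
        n_chunks = wave.shape[-1] // CHUNK_SAMPLES
        chunks = wave[..., : n_chunks * CHUNK_SAMPLES].reshape(n_chunks, CHUNK_SAMPLES)
        return project(chunks)   # (n_chunks, EMBED_DIM); no input discretization

    audio = torch.randn(SAMPLE_RATE * 4)         # 4 s of dummy audio
    print(waveform_to_embeddings(audio).shape)   # torch.Size([50, 512])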

Implementation Highlights

The released LFM2-Audio checkpoint has the following specifications (gathered into a small configuration sketch after the list):

  • Backbone: Built on LFM2 (hybrid convolution + attention) with 1.2B parameters dedicated to language.
  • Audio Encoder: Utilizes FastConformer (approx. 115M parameters).
  • Audio Decoder: Implements RQ-Transformer for predicting discrete Mimi codec tokens across 8 codebooks.
  • Context: Supports a 32,768-token context, with a vocabulary of 65,536 text tokens and 2049×8 audio tokens (8 codebooks).
  • Precision and License: Weights in bfloat16, released under the LFM Open License v1.0; English-only at launch.
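
For quick reference, the published figures above can be collected in a plain configuration object. This is a bookkeeping sketch, not code from the liquid-audio package:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class LFM2AudioSpec:
        """Published LFM2-Audio-1.5B figures, gathered for reference."""
        backbone_params: float = 1.2e9   # LFM2 hybrid conv + attention backbone
        encoder_params: float = 115e6    # FastConformer audio encoder (approx.)
        context_tokens: int = 32_768     # maximum context length
        text_vocab: int = 65_536         # text vocabulary size
        audio_vocab: int = 2_049         # entries per audio codebook
        audio_codebooks: int = 8         # Mimi codec codebooks per step
        dtype: str = "bfloat16"          # released precision

    spec = LFM2AudioSpec()
    # Each audio step draws from 8 codebooks of 2,049 entries each.
    print(spec.audio_codebooks * spec.audio_vocab)   # 16392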

Two Generation Modes for Real-Time Applications

LFM2-Audio offers two primary generation modes for diverse applications:

  1. Interleaved Generation
    Optimized for live speech-to-speech chat, this mode lets the model alternate between text and audio tokens in a single stream, so audio playback can begin before the text reply is complete, reducing perceived latency (see the decoding sketch after this list).

  2. Sequential Generation
    Designed for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS), this mode facilitates modality switching on a turn-by-turn basis.
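
The interleaved mode can be pictured as a single decoding loop that yields tokens of either modality. The following is a conceptual sketch only; model.step and model.eos_id are hypothetical names, not the liquid-audio API:

    from typing import Iterator, Literal

    Token = tuple[Literal["text", "audio"], int]

    def interleaved_decode(model, prompt: list[Token], max_steps: int = 512) -> Iterator[Token]:
        """One autoregressive model emits text and audio tokens in a single
        stream, so audio can start playing before the text turn finishes.
        `model.step` is a hypothetical single-token sampler."""
        history = list(prompt)
        for _ in range(max_steps):
            modality, token = model.step(history)   # model picks the next modality
            yield modality, token
            history.append((modality, token))
            if modality == "text" and token == model.eos_id:
                break

    # A consumer routes tokens as they arrive: text to the transcript,
    # audio tokens to the Mimi decoder for immediate playback.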

Liquid AI also provides a Python package (liquid-audio) and a Gradio demo for reproducing both modes.

Remarkable Low Latency

With a reported end-to-end latency of under 100 ms from the end of a 4-second audio query to the first audible response, LFM2-Audio is among the fastest models in its category; Liquid AI states it outpaces even smaller models of under 1.5B parameters.

Benchmark Performance

Evaluated on VoiceBench, a suite of audio-assistant benchmarks, LFM2-Audio achieved an overall score of 56.78. Noteworthy per-task results include:

  • AlpacaEval: 3.71
  • CommonEval: 3.49
  • WildVoice: 3.17

These scores place the model ahead of larger systems such as Qwen2.5-Omni-3B and Moshi-7B on the same suite.

Why This Matters in Voice AI

The key advantage of LFM2-Audio is its unified design: a single model replaces the traditional ASR → LLM → TTS chain, removing per-stage hand-offs and cutting cumulative latency, while interleaved decoding lets audio emission begin sooner. Developers can therefore build simpler, faster-responding applications that still cover multiple tasks such as recognition, synthesis, classification, and conversation.
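
A back-of-the-envelope comparison shows why collapsing the pipeline matters. The cascade stage latencies below are assumed, illustrative figures; only the 100 ms number comes from Liquid AI's report:

    # Assumed, illustrative stage latencies (ms) -- not measured figures.
    CASCADE_MS = {"asr": 150, "llm_first_token": 200, "tts_first_audio": 120}
    UNIFIED_FIRST_AUDIO_MS = 100   # reported LFM2-Audio end-to-end figure

    cascade_total = sum(CASCADE_MS.values())   # stages run back-to-back
    print(f"cascade: {cascade_total} ms to first audio")           # 470 ms
    print(f"unified: {UNIFIED_FIRST_AUDIO_MS} ms to first audio")  # 100 ms
    # One decoding loop replaces three hand-offs, and interleaving lets
    # audio tokens flow before the text turn is finished.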

Conclusion

Liquid AI’s LFM2-Audio-1.5B represents a significant advancement in the realm of audio-language models. Its ability to facilitate low-latency, high-quality communication on resource-limited devices places it at the forefront of the ongoing evolution in voice AI technology.

Related Keywords

  • Audio-Language Models
  • Low Latency AI
  • Automatic Speech Recognition (ASR)
  • Text-to-Speech (TTS)
  • Real-Time Voice Assistants
  • Machine Learning Applications
  • Natural Language Processing (NLP)

