&gt; Interesting that there isn&#x27;t a mention of Orpheus as prior art eitherLlasa-3b (<a href="https:&#x2F;&#x2F;huggingface.co&#x2F;HKUSTAudio&#x2F;Llasa-3B" rel="nofollow">https:&#x2F;&#x2F;huggingface.co&#x2F;HKUSTAudio&#x2F;Llasa-3B</a>) came out before Orpheus (<a href="https:&#x2F;&#x2F;huggingface.co&#x2F;canopylabs&#x2F;orpheus-3b-0.1-ft" rel="nofollow">https:&#x2F;&#x2F;huggingface.co&#x2F;canopylabs&#x2F;orpheus-3b-0.1-ft</a>).&gt; it&#x27;s the exact same thing.They&#x27;re very similar, but they&#x27;re not the exact same thing.Llasa uses xcodec2, a much simpler, lossless 16khz wav codec. This makes it superior for one-shot voice cloning.Orpheus&#x27; 24khz snac codec is lossy which makes it difficult to use for zero-shot cloning as the reference audio gets degraded during tokenization. You can test this here:
<a href="https:&#x2F;&#x2F;huggingface.co&#x2F;spaces&#x2F;Gapeleon&#x2F;snac_test" rel="nofollow">https:&#x2F;&#x2F;huggingface.co&#x2F;spaces&#x2F;Gapeleon&#x2F;snac_test</a>But when finetuned on 50+ audio samples, it produces much cleaner 24khz audio than Llasa, and the snac model is much easier to run on consumer hardware than xcodec2 (87t&#x2F;s for realtime speech, which can be achieved on an RTX3080 for example)

Interesting that there isn&#x27;t a mention of Orpheus as prior art either since it&#x27;s the exact same thing.(<a href="https:&#x2F;&#x2F;github.com&#x2F;canopyai&#x2F;Orpheus-TTS">https:&#x2F;&#x2F;github.com&#x2F;canopyai&#x2F;Orpheus-TTS</a>)

Odd that the page doesn&#x27;t seem to link to either,paper: <a href="https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2502.04128" rel="nofollow">https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;2502.04128</a>github: <a href="https:&#x2F;&#x2F;github.com&#x2F;zhenye234&#x2F;LLaSA_training">https:&#x2F;&#x2F;github.com&#x2F;zhenye234&#x2F;LLaSA_training</a>

Probably the title should have the correct capitalization then. Cause I was fully expecting a speech synthesis tool that sounded like llamas talking human language and now I&#x27;m bummed out!

LLaSA is a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as LLaMA.

You can run an openai-compatible endpoint and point open-webui at it if you want this. I had to add a function to filter out markdown lists, code, etc as the model was choking on them.

I can&#x27;t wait see this integrated into Open WebUI! These sound amazing.

This finetune seems pretty stable (1b llasa) <a href="https:&#x2F;&#x2F;huggingface.co&#x2F;spaces&#x2F;HKUST-Audio&#x2F;Llasa-1B-multi-speakers-genshin-zh-en-ja-ko" rel="nofollow">https:&#x2F;&#x2F;huggingface.co&#x2F;spaces&#x2F;HKUST-Audio&#x2F;Llasa-1B-multi-spe...</a>1B is actually huge for a TTS model. Here&#x27;s an 82m model with probably the most stable&#x2F;coherent output of all the open weights tts models I&#x27;ve tested: <a href="https:&#x2F;&#x2F;huggingface.co&#x2F;spaces&#x2F;hexgrad&#x2F;Kokoro-TTS" rel="nofollow">https:&#x2F;&#x2F;huggingface.co&#x2F;spaces&#x2F;hexgrad&#x2F;Kokoro-TTS</a>But if you mean zero-shot cloning, yeah they all seem to have those slurred speech artefacts from time to time.

the mispronunciation of 行 and 行 in the Chinese sample is killing me too XD

If you&#x27;re doing a home lab voice assistant 1B is nice, because on a 12gb gpu you can run a moderately competent 7b LLM and two 1b models; 1 for speech to text and also text to speech, plus some for the wake word monitor. Maybe in a couple of years we can combine all this into a single ~8b model that runs efficiently on 12gb gpu. Nvidia doesn&#x27;t seem very incentivized right now to sell consumer GPUs that can run all this on a single consumer grade chip when they&#x27;re making so much money selling commercial grade 48gb cards.

based on the samples, it really seams like anything smaller than 3B is pretty useless.

the long &#x27;uuuuhhhhhhh&#x27; from some of the lesser models is killing me.

This already exists in Transformer Lab and ONNX (not recommended for transformers).You can also build a custom version of llama.cpp that writes out the ggml compute graph. What&#x27;s irritating is that hugging face didn&#x27;t add it to their GGUF file viewer.

Sounds like a solid SaaS business plan!

&gt; employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully alignI really wish when new models were released that they would draw a diagram of all the layers and the tensor input and output sizes at each layer, with zoom in&#x2F;out capabilities if needed using D3.js or whatever visualization framework if needed. Every single layer should be on there with its input and output sizes.These one-sentence descriptions, and approximate block diagrams with arrows pointing at each other are never enough to understand how something is actually implemented.

Llasa: Llama-Based Speech Synthesis