Heh, funny you mention that, considering Be's pivot to BeIA. Some Be engineers also worked on the (unreleased) Palm OS Cobalt, and eventually, Android. (And then Fuchsia, but I don't think that OS will ever hit smartphones.)
Haiku does support WLAN adapters (even USB ones), though the support isn't as extensive as on Linux or the BSDs. That said, you might want to use a current nightly build instead of the latest beta, which was released in December 2022.
Does this do RAG over the character's chat history too? That's something SillyTavern can also do with extensions, but I figured that since your project already uses Llamaindex, this feature could be baked in from the get-go.
Yep, it can do CoT for ongoing conversations or to get to the bottom of something through back-and-forth. And you nailed it regarding llamaindex; they provide framework options: https://docs.llamaindex.ai/en/latest/examples/chat_engine/ch... (perfect for HN with the Paul Graham example!)
They even dabble in custom personalities with prompt mixins (example: you can chat with a PDF that responds like Shakespeare), and if that part were more robust I would delegate to it instead of what I built with ragdoll's prompt prefixes. Turns out the hard part isn't converting third-person to first-person. For ragdoll, the heavy lifting is in the configuration and management of different personas, its multi-modality (of models), and the Node & React libraries that let developers use them in realistic applications. The value llamaindex brings is its incredible indexing capabilities combined with a conversational query engine (which is why I chose llamaindex over langchain for this). Ragdoll picks up where llamaindex leaves off regarding personas.
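To make the prompt-prefix idea concrete, here's a minimal sketch of compiling a persona config into a first-person system-prompt prefix. This is purely illustrative; the function, field names, and example persona are all hypothetical and are not ragdoll's actual code.

```python
# Hypothetical sketch of a persona prompt prefix -- not ragdoll's actual code.
def build_persona_prefix(persona: dict) -> str:
    """Compile a persona config into a system-prompt prefix that keeps
    the model speaking in the first person."""
    lines = [
        f"You are {persona['name']}. Always speak in the first person.",
        f"Personality: {persona['personality']}",
        f"Speaking style: {persona['style']}",
    ]
    if persona.get("knowledge"):
        lines.append("You know about: " + ", ".join(persona["knowledge"]))
    return "\n".join(lines)

shakespeare = {
    "name": "William Shakespeare",
    "personality": "witty, dramatic, fond of metaphor",
    "style": "Early Modern English, iambic flourishes",
    "knowledge": ["Elizabethan theatre", "sonnets"],
}
prefix = build_persona_prefix(shakespeare)
print(prefix)
```

The interesting part in practice isn't this string-building; it's managing many such configs and swapping them per conversation, which is where the persona-management layer earns its keep.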
I love that SillyTavern says on their GitHub README: "On its own Tavern is useless, as it's just a user interface. You have to have access to an AI system backend that can act as the roleplay character." I want to avoid being a thin wrapper, and instead have that roleplay character aspect be central to what ragdoll does, so that it can be the de facto creative studio for any character-focused creative deliverable: A story, a film, music, games - so that a user can literally create films and music (and more) in this app like some kind of super Photoshop. I think to accomplish that, it cannot simply be a thin wrapper around an open model. It has to bring as much to the table as an ultra fine-tuned model would yet in seconds instead of years, and with the app- and community-level functionality needed (including being a free-to-use creator tool) to get people to actually build things with it.
Not yet haha but even as a place to hang out and casually chat, it would be cool if the character occasionally rendered a cutscene to go along with narratives, or you could optionally enable music and sfx like an audiobook. Maybe the most interesting ones you could export (and distribute for others to experience).
Though I bet the transition from AI text chat to rich multimedia will be like silent films to talkies - where some characters just aren't as interesting with a voiceover or depicted in a video. For some types of characters (written storytellers, etc.) the best interactions might always be text-based.
I felt this with the Final Fantasy 7 Remake: while it's clearly improved over the 1997 version, something felt lost in the transition from the old pre-rendered scenes (drawings) and reading the dialog in your head to high-quality voiceovers in gorgeous 3D scenes. Yet if you take a Metal Gear Solid or a Madden, the richer the experience, the better.
Ideally: you start out just wanting to go to the tavern and chat with a group of characters, but the interaction becomes so unexpectedly rich and entertaining that you want to capture it, so you can watch it again or share it.
This is HN, so I'm surprised that no one in the comments section has run this locally. :)
Following the instructions in their repo (and moving the checkpoints/ and resources/ folders into the "nested" openvoice subfolder), I managed to get the Gradio demo running. Simple enough.
It appears to be quicker than XTTS2 on my machine (RTX 3090), and utilizes approximately 1.5GB of VRAM. The Gradio demo is limited to 200 characters, perhaps for resource usage concerns, but it seems to run at around 8x realtime (8 seconds of speech for about 1 second of processing time).
EDIT: patched the Gradio demo for longer text; it's way faster than that. One minute of speech only took ~4 seconds to render. Default voice sample, reading this very comment: https://voca.ro/18JIHDs4vI1v
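For anyone checking the arithmetic: the real-time factor is just seconds of audio produced per second of wall-clock processing, so one minute of speech in ~4 seconds is roughly 15x, well above the 8x I saw with the 200-character demo.

```python
def realtime_factor(speech_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of processing time."""
    return speech_seconds / wall_seconds

print(realtime_factor(8, 1))   # -> 8.0, the short-demo rate
print(realtime_factor(60, 4))  # -> 15.0, the patched run (one minute in ~4 s)
```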
I had to write out acronyms -- XTTS2 to "ex tee tee ess two", for example.
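Writing out acronyms by hand gets tedious, so a small substitution table in a preprocessing step can do it automatically. The mappings below are my own ad hoc spellings, not anything OpenVoice provides:

```python
import re

# My own ad hoc phonetic spellings -- not part of OpenVoice.
ACRONYMS = {
    "XTTS2": "ex tee tee ess two",
    "TTS": "tee tee ess",
    "GPU": "gee pee you",
}

def expand_acronyms(text: str) -> str:
    # Longest keys first, so "XTTS2" wins over its substring "TTS".
    keys = sorted(map(re.escape, ACRONYMS), key=len, reverse=True)
    pattern = re.compile("|".join(keys))
    return pattern.sub(lambda m: ACRONYMS[m.group(0)], text)

print(expand_acronyms("XTTS2 is a GPU-hungry TTS model"))
# -> "ex tee tee ess two is a gee pee you-hungry tee tee ess model"
```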
The voice clarity is better than XTTS2, too, but the speech can sound a bit stilted and, well, robotic/TTS-esque compared to it. The cloning consistency is definitely a step above XTTS2 in my experience -- XTTS2 would sometimes have random pitch shifts or plosives/babble in the middle of speech.
I am trying to run it locally but it doesn't quite work for me.
I was able to run the demos all right, but when trying to use another reference speaker (in demo_part1), the result doesn't sound at all like the source (it's just a random male voice).
I'm also trying to produce French output, using a reference audio file in French for the base speaker, and a text in French. This triggers an error in api.py line 75 that the source language is not accepted.
Indeed, in api.py line 45 the only two source languages allowed are English and Chinese; simply adding French to language_marks in api.py line 43 avoids the errors but produces a weird/unintelligible result with a super heavy English accent and pronunciation.
I guess one would need to generate source_se again, and probably mess with config.json and checkpoint.pth as well, but I could not find instructions on how to do this...?
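For anyone wanting to reproduce the experiment, the edit is just extending the language table, roughly like this (paraphrased from memory of api.py; the exact names and values in the repo may differ):

```python
# Paraphrased from OpenVoice's api.py -- exact names/values may differ.
language_marks = {
    "english": "EN",
    "chinese": "ZH",
    "french": "FR",  # my addition: silences the language check,
                     # but the output is garbled without a French source_se
}
```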
Edit -- tried again on https://app.myshell.ai/ and the result sounds French all right, but still nothing like the original reference. It would be absolutely impossible to confuse one with the other, even for someone who didn't know the person very well.
I played with it some more and I have to agree. For actual voice _cloning_, XTTS2 sounds much, much closer to the original speaker. But the resulting output is also much more unpredictable and sometimes downright glitchy compared to OpenVoice. XTTS2 also tries to "act out" the implied emotion/tone/pitch/cadence in the input text, for better or worse.
But my use case is just to have a nice-sounding local TTS engine, and current text-to-phoneme conversion quirks aside, OpenVoice seems promising. It's fast, too.
> but when trying to use another reference speaker (in demo_part1), the result doesn’t sound at all like the source
I’ve noticed the same thing and I wonder if there is maybe some undocumented information about what makes a good voice sample for cloning, perhaps in terms of what you might call “phonemic inventory”. The reference sample seems really dense.
> Indeed, in api.py line 45 the only two source languages allowed are English and Chinese
If you look at the code, outside of what the model itself does, it relies on the surrounding infrastructure converting the input text to the International Phonetic Alphabet (IPA) as part of the process, and only has that implemented for English and Mandarin (though cleaners.py has broken references to routines for Japanese and Korean).
Give https://github.com/aedocw/epub2tts a look, the latest update enables use of MS Edge cloud-based TTS so you don't need a local GPU and the quality is excellent.
I want to try chaining XTTS2 with something like RVCProject. The idea is to generate the speech in one step, then clone a voice in the audio domain in a second step.
I have got to build or buy a new computer capable of playing with all this cool shit. I built my last "gaming" PC in 2016, so its hardware isn't really ideal for AI shenanigans, and my MacBook for work is an increasingly crusty 2019 model, so that's out too.
Yeah, I could rent time on a server, but that's not as cool as just having a box in my house that I could use to play with local models. Feels like I'm missing a wave of fun stuff to experiment with, but hardware is expensive!
> its hardware isn't really ideal for AI shenanigans
FWIW, I was in the same boat as you and decided to start cheap; old gaming machines can handle AI shenanigans just fine with the right GPU. I use a 2017 workstation (Zen 1) and an Nvidia P40 from around the same era, which can be had for <$200 on eBay/Amazon. The P40 has 24GB of VRAM, which is more than enough for a good chunk of quantized LLMs or diffusion models, and is in the same perf ballpark as the free Colab tensor hardware.
If you're just dipping your toes in without committing, I'd recommend that route. The P40 is a data center card and expects higher airflow than desktop GPUs, so you'll probably have to buy a blower kit or 3D-print a fan shroud and make sure it fits inside your case; that's another $30-$50. The bigger the fan, the quieter it can run. If you already have a high-end gaming PC/workstation from 2016, you can dive into local AI for $250 all-in.
Edit: didn't realize how cheap P40s now are! I bought mine a while back.
A Mac Studio or MacBook Pro if you want to run the larger models. Otherwise just a gaming PC with an RTX 4090, or a used RTX 3090 if you want something cheaper. A used dual-3090 setup can also be a good deal, but that's more in the build-it-yourself category than off the shelf.
I went the 4090 route myself recently, and I feel like all should be warned: memory is a major bottleneck. For a lot of tasks, folks may get more mileage out of multiple 3090s if they can get them set up to run in parallel.
Still waiting on being able to afford the next 4090, plus an eGPU case and the rest. There are a lot of things this rig struggles with, running out of memory even on inference with some of the more recent SD models.
Sorry if this is a silly question - I was never a Mac user, but I quick googled Mac Studio and it seems it's just the computer. Can I plug it to any monitor / use any keyboard and mouse, or do I need to use everything from Apple with it?
You can, but with some caveats: not all screen resolutions work well with macOS, though with BetterDisplay they will usually still work. If you want Touch ID, it's better to get the Magic Keyboard with Touch ID.
Any monitor and keyboard will work; however, Apple keyboards have a couple of extra keys not present on Windows keyboards, so a Windows keyboard requires some key remapping to give you access to all the typical shortcut combinations.
I'm in exactly the same boat. Yeah, of course you can run LMs on cloud servers, but my dream project would be to build a new gaming PC (mine is too old), serve an LM on it, and then serve an AI agent app I can talk to from anywhere.
Has anyone had luck buying used GPUs, or is that something I should avoid?
I bought some used GPUs during the last mining thing. They all worked fine except for some oddball Dell models that the seller was obviously trying to fix a problem on (and they took them back without question, even paying return shipping).
And old mining GPUs are A-OK, generally: despite warnings from the peanut gallery for over a decade that mining ruins video cards, this has never really been the case. Profitable miners have always tended to treat these things very carefully, undervolting (and often underclocking) them and keeping an eye on them so they could run as cool and inexpensively as possible. Killing cards is bad for profits, so they aimed to keep them alive.
GPUs that were used for gaming are also OK, usually. They'll have fewer hours of hard[er] work on them, but will have more thermal cycles as gaming tends to be much more intermittent than continuous mining is.
The usual caveats apply as when buying anything else (used, "new", or whatever) from randos on teh Interwebz. (And fans eventually die, and so do thermal interfaces like pads and thermal compound, but those are all easily replaceable by anyone with a small toolkit and half a brain's worth of wit.)
I found this recent thread interesting, specifically about really considering whether you're going to read back the data you just wrote in the near future (if not, use direct IO), plus a set of (abandoned?) patches for write-behind caching of sequential writes in Linux (https://lore.kernel.org/lkml/156896493723.4334.1334048120714...).
The unlock tool would only work if it successfully authenticates with Xiaomi's server with matching Mi Cloud ID as the one previously registered to the device. So I very much doubt that it is stolen.
Glad to see that noise reduction is on the roadmap! Does Filmulator support embedded lens profiles? I enjoyed using Filmulator, btw; its default output is just lovely.
https://www.linuxfoundation.org/legal/the-linux-mark