JohnTheNerd's comments | Hacker News

that's a great idea! I've been looking into that (I'm merely logging all prompts in a JSON file for now, so that I can analyze them later).

skipping the LLM would be tough because there are so many devices in my house, not to mention it would take away from the personality of the assistant.

however, a recommendation algorithm would actually work great, since I could augment the LLM prompt with it regardless of what the user asks.


If you are scared of messing with electricity like I am, instead of using power monitoring, another viable but less reliable option is to use vibration sensors.

I picked up a simple Zigbee vibration sensor for less than $20, taped it on top of the washer/dryer, and connected it to HomeAssistant using Zigbee2MQTT. After creating two automations to detect start/stop based on continuous vibration over a few minutes, I had the notifications working the exact same way.
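
For illustration, here is a minimal Python sketch of the start/stop debounce logic those two automations implement (the real thing is plain HomeAssistant automations; the thresholds and names below are made-up placeholders):

    import time

    class CycleDetector:
        """Toy state machine: report 'started' after sustained vibration,
        'finished' after a sustained quiet period."""

        def __init__(self, start_after_s=180, stop_after_s=300):
            self.start_after_s = start_after_s  # continuous vibration needed to call it running
            self.stop_after_s = stop_after_s    # continuous quiet needed to call it done
            self.running = False
            self.vibrating_since = None
            self.quiet_since = None

        def update(self, vibrating: bool, now: float | None = None):
            """Feed every sensor reading; returns 'started', 'finished', or None."""
            now = time.time() if now is None else now
            if vibrating:
                self.quiet_since = None
                self.vibrating_since = self.vibrating_since or now
                if not self.running and now - self.vibrating_since >= self.start_after_s:
                    self.running = True
                    return "started"
            else:
                self.vibrating_since = None
                self.quiet_since = self.quiet_since or now
                if self.running and now - self.quiet_since >= self.stop_after_s:
                    self.running = False
                    return "finished"
            return None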


I don’t know about economical, but there are devices that sense current flow in a line without a tap, just from the EM fields. People with wood shops use these to automatically turn on the dust collection system when any power tool is turned on.

Between small production sizes and the relay, there are probably cheaper DIY solutions. OP is just using a smart plug, so there are no real electronics involved.


The problem with those is that they only work if they are wrapped around a live wire, not live and neutral. So unless you have exposed L/N wires in your cable, it's not going to work at all.


These clamp-on systems may not work too well on North American 240V circuits like a dryer since the EMF will cancel, no? You have two 120V feeds exactly 180 degrees out of phase with each other. Unless you unravel the wires a bit and just clamp a single conductor.

Or maybe I have this all wrong.


You are correct, current clamps can't read accurately in this situation. All you'll get is any disparity between the phases, which will be small.

That said, it's probably worth a try, because it's likely that internally the dryer's control circuits are only tapping one leg, and you might be able to read that, and it might change reliably enough to trigger an alert.

Usually, however, you'd use a vibration sensor for this job. You can also use a thermal sensor on the exhaust air, combined with a time delay (because they run cooldown/dewrinkle cycles), and you're good to go.


Huh, I thought 240V circuits used some sort of three-phase system.


I think you have a valid point, but the risk of this feels exaggerated.

I already had a few entities I didn't really need it to use (not for security reasons, but to shorten the system prompt). I simply excluded them within the Jinja template itself. I can see this being a problem for people who have their ovens or thermostats on HA, but I don't necessarily think it's an unsolvable issue if we implement sensible sanity checks on the output.

hilariously, the model I'm using doesn't even have any RLHF. but I am also not very concerned if GlaDOS decides to turn on the coffee machine. maybe I would be slightly more concerned if I had a smart lock, but I think primitive methods such as "throw big rock at window" would be far easier for a bad person.

when it comes to jailbreak prompts, you need to be able to call the assistant in the first place. if you are authorized to call the HomeAssistant API, why would you bother with the LLM? just call the respective API directly and do whatever evil thing you had in mind. I took an unreasonable number of measures to try to stop this from happening, but I admit that's a risk. however, I don't think that's a risk caused by the LLM, but rather by the existence of IoT devices.


I recommend opening the original link if possible, because the archive link is missing the demo video and a few important updates to the jinja templates!


that is correct! the less I rely on external companies and/or servers, the happier I am with my setup.

I actually greatly simplified my infrastructure in the blog... there's a LOT going on behind those network switches. it took quite a bit of effort for me to be able to say "I'm comfortable exposing my servers to the internet".

none of this stuff uses the cloud at all. if johnthenerd.com resolves, everything will work just fine. and in case I lose internet access, I even have split-horizon DNS set up. in theory, everything I host would still be functional without me even noticing I just lost internet!


I do it, but I'm completely insane:

- I actually stay on top of all patches, including HomeAssistant itself

- I run it behind a WAF and IPS. lots of VLANs around. even if you breach a service, you'll probably trip something up in the horrific maze I created

- I use 2-factor authentication, even for the limited accounts

- Those limited accounts? I use undocumented HomeAssistant APIs to lock them down to specific entities

- I have lots of other little things in place as a first line of defense (certain requests and/or responses, if repeated a few times, will get you IP banned from my server)

I would not recommend any sane person expose HomeAssistant to the internet, but I think I locked it down well enough not to worry about a VPN.


> - Those limited accounts? I use undocumented HomeAssistant APIs to lock them down to specific entities

Mind sharing your process to achieve what sounds like successful implementation of the much-requested ACL/RBAC support?


"successful" is a very optimistic way of looking at it. it has several downsides but largely works for my needs:

- read access is mostly available for sensors, even if access wasn't granted.

- some integrations (especially custom integrations) don't care about authorization. my fork mentioned in the blog does, because I explicitly added logic to authorize requests. the HomeAssistant authorization documentation is outdated and no longer works. I looked through the codebase for extensions that implement it, to use as an example. maybe I should submit a PR that fixes the doc...

- each entity needs to be explicitly allowed. this results in a massive JSON file.

- it needs a custom group added to the .storage/auth file. this is very much not officially supported. however, it has survived every update I have received so far (and I always update HomeAssistant).

I will share what I did in detail when I get some time on my hands.


Much appreciated. Sounds as if you're way out of spec. Still, it should be interesting to go through your methods.


it actually works really well when I use it, but it is slow because of the 4060Ti's (~8 seconds), and there is slight overfitting to the examples provided. none of it seemed to affect the actions taken, just the commentary.

I don't have prompts/a video demo on hand, but I might record some and post them to the blog when I get a chance.

I didn't intend to make a tech demo; this is meant to help anyone else who might be trying to build something like this (and apparently HomeAssistant itself seems to be planning such a thing!).


thank you for building an amazing product!

I suspect cloning OpenAI's API is done for compatibility reasons. most AI-based software already supports the GPT-4 API, and OpenAI's official client allows you to override the base URL very easily. a local LLM API is unlikely to be anywhere near as popular, greatly limiting the use cases of such a setup.
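
to illustrate how easy the override is, here is a minimal sketch using the current openai Python client (the local URL, API key, and model name are placeholders for whatever your local server exposes):

    # assumes an OpenAI-compatible server (e.g. vLLM) listening at http://localhost:8000/v1
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    response = client.chat.completions.create(
        model="local-model",  # whatever model id the local server reports
        messages=[{"role": "user", "content": "Turn on the living room lights."}],
    )
    print(response.choices[0].message.content)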

a great example is what I did, which would be much more difficult without the ability to run a replica of OpenAI's API.

I will have to admit, I don't know much about LLM internals (and certainly do not understand the math behind transformers) and probably couldn't say much about your second point.

I really wish HomeAssistant allowed streaming the response to Piper instead of having to have the whole response ready at once. I think this would make LLM integration much more performant, especially on consumer-grade hardware like mine. right now, after I finish talking to Whisper, it takes about 8 seconds before I start hearing GlaDOS, and the majority of the time is spent waiting for the language model to respond.

I tried to implement it myself and simply create a pull request, but I realized I am not very familiar with the HomeAssistant codebase and didn't know where to start such an implementation. I'll probably take a better look when I have more time on my hands.


So how much of the 8s is spent in the LLM vs Piper?

Some of the example responses are very long for the typical home automation use case, which would compound the problem. Ample room for GladOS to be sassy, but at 8s it's just too tardy to be usable.

A different approach might be to use the LLM to produce a set of GladOS-like responses upfront and pick from them instead of always letting the LLM respond with something new. On top of that, add a cache that stores .wav files after Piper has synthesized them the first time. A cache is how e.g. Mycroft AI does it. Not sure how easy it will be to add on your setup though.
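
A rough sketch of such a cache (synthesize_wav() is a stand-in for whatever call actually drives Piper in this setup, and the cache directory is arbitrary):

    import hashlib
    from pathlib import Path

    CACHE_DIR = Path("tts_cache")
    CACHE_DIR.mkdir(parents=True, exist_ok=True)

    def cached_tts(text: str, synthesize_wav) -> Path:
        """Return a .wav path for `text`, synthesizing only on a cache miss."""
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        wav_path = CACHE_DIR / (key + ".wav")
        if not wav_path.exists():
            wav_path.write_bytes(synthesize_wav(text))  # slow path, first time only
        return wav_path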


it is almost entirely the LLM. I can see this in action by typing my request on my computer instead of using my phone/watch, which bypasses Whisper and Piper entirely.

your approach would work, but I really like the creativity of having the LLM generate the whole thing. it feels much less robotic. 8 seconds is bad, but not quite unusable.


A quick fix for the user experience would be to output a canned "one moment please" as soon as the input's received.


Streaming responses is definitely something that we should look into. The challenge is that we cannot just stream single words, but would need to find a way to learn how to cut up sentences. Probably starting with paragraphs is a good first step.


alternatively, could we not simply split on common characters such as newlines and periods, to break the response into sentences? it would be fragile, with special handling required for numbers with decimal points and probably various other edge cases, though.

there are also Python libraries meant for natural language parsing[0] that could do that task for us. I even see examples on Stack Overflow[1] that simply split text into sentences. a rough sketch of that approach follows the links below.

[0]: https://www.nltk.org/ [1]: https://stackoverflow.com/questions/4576077/how-can-i-split-...
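
the sketch, assuming an iterable of streamed text chunks and some speak() callable that hands each finished sentence to Piper (both are stand-ins; NLTK's punkt tokenizer handles decimal points and common abbreviations reasonably well):

    import nltk
    from nltk.tokenize import sent_tokenize

    nltk.download("punkt", quiet=True)  # one-time download of the sentence tokenizer

    def stream_sentences(token_stream, speak):
        buffer = ""
        for chunk in token_stream:
            buffer += chunk
            sentences = sent_tokenize(buffer)
            # everything except the last (possibly incomplete) sentence is safe to speak
            for sentence in sentences[:-1]:
                speak(sentence)
            buffer = sentences[-1] if sentences else ""
        if buffer.strip():
            speak(buffer)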


power consumption. I am running multiple GPUs somewhere residential. the 4060Ti only draws 180W at max load (which it almost never reaches). the 3090 is about double that for 1.5x the VRAM, and it's notorious for briefly consuming much more than its rated wattage.

this isn't just about the power bill. consider that your power supply and electrical wiring can only push so many watts. you really don't want to try to draw more than that. after some calculations given my unique constraints, I decided the 4060Ti is the much safer choice.


>3090 is about double for 1.5x the VRAM

Not just that - tensor core count and memory throughput are both ~triple.

Anyway, don't want to get too hung up on that. Overall looks like a great project & I bet it inspires many here to go down a similar route - congrats.


A 3090 or 4090 can easily pull down enough power that most consumer UPSes (besides the larger tower ones) will do their 'beep of overload', which at best is annoying, at worst causes stability issues.

I think there's a sweet spot around 180-250W for these cards, unless you _really_ need top-end performance.


To me it's the PCIe lanes that are the issue. The chances of a random gamer having a PSU that can run dual cards are excellent... the chances of dual x16 electrical, not so much.

I tried dual cards in x16/x4 and inference performance cratered versus a single card.


yes, they are the 16GB models. beware that the memory bus limits you quite a bit. however, buying brand new, they are the best VRAM per dollar in the NVIDIA world as far as I could see.

I use 4-bit GPTQ quants. I use tensor parallelism (vLLM supports it natively) to split the model across two GPUs, leaving me with exactly zero free VRAM. there are many reasons behind this decision (some of which are explained in the blog), and a rough sketch of the vLLM setup follows the list below:

- TheBloke's GPTQ quants only support 4-bit and 3-bit. since the quality difference between 3-bit and 4-bit tends to be large, and I wanted high accuracy for non-assistant tasks too, I simply went with 4-bit (I did not test 3-bit).

- vLLM only supports GPTQ, AWQ, and SqueezeLLM for quantization. vLLM was needed to serve multiple clients at a time, and it's very fast (I want to use the same engine for multiple tasks; this smart assistant is only one use case). I get about 17 tokens/second, which isn't great, but it is very functional for my needs.

- I chose GPTQ over AWQ for reasons I discussed in the post, and I don't know anything about SqueezeLLM.
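
here is that sketch, using vLLM's offline Python API (in reality I serve the OpenAI-compatible endpoint instead, and the exact model id and memory settings are illustrative):

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",  # 4-bit GPTQ quant
        quantization="gptq",
        tensor_parallel_size=2,       # split the weights across both 4060Ti's
        gpu_memory_utilization=0.98,  # squeeze out nearly all of the 16GB per card
    )

    outputs = llm.generate(
        ["You are GlaDOS, a sarcastic smart home assistant. Turn on the kitchen lights."],
        SamplingParams(max_tokens=128, temperature=0.7),
    )
    print(outputs[0].outputs[0].text)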


> however, buying brand new, they are the best VRAM per dollar in the NVIDIA world as far as I could see.

The 3060 12GB is cheaper upfront and a viable alternative. A used 3090 Ti is also cheaper per GB of VRAM, although it's a power hog.

The 4060 Ti 16GB is a nice product, just not for gaming. I would wait for price drops, because Nvidia just released the 4070 Super, which should drive down the cost of the 4060 Ti 16GB. I also think the 4070 Ti Super 16GB is nice for hybrid gaming/LLM usage.


that is true, but consider two things:

- motherboards and CPUs have a limited number of PCIe lanes available. I went with a second-hand Threadripper 2920X to be able to have 4 GPUs in the future. since you can only fit so many GPUs, your total available VRAM and future upgrade capacity is overall limited. these decisions limit me to PCIe Gen 3 x8 (the motherboard only supports PCIe Gen 3, and the 4060Ti only supports 8 lanes), but I found that it's still quite workable. during regular inference, Mixtral 8x7B at 4-bit GPTQ quant using vLLM can output text faster than I can read (maybe that says something about my reading speed rather than the inference speed, though). I average ~17 tokens/second.

- power consumption is a big deal when you are self-hosting, not only when you get the power bill, but also for safety reasons. you need to make sure you don't trip the breaker (or worse!) during inference. the 4060Ti draws 180W at max load. 3090s are also notorious for (briefly) drawing well over their rated wattage, which scared me away.


Great, thanks. Economics on IT h/w this side of the pond are often extra-complicated. And as a casual watcher of the space, it feels like a lot of discussion and focus has turned towards optimising performance over the past few months. So I'm happy to wait and see a bit longer.

From TFA I'd gone to look up GPTQ and AWQ, and inevitably found a Reddit post [0] from a few weeks ago asking if both were now obsoleted by EXL2. (sigh - too much, too quickly) Sounds like vLLM doesn't support that yet anyway. The tuning it seems to offer is probably offset by the convenience of using TheBloke's ready-rolled GGUFs.

[0] https://www.reddit.com/r/LocalLLaMA/comments/18q5zjt/are_gpt...

