Ask HN: Can we build Ironman's Jarvis with 2019 tech?
99 points by hsikka on May 2, 2019 | 62 comments
I was rewatching movies in preparation for the new Marvel movie, and I felt some nostalgia and childhood fascination when I saw the depiction of Jarvis in the Ironman movies.

I work in ML research and am currently in graduate school, and I know we're nowhere near real intelligence. But some of those features, question generation, voice commands, object detection, image reconstruction, and others are certainly doable.

Do you think we could build something starting to approach Jarvis in 2019, at least in function i.e. helping your everyday work?



I'm pretty much doing that very thing. I'm quadriplegic, so every single action I want to take in the world has to be mediated through a third party, and that party can be anything from another person to one of the digital assistants.

As I think another poster mentioned, I control pretty much everything in my house, including but not limited to unlocking doors, windows, climate control, the lights, posting to Twitter and on and on and on, with a cobbled-together solution of many different components from different companies.

It's done with a combination of all three of the digital assistants, various scripts and the very wonderful Home Assistant[^1] mostly gluing it all together. It is by no means a single complete solution, but when I'm controlling the house using just my voice and somebody sees me doing it, they react as though I'm some sort of dark wizard.

I love having an automated house, and I think Home Assistant is probably one of the best solutions for making all of the different IoT devices communicate at the moment. I think the further down this road we go, the more a single solution will probably evolve.

I would have to say though that the home automation stuff that enables me to call for help in an emergency is the most important thing that it does. I can flash the lights different colours when I need help, depending on the level of assistance I need. Or if my Fitbit notices that my heart rate has gone below 40 BPM for more than 10 seconds I get a notification, as that is almost always indicative of an attack of Autonomic Dysreflexia. I can honestly say that this home automation system has actually saved my life, without an ounce of hyperbole.
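The rule itself is nothing clever, by the way. Conceptually it's just a sustained-threshold check like the sketch below, where get_heart_rate() and notify() are placeholders for whatever your own setup provides:

    import time

    LOW_BPM = 40          # alert threshold
    WINDOW_SECONDS = 10   # how long it must stay low before alerting

    def watch(get_heart_rate, notify):
        # get_heart_rate() and notify() are stand-ins for your own data
        # source and notification channel.
        below_since = None
        while True:
            bpm = get_heart_rate()
            if bpm < LOW_BPM:
                below_since = below_since or time.monotonic()
                if time.monotonic() - below_since >= WINDOW_SECONDS:
                    notify("Heart rate at %d BPM for %ds" % (bpm, WINDOW_SECONDS))
                    below_since = None
            else:
                below_since = None
            time.sleep(1)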

[^1]: https://Home-assistant.io/


>if my Fitbit notices that my heart rate has gone below 40 BPM for more than 10 seconds I get a notification, as it is almost indicative of an attack of Autonomic Dysreflexia. I can honestly say that this home automation system has actually saved my life without an ounce of hyperbole.

I'm a #FitbitEmployee but I don't speak for Fitbit. Speaking personally, I'm happy that it's worked well for you so far, but I'm concerned if your life is likely to depend on consistently good HR readings from a Fitbit. Although manufacturing has improved a lot since the Charge HR days, they still aren't medical-grade devices. You should talk to a medical professional about what to use to monitor your heart rate.


Thank you for your concern, but I never have and would never place my life in the hands of any one piece of anything really, be that technology, person or process.

The Fitbit has given me a heads-up a little earlier than I otherwise would have got it; the value added for me is that it's cheap and a great indicator. But an indicator is all it is, and it is part of a wider protocol we have in the house for preventing and detecting Autonomic Dysreflexia. A vital part, but still just one component.

I do wish I could turn off the step counter though, and if it could detect when I'm asleep that would also be great. The former I don't think I'm ever going to do very well on and the latter is really not the fault of the device as I don't move around very much at all.

Thanks again for your concern! :-)


Good-hearted advice, but lacking in a few aspects. Most medical products are terrible with open integrations and open standards.

I suspect this kind of advice is given by consumer companies more out of fear of liability than anything else.

In many cases, medical devices that have consumer equivalents don't offer anything extra beyond a better-written user agreement and a little more liability.


The distinction between medical and non-medical devices is meaningful: https://en.wikipedia.org/wiki/Medical_device


And if you want to add a 100% offline and private-by-design voice AI which can run on a Raspberry Pi 3 (and iOS, Android, Linux), take a look at what we are building at https://snips.ai (it also integrates with Home Assistant).

It works for English, German, French, Japanese, Spanish and Italian, with more coming soon!


This looks very close to what I was looking for about a year ago when I wanted to try my hand at making my own 'personal assistant'. I was thinking something like the Star Trek computer but just for data lookups and binary execution rather than calculations and data analysis.

e.g. "Computer, look up 'can cats eat pancakes?'" which would just open Firefox and input that in a DDG search, or, "Computer, play the song 'Never Gonna Give You Up'" which would open WMP and play that song from my offline music library. Ideally it would vocally respond with some pre-baked yes/no responses before executing the command.

Is that doable with your software? Does it have a Python API?


Can you email me? I would love to move more and more of this stuff inside the wire, and if we can do something that we can then share with other quadriplegics so they can keep control of their own medical info, that would be very groovy.


Looks awesome! Any plans for FreeRTOS?


Snips.ai runs as a Docker container in Home Assistant. I'm sure you can run Docker stuff on FreeRTOS


Thanks for the insights into your setup. If you don't mind my asking, how do you go about interacting with traditional desktop and web-based interfaces? Is there anything that a web developer like me should know in order to make life easier for you?


I would very much like to learn more about what you're doing. My wife sustained a spinal cord injury and, while she is still able to walk with assistance, she still has issues that present fall risks, along with some sensory issues.

I've been adding home automation in little bits at a time, as we can afford it, also using Home Assistant, and working to make the system as bulletproof as possible.


Sounds very cool! What are you using for sensors and physical hardware? I mean light switches, temperature control etc - I browsed the home assistant site but didn't really come out with an understanding of what hardware the system can communicate with to, say, turn off a lightbulb.


You use a component to connect to some networked light control system. The "Light" category has 71 components: https://www.home-assistant.io/components/#light

An example would be IKEA Trådfri (https://www.home-assistant.io/components/tradfri/); you connect the lights to its gateway device, then Home Assistant will use the API provided by that gateway to control the lights.
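And if you'd rather poke at it from a script than through the UI, Home Assistant also exposes a REST API. A minimal sketch (the host, the entity_id and the long-lived access token are placeholders you'd swap for your own):

    import requests

    HA_URL = "http://homeassistant.local:8123"    # your Home Assistant host
    TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"        # created on your profile page
    HEADERS = {"Authorization": "Bearer " + TOKEN,
               "Content-Type": "application/json"}

    def turn_off(entity_id):
        # Calls the light.turn_off service on the given entity.
        requests.post(HA_URL + "/api/services/light/turn_off",
                      headers=HEADERS,
                      json={"entity_id": entity_id},
                      timeout=10)

    turn_off("light.living_room")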


Thank you, components is what I needed to find!


Very interesting information on that site and it's good to see you involved in a part of the programming world.


Do you have a good "beginners guide to the ultimate voice setup"?


Siri is still half deaf and tries to send me to the chip shop when I ask it to send a message, and they have a ridiculous amount of budget behind that, so probably not.

At the moment our technology is like Zorg’s desk in Fifth Element. Stuffed full of toys but ultimately will let you down too often to be relied upon.


"like Zorg’s desk in Fifth Element"... What a great analogy!

For those that don't know the reference: https://www.youtube.com/watch?v=krcNIWPkNzA

(I think this is my favorite Gary Oldman character.)


Google is no better. You still can't "send my wife a Slack message" or "read me the last SMS received" or even "send a Hangouts message to Bob".

Some of the integrations are willfully missing for obvious business reasons.


"send me to the chip shop"

Do you have an accent, by any chance? Pretty much all voice recognition technology (Android, Apple, cars...) appears to be utterly flummoxed by my fairly mild Scottish accent.

[Edit: to be fair, my accent is probably fairly mild by Scottish standards, not compared to RP].


Must be it. A golden oldie about Scottish speech recognition (2010): https://youtu.be/NMS2VnDveP8


I have to stop myself posting that every time the subject comes up (I know what it is without opening the link) :-)


I’ve got a fairly heavy London accent.


Maybe we should join forces and found a voice technology testing company!


This. What are the people in this thread talking about? Siri and Alexa mess up way too much.


You could probably get 90% of the utility with a janky collection of shell scripts fronted by a voice recognition engine. Think Alexa/Google Assistant/Siri/etc. with a bunch of company- or task-specific scripts. I purposefully ignore the robot-arm-in-an-open-universe capability, because that's still too far away, AFAIK.
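Something like the sketch below is all I really mean by "janky collection of shell scripts": the voice engine hands you text, and you map trigger phrases onto task-specific scripts (all the script names here are invented):

    import subprocess

    # Invented mapping from trigger phrases to task-specific shell scripts.
    COMMANDS = {
        "schedule standup": "/home/me/bin/schedule_standup.sh",
        "deploy staging": "/home/me/bin/deploy_staging.sh",
        "read my calendar": "/home/me/bin/read_calendar.sh",
    }

    def dispatch(transcript):
        # Run the first script whose trigger phrase appears in the transcript.
        for phrase, script in COMMANDS.items():
            if phrase in transcript.lower():
                subprocess.run([script])
                return True
        return False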

Jarvis in the movie was Hollywood smooth: made to look great in a movie. Even the goofs were great, and it behaved like a puppy, to be likable to the audience. You don't need that in a work assistant. But if you remove that, plus the robot-arm, you're left with a voice controlled Outlook assistant, which is useful, but not sexy.


Actually you could've built Alexa/Siri etc back in 1997 with DragonDictate and a whole bunch of scripting.

And in many areas it would've been just as good as what we have today.


I had a better UX around playing music with WinAMP, Microsoft Speech API, and a little program I wrote to glue the two together via Win32 API WM_ messages, in 2007, than I have today with Google Now.

Today the problem is a) the voice recognition part, which happens on-line though it shouldn't (introducing large latency and unnecessary Internet dependency), and b) scriptability, which is something none of the voice assistant vendors want to give to you, for business reasons. A better reality would have voice parsed on-device (possibly with models from on-line service providers), and available as a voice-to-text API that could be easily accessed by local scripts.
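For (a), the baseline I'd want is something like the sketch below: recognition running entirely on-device, with the result handed to local scripts as plain text. It assumes the SpeechRecognition Python package with the offline PocketSphinx backend (English only, and nowhere near cloud accuracy), just to show the shape of the API:

    # Assumes: pip install SpeechRecognition pocketsphinx pyaudio
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        print("Say something...")
        audio = recognizer.listen(source)

    try:
        # recognize_sphinx() runs entirely on-device via PocketSphinx.
        text = recognizer.recognize_sphinx(audio)
        print("Heard:", text)
        # From here, any local script can act on `text` with no network round trip.
    except sr.UnknownValueError:
        print("Could not understand audio")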


I completely agree.

A friend once said all we need is 'the sinew between the questions'. Connecting them (contexts, intents, etc) and making it close to 1.0 probability on decisions through the entire conversation.

I think the 'marble rock' of Jarvis is there already, and with ML/AI we can likely make it 'David-shaped' but the refinement, 'warmth' and complexity of Jarvis is man-hours that would likely be better spent elsewhere.


I'm not sure I get your references (marble rock, David-shaped).


It's a reference to Michelangelo's David and this:

Every block of stone has a statue inside it and it is the task of the sculptor to reveal it.


There's also the voice-controlled CAD ("throw in a little hot rod red in there").


I'm guessing this is where the Dragon Helmet [1] is headed; it has open-source software that aims to handle the responses and command execution. And of course Mycroft [2] et al.

I think the real problem is getting the computer to do things that you haven't programmed it to do. If you imagine an Alexa / Google Home where you have a conversation like:

"Hey Homexa, do a web search for an image of a piece of cheese, enlarge it to 2000x1500 pixels and email it to me please"

If it then came back with "Ok, please tell me how to enlarge an image" and you could give it step by step instructions, that would be amazing. But I don't think the comprehension required is quite there yet. But when we do, and especially if we can crowdsource the learned commands, things could get very, very interesting.

[1] http://dragon.computer/

[2] https://mycroft.ai/


I don't think we are anywhere close to Jarvis, which is general AI (aka real AI, aka "conscious" AI), any more than we are anywhere close to building Ironman's nanotech suit. But we are improving on "dumb/unconscious" AI like Alexa, Siri, etc. that could help with day to day life whilst spying on you 24/7.


I think before we talk about capabilities we should talk about UX.

Even the first version of Jarvis shown in Iron Man 1 was way ahead, in terms of understanding speech and context, of the current cutting edge, Google Assistant.

At least Google Duplex sounds super natural now. I think when digital assistants can really understand us, adding all those capabilities is just a piece of cake. But I have a feeling we are probably at most a decade away from something that vaguely resembles that, and I'm super excited about it.


The movies (and the major real-life assistants) tend to focus on the idea that you're talking to your computer as if it were a person, but this makes the problem an order of magnitude more difficult than it needs to be. What if we looked at it more like a vocal command prompt? What if we applied the Unix philosophy of composing programs and data transformations into more complex use-cases? You'd have to memorize the commands (though there could be vocal "man pages" too), but could finally do more than toggling lights and asking what the weather is like.

There's some precedent for this; in Avengers Endgame (I'll keep it vague), Tony gives some very precise and technical commands to the system when running a scientific simulation. He's not conversing and quipping, he's basically calling parameterized functions with his voice. I think that would be very doable today.
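A toy sketch of what those parameterized voice commands could look like today, assuming the recognizer already gives you clean text (the grammar and the simulation function are invented for the example):

    import re

    def run_simulation(model, particles, duration):
        # Stand-in for the real work; just echo what was asked for.
        print("Running %s with %d particles for %.0f seconds" % (model, particles, duration))

    # A rigid, command-prompt-style grammar rather than free conversation.
    PATTERN = re.compile(
        r"run simulation (?P<model>\w+) with (?P<particles>\d+) particles "
        r"for (?P<duration>\d+) seconds")

    def parse_and_call(utterance):
        match = PATTERN.match(utterance.lower())
        if match is None:
            print("Unrecognized command")
            return
        run_simulation(match.group("model"),
                       int(match.group("particles")),
                       float(match.group("duration")))

    parse_and_call("Run simulation gamma with 5000 particles for 30 seconds")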


Ha, this is how we've been advertising our wearable personal assistant (aidlab.com) during the launch on Facebook (such a copyright infringement!). We've been facing hardware limitations adding the NLP directly on the MCU, as we wanted to avoid third parties. Secondly, adding 'reasonable' voice commands like setting appointments, wiki search or alarms is nothing we could have done alone, so we've discontinued this idea for now.

If we define J.A.R.V.I.S. as a pseudo-intelligent assistant with a basic understanding of voice commands and a biosignal tracker, then it's certainly doable, as there are some commercial implementations. Otherwise (like real-time voice chat, or even accurate suggestions based on everyday habits), it is still too soon IMO.


I think we are closer than we think.

The problem is that all voice control up until now has been a closed app with simplified intents.

In my opinion what we need is a programming language whose REPL UX is voice-oriented, and that eases moving between using and extending the system. Prolog or a Lisp could be quite close, but we would require some changes, such as: a syntax that reads aloud easily (:- in Prolog is not very good), the ability to handle ums and ahhs, the ability to "play" or "test" with functions through voice, etc.


I once toyed with the idea of a machine-oriented spoken conlang similar to Lisp in "syntax" by reserving certain syllables for punctuation (brackets, etc.). One of the main problems of existing solutions is that they lack flexibility due to the poor state of natural language understanding, which is solved by pretty much removing natural language from the equation. However, it would be unusable for most people, or at least would have a learning curve too steep for it to be practical.
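For a flavour of what I mean, a toy sketch: reserve one syllable per bracket, keep everything else as symbols, and you can "speak" s-expressions directly (the syllable choices are arbitrary):

    # Arbitrary reserved syllables standing in for punctuation.
    SYLLABLES = {"ka": "(", "ko": ")"}

    def spoken_to_sexpr(utterance):
        # Turn a stream of spoken tokens into an s-expression string.
        out = [SYLLABLES.get(token, token) for token in utterance.split()]
        # Crude joining; a real version would care more about spacing.
        return " ".join(out).replace("( ", "(").replace(" )", ")")

    # "ka plus two ka times three four ko ko" -> "(plus two (times three four))"
    print(spoken_to_sexpr("ka plus two ka times three four ko ko"))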


We need some type of Structured Query Language, call it SEQUEL, or maybe abbreviated SQL for short. /s

Jokes aside, maybe a LISP-like Sentence Processing Language might be the future of this type of HCI.


Look into Attempto Controlled English for an alternative closer to what I was thinking about :)


I wrote a little shell script that lets you move the mouse around with your voice, ask what time it is, etc. My thinking for future applications is that specific appliances should take specific commands, so there would be little intelligence needed. Oven: preheat to 350 degrees. Lights: on. Music: off. Etc.


Sure you can. Services like Alexa are just glorified paperweights and aren't reflective of real world capabilities.

If you want to build your own Jarvis, don't listen to anyone else just get stuck in. You'll be surprised just how far you can get with it.


You could also build on top of one of the open source versions, eg Mycroft.

https://mycroft.ai/


That's interesting. Basically the only reason I don't use Alexa or something like this is that I don't want a 3rd party listening to me 24/7, so an open source alternative would be of great use. But I don't understand: it says it's open-source, yet the first thing I have to do is create an account on their platform. So, what's that "open source" thing I ought to download and install? Is it just a client? Can I use the thing completely self-hosted?

Also, how does it compare in terms of usability to Alexa/Siri? In fact, I'm not even sure I've seen a good open-source TTS at this point, nothing comparable to proprietary stuff. Is this mycroft thing better?


I've recently started playing with a Home Assistant set up and came across Snips (https://snips.ai/). A bunch of Raspberry Pis with microphones running Snips is what I'm looking to try next. IMO, the offline/on-device processing is a key differentiator.


I haven't tried, but from what I've gathered you can either run it yourself on e.g. a Raspberry Pi or use their infrastructure instead if you don't want the hassle and expense. I have the same concerns.

Here's another project btw, which seems less commercial:

https://jasperproject.github.io/


Exactly. They are all developed with the intent to have one party control the apps and partnerships, which in itself limits its possibilities from the start.


Well, for the speech detection part, I think there are some projects that look promising.

I recently found https://github.com/gooofy/zamia-speech, which works for English and German. German is pretty bad atm, but English should work fine.

Nice part: you only have to spin up a Docker container and a Python script to perform offline speech to text :-) Microphone input is also supported.


If you are really asking for it to be like Jarvis, where "like" implies some reasonably similar approximation, then:

No. Absolutely not. Not even close.


Surprised that nobody mentioned this

http://jasperproject.github.io/


Shameless plug: https://github.com/synesthesiam/rhasspy

Rhasspy is inspired by Jasper, but works by having you specify intents/voice commands via a grammar (rather than via Python). It then trains a speech and intent recognition system together to recognize those commands. Recognized commands are published over MQTT or POST-ed directly into Home Assistant.
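For anyone wondering what consuming that looks like on the other end, a minimal sketch with paho-mqtt. I'm assuming the Hermes-style intent topics here, so check your own Rhasspy config for the exact topic names:

    import json
    import paho.mqtt.client as mqtt

    def on_message(client, userdata, message):
        # Each recognized command arrives as a JSON payload on an intent topic.
        payload = json.loads(message.payload)
        print("Intent:", message.topic, "slots:", payload.get("slots"))

    client = mqtt.Client()
    client.on_message = on_message
    client.connect("localhost", 1883)      # your MQTT broker
    client.subscribe("hermes/intent/#")    # assumed topic prefix
    client.loop_forever()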



You would be able to integrate Hal (http://halisback.com), which is the best assistant out there, with all the different components from different companies and make him do basically everything.


If we could, we would have. But we haven't, so we can't.


It has to have a use though. We ran a long proof of concept on voice recognition, as in voice-to-text, for office workers and had good results once the software learned to recognise speech correctly.

I think the best cases were 20-30% increased productivity for a single worker, but it was ultimately useless because we use open offices, so the project got scrapped.


Without context, this statement means nothing new should ever be done. It's a terrible axiom. Even within the context of this question, if we used this reasoning for the next 500 years we would surely never solve it.


Oh sure, but that's why you have the context. My point is that the time gap between a tool like this becoming technically possible and someone executing it successfully will be incredibly small. So the chances of the current moment in time being in that gap is tiny.


This is some oversimplified bullshit.


Like e = mc^2?


Everyone here is focusing on the voice assistant and IoT piece, but the thing that's really missing is the idea of a contextually rich personal API and some POSSE mechanism to selectively share that API with others and vice versa.

Quantified Self movement is maybe 30% of the way.

Until this space matures we're kind of at a standstill. Commercial options have to be generic or face the wrath of GDPR, data breaches, and privacy advocates.

The only way to square the circle is let people own their data completely and then incentivize them to share with your service for the benefits you provide.

I think the killer app is an Alexa/JARVIS clone that "spies on you 24/7" but keeps your digital twin 100% offline and owned by you. It's the learning and the "personal schema" that's compelling.

When it knows why you want to turn on the lights we're getting somewhere.



