How about a basic LAN interface without the complexity and hoop jumping?
Here's what I want to do.

[i]1)[/i] In the Amazon app, or here on Amazon, or I don't really care where, I set up my own trigger word. Let's say it's "home stuff."

[i]2)[/i] In the same place, I set up a LAN address plus two ports, on the same subnet as Alexa. Here, it might be 192.168.1.69 and two ports, e.g. 49987 for me and 49986 for Alexa. That's it for the setup.

Then:

[b]A)[/b] I say "Alexa, home stuff foo bar."

[b]B)[/b] Alexa does speech-to-text and sends the result to 192.168.1.69:49987. I capture her IP address from this message: "alexa_ip_addr".

[b]C)[/b] My LAN app catches the message, which contains (in JSON or whatever) "foo bar".

[b]D1)[/b] I can respond with any action I like on my network, including...

[b]D2)[/b] ...sending "what the heck does foo bar mean" to alexa_ip_addr:49986, which she then text-to-speeches accordingly.

This would be SO simple to do, and so incredibly enabling, low-load, and reasonable. And since it only talks to the LAN, it needs no annoying and troublesome certificate, doesn't need the cloud, and doesn't need preparation of canned speech segments. (We can parse our own stuff; parsing is easy. I *really* don't understand why Alexa doesn't parse but instead uses these absurd predetermined speech fragments, but that's another subject. I sure don't need that.)

We're talking about a tiny bit of configuration data in the Alexa app, or an interface here on Amazon with the same, plus the most minimal networking I/O interface you can imagine. Easy-peasy. In Python, it's less than one 24-line screen of code. I don't know about Alexa's internals, but it should be similar; add a little bounds checking and we're golden. All of a sudden Alexa can be much, much more useful.

There are lots of people like me who do their own home automation. It'd be terrific if Alexa could be part of that. Look at the "simon says" feature. It's like that, only it sends the text to me, and I send the thing to say back (or don't, if I don't need to).

I pine for this. I long for this. I wish for this. Can I have this? :o)
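To make the request concrete, here is a minimal Python sketch of the hypothetical exchange described above. To be clear: no such Alexa interface exists; the ports, the JSON shapes ({"text": ...} in, {"speak": ...} out), and the use of UDP are all invented for illustration.

```python
import json
import socket

MY_PORT = 49987      # hypothetical: the port my LAN app listens on
ALEXA_PORT = 49986   # hypothetical: the port Alexa would listen on for replies

def handle_utterance(text):
    """Decide what to do with the transcribed text.
    Return a string for Alexa to speak, or None to act silently."""
    if text == "mute the theater":
        # here I'd actually mute the stereo; no spoken reply needed
        return None
    return "what the heck does %s mean" % text

def serve_once(sock):
    """Receive one datagram, capture the sender's IP (alexa_ip_addr),
    and send back a reply for her to speak if one is warranted."""
    data, (alexa_ip_addr, _) = sock.recvfrom(4096)
    text = json.loads(data.decode("utf-8"))["text"]
    reply = handle_utterance(text)
    if reply is not None:
        sock.sendto(json.dumps({"speak": reply}).encode("utf-8"),
                    (alexa_ip_addr, ALEXA_PORT))
```

That really is about one screen of code: a socket, a JSON decode, a parse, and an optional reply.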
I'm not sure I understand exactly what you want. But, it seems to me it basically exists already. Just use a Lambda passthrough, like Matt Kruse's (I have it on my Github repo:
https://github.com/sarkonovich/Alexa-Hue/blob/master/lambda_passthrough.js) Point it to your local network (via port forwarding, or use an ngrok tunnel or the like) and that's it. If you look at the repo I pointed to above, you can see how I use the passthrough to my local network to control Hue lights. But maybe (probably?) I'm missing exactly what you want. Steve
Thank you very much for your reply. I looked at the source code you pointed to, but I'm sorry, I don't understand what it is doing. I did see " 'amzn1.echo-sdk-ams.app.xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': 'http://xxxxxxxx.ngrok.io/lights' ", which leads me to think that this code is talking to Amazon or an SDK somewhere, rather than to Alexa on the LAN, but that is totally a guess. I don't speak Java[Script]; I speak Python and C and C++, so the code there is pretty opaque to me. It may be doing exactly what I suggested (or it could be a tic-tac-toe program... :)

So let me try again, and perhaps it'll be clearer to you, and then your response clearer to me. What I want is to not have to deal with anyone's API, including Amazon's. I just want Alexa, right here on the LAN, to send text to me at a LAN address and port I specify: her conversion of what someone said after catching a trigger word I specify. Then, if I send text back to her IP at the port I designate, I want her to speak that text. That's all. Nothing else. There is no reason I can think of to have an API between my code and Alexa, or for that matter, between Alexa and my code.

Let's see if I can make the functionality clear. First, I set up the IP of my server and two ports (or just one for me; hers could be fixed by Amazon). Now I can say:

"Alexa, homestuff mute the theater"

Alexa speech-to-texts that into "mute the theater".

Alexa sends it to my IP and port as specified in the Amazon app, or here in my account. (I capture her IP here, so DHCP is not a problem.)

I catch the message with "mute the theater" in it. I parse that and decide what to do.

I might send something back for her to say, at her IP and the port I chose (or that is fixed by Amazon), e.g. "excuse me?" In which case she text-to-speeches that and says "excuse me?" Or I might just mute my stereo via its TELNET interface and send nothing back, the action being the entire desired response.

That's two interfaces. One, her to my LAN; two, my LAN to her. From what I read of the Alexa dev docs, this is not an available capability. Am I wrong?
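The "mute the theater" example above could look something like this in Python. The receiver's address, port, and command syntax are invented; real AV receivers each have their own telnet-style control protocol.

```python
import socket

def mute_theater(host="192.168.1.50", port=23):
    """Hypothetical: mute an AV receiver over its raw telnet-style
    control port. The host, port, and command syntax are made up;
    check your receiver's control protocol documentation."""
    with socket.create_connection((host, port), timeout=2) as s:
        s.sendall(b"MUTE ON\r\n")

def parse_command(text):
    """Map transcribed text to (action, spoken_reply).
    A reply of None means the action itself is the whole response."""
    if text == "mute the theater":
        return ("mute", None)       # act silently
    return (None, "excuse me?")     # nothing matched; ask her to say so

def run(text):
    """Dispatch one utterance: perform the action, return any reply to speak."""
    action, reply = parse_command(text)
    if action == "mute":
        mute_theater()
    return reply
```

Parsing really is the easy part; the whole dispatcher is a dictionary lookup or a few string comparisons.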
>I *really* don't understand why Alexa doesn't parse but uses these absurd predetermined speech fragments

If you don't, then you won't understand why you can't do what you want. Accuracy. Speech to text is hard. Very hard. Even though something like 90% accuracy sounds good, in reality it means that, on average, one word in ten is wrong. That's not good enough to be usable.

There are lots of technical tricks to improve accuracy. Plenty of PhD-level research on the subject. But the one guaranteed way to raise accuracy is a restricted vocabulary. If an STT system already knows the breadth of what you could say, then it's a much easier task to work out which of those things you did say. This is why you have to define an interaction model of "absurd predetermined speech fragments".

Most people have the misconception that Alexa first does a magical conversion of speech to perfect text, and then a rather simple match of that text to an intent. This is not the case. Alexa takes the speech and matches it directly against the breadth of possibilities defined by your interaction model. It then picks the one with the highest confidence value and uses that.

So, aside from the not-inconsequential security concerns with your suggestion, that is why you can't have it "just converse" like you desire.
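A toy illustration of the restricted-vocabulary idea: score the (possibly misheard) input against every utterance the model knows, and take the highest-confidence match. This is only a cartoon; Alexa matches against the interaction model at the acoustic level, not on already-transcribed text, and the intent names and phrases here are invented.

```python
import difflib

# A toy "interaction model": the complete set of utterances the system expects.
MODEL = {
    "turn on the lights": "LightsOnIntent",
    "turn off the lights": "LightsOffIntent",
    "mute the theater": "MuteIntent",
}

def match_intent(heard):
    """Score the heard phrase against every known utterance and
    return (best_intent, confidence). With a closed set of candidates,
    even a noisy transcription usually lands on the right one."""
    scored = [(difflib.SequenceMatcher(None, heard, utterance).ratio(), intent)
              for utterance, intent in MODEL.items()]
    confidence, intent = max(scored)
    return intent, confidence
```

Even with "lights" misheard as "lites", the closed candidate set recovers the right intent, which is exactly why a constrained interaction model beats open transcription on accuracy.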
I thought someone might bring these up. Fair enough. My take:

First, the idea is to do this on the LAN. It's trivial to tell if you're dealing with the same LAN. If the LAN is set up (and remember, I specified that as a prerequisite), then she knows to only pitch to, and catch from, 192.168.1.x or similar. You can't easily spoof a back-and-forth handshake, because if you spoof your IP, the response isn't going to come to you. You'd have to compromise the entire LAN first so you could listen to everything. And if you're in that situation, such that the mal-whatever might be able to mute my stereo or change my lights or have Alexa say something surprising, this exchange isn't your real problem; in fact, it would point you to the fact that you *had* a problem. So the additional security exposure is pretty minimal. The hardware is already exposed to whatever a person would say to Alexa; there isn't any significant additional exposure in what might be sent to her putative port on the LAN. As for the other direction: she sends what you say following a keyword (which she *will* know; again, it's a prerequisite, so the STT is as easy as possible there) to the LAN, so (a) you are in control of what you do in response, and (b) you are again exposed, as Alexa or any subordinate hardware is, to exactly the same risks. If your LAN is compromised, your problems won't be centered around this simplistic exchange.

Second, I'm well aware of the limits of accuracy. I'm perfectly content accepting the input that Alexa comes up with at 90% or 80% or even 50% accuracy, because right now I have nothing. Alexa brings a decent multiple-microphone / sound-output package with a LAN interface to the environment. This could be leveraged. Having said that, no doubt some things will be more accurately recognized than others. Some experimentation is likely to turn that up. If not, nothing is lost; I don't have any capability like this now, so no change. On the other hand, if I could get it to do reasonable things, then huge gain.

Yes, if I were talking about a cross-WAN capability, HTTPS provides encryption, and that's desirable (although again, if your machine is compromised, which is what's required for your security concerns to raise their heads, then both encryption and identity are compromised as well). But that's not what I'm talking about. Just a local back and forth. Thanks for your post.
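To make the LAN-only restriction concrete, here's a sketch of the kind of source filtering my listener could do, using Python's stdlib ipaddress module. The subnet and the pinned peer address are invented for illustration.

```python
import ipaddress

# Hypothetical home subnet; in practice this could be derived from
# the machine's own interface configuration.
LAN = ipaddress.ip_network("192.168.1.0/24")

def accept_packet(sender_ip, expected_ip=None):
    """Accept only traffic from the local subnet, and optionally only
    from one pinned peer (e.g. the IP captured from Alexa's first message)."""
    addr = ipaddress.ip_address(sender_ip)
    if addr not in LAN:
        return False                # off-subnet traffic is dropped outright
    if expected_ip is not None and addr != ipaddress.ip_address(expected_ip):
        return False                # on-subnet, but not the pinned peer
    return True
```

As argued above, this doesn't defend against an already-compromised LAN, but nothing application-level does; it just keeps the exchange strictly local.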
Well, good luck. I think, fundamentally, if it were "easy and simple" then they would have done it that way to start with. They have chosen the "complexity and hoop jumping" for very specific corporate reasons. Big companies have a lot of things to balance. They have to consider every scenario that may add risk to their user base, reputation, branding, and client share. If you look at things from their perspective, they make more sense. I'd love to be proven wrong. But time will tell.
Again, thanks for your reply. I appreciate the time taken. I see it as pretty simple. As soon as there is a decent open speech recognition system, this will be inevitable, because the hardware requirements are ludicrously simple: a microphone, an amp, an A/D, and a very small computer, e.g. a Raspberry Pi. Listen, digitize, send along; take input, text-to-speech the output. The potential benefits to the local user are almost unimaginably broad, and far more easily accessible, compared with a system such as the one in place now. At that point, Alexa becomes severely disadvantaged, feature-wise, if the same capability is not available in her context.

You're absolutely right about corporate influences that are other than technology-based. This is not the first thing to come out of the box as less than it could have been, technically, for non-technical reasons. However, Amazon has demonstrated in several ways that they will, from time to time, do reasonable things even when it's not immediately obvious that it adds to the bottom line, or even when the immediate effect is negative. For instance, they allow negative product reviews; another example is their truly amazing return policies. So I thought I'd ask, where someone from Amazon might actually see it. I'm not emotionally crushed or anything. And I'm patient. :)
@Ben -- at least for now there is no open source library for voice recognition. The other day jjaquinta suggested the use of the Watson voice service, and Google also has a voice service, and Apple will soon open up Siri. But in all cases, we are stuck jumping through hoops to use someone's voice service API. That will probably change over the next 10 years, as more voice technology gets released as open source. But don't expect to see any big changes this year.
It isn't that there aren't any open source Speech To Text libraries. Here's a list from Wikipedia:
https://en.wikipedia.org/wiki/List_of_speech_recognition_software#Open_source_acoustic_models_and_speech_corpus

It is just that they are likely to be crap. Amazon has put no small effort into the pipeline they have created. This isn't easy stuff. And, however much we might complain here about the limitations of what Amazon has done (like no generic slots), it has mostly done that with the aim of increased accuracy. If you switch to one of the open source options, I think you will find you may gain in flexibility, but you will lose a lot in quality. If someone wants to try, I'd be interested in the results. I tried FreeTTS with the Java Speech API a while ago for a different project, but I could never get it to work.
@jjaquinta -- "I think you will find you may gain in flexibility, but you will lose a lot in quality." Agreed. I like the suggestion you made the other day about using the Watson Voice Service. At this point, I am feeling enough frustration with Amazon that I might give up on the Echo and instead build an Android app that relies on the Watson Voice Service. Of course, "lose a lot in quality" is an initial state. Given 6 months and a highly focused dataset, it should be possible to exceed the accuracy that Amazon offers. A reasonably funded startup, focused on a narrow set of words, should be able to highly optimize for those words. But what Amazon offers is the chance to get going quickly, with very little funding.