Let’s be honest: the voice API to integrate apps with Amazon’s Echo is remarkably clunky. Yes, it gets the job done, but it’s neither intuitive nor natural. Asking a question of Alexa is akin to doing a search on Google; you ask Google for something and Google returns the best answers it has. What you don’t do is ask Google to check Wikipedia, or IMDB, or Stack Overflow, or Amazon to see if they know the answer to your question.

When Amazon recently added support for the Philips Hue and Wemo, they didn’t make you say “Alexa, tell Hue to turn on the outside lights” but let you say “Alexa, turn on the outside lights”, letting Alexa work out whether it needed to talk to Hue or Wemo to get the job done. Of course, when Amazon controls how everything is integrated, it is relatively easy to add support for Hue and Wemo in a way that plays well with everything else Alexa does. With third-party apps this sort of ad-hoc integration is harder to pull off, but the “tell” syntax is a poor solution.

So here is an alternative approach. I’ll start with a trivial example and then refine it into a more robust solution:

Queries

Consider every command spoken to Alexa as a search query. Each query is dispatched to all applications registered with your Echo, and each application returns a result if it can. Ideally the query only makes sense to a single app, so once Alexa has gathered up the results, it has only one valid result to play to the user. With this we get what we want: ask a question, and get the answer from the app which knows how to respond to it, without having to direct the question to the app.

Filtering

Okay, amongst various issues with this simple approach, it seems rather inefficient. There is no point in directing a query to an app which cannot possibly answer it. Fortunately Alexa already has a mechanism, “Utterances”, which describes the sort of queries each app is capable of understanding. If we use these to filter which apps to send our query to, we can greatly reduce the number of apps we need to call.

Revising our example, a query is spoken to Alexa and filtered against each app’s utterances, and for those with a match, the query is dispatched to the app. All apps respond, but only a single response is valid, which we play to the user.

Multiple answer

We might hope to get only one answer from a group of apps, but we need to handle multiple apps answering a query. Which answer do we choose?

Every app will have a certain amount of confidence that it knows the answer. At one end of the spectrum, an app might absolutely know the answer to the question “What is the temperature outside?” - It’s 50 degrees in Berkeley today - and at the other an app might have no idea - I’m sorry, I don’t know how to answer that.

There’s also some middle ground where an answer might be correct but the app is not confident in it. For example, Alexa is very fond of reporting the outside temperature if it does not understand what you say but your sentence contains the word “temperature”. Such an answer might be correct, but it has low confidence; perhaps another app has a better answer?

So when an app sends an answer, it also sends its confidence in that answer. Alexa can then choose the answer with the highest confidence.

Side effects

We also want Alexa to do things, not just answer our questions. These things, or side effects, might include playing music, turning on lights, or opening the garage door. In the scheme above we could be asking many apps to do something, but we only want one of them to actually do it.

To make sure only one acts, we need to add a final step. Once Alexa has chosen the answer, it tells each app whether its answer was accepted or rejected, allowing each to then perform, or cancel, the side effect.

So, finally …

Putting this all together we have the following:

1. Alexa receives a sentence.
2. The sentence is filtered against all registered app utterances.
3. For each matching app, the sentence is sent to it for evaluation.
4. Each app returns a response together with its confidence in that response.
5. Alexa chooses the response with the highest confidence.
6. Alexa sends an acceptance to the most confident app, and a rejection to all the others.
7. The accepted app performs any side effects.
8. Alexa plays the accepted app’s response.

This allows queries to be answered by multiple apps, with Alexa selecting only the best answer and acting on it, and it provides a more natural way to interact with Alexa and the many, many apps that can be built for it.

Thoughts and comments appreciated.
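As a rough illustration of the eight steps in the post, here is a minimal Python sketch. Everything in it is hypothetical: the `App` class, its `matches`/`evaluate`/`accept`/`reject` methods, the `dispatch` function, and the keyword-based stand-in for utterance matching are assumptions made for the example, not part of any real Alexa API.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Answer:
    text: str          # what Alexa would say to the user
    confidence: float  # 0.0 (no idea) to 1.0 (certain), self-reported by the app


class App:
    """Hypothetical third-party app; not modelled on any real SDK."""

    def __init__(self, name: str, keywords: Set[str],
                 evaluate: Callable[[str], Optional[Answer]]):
        self.name = name
        self.keywords = keywords          # crude stand-in for "utterances"
        self._evaluate = evaluate
        self.performed: List[str] = []    # side effects actually carried out

    def matches(self, sentence: str) -> bool:
        # Step 2: filter on whether any keyword appears in the sentence.
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        return bool(words & self.keywords)

    def evaluate(self, sentence: str) -> Optional[Answer]:
        # Step 4: return an answer with a confidence, or None if there is none.
        return self._evaluate(sentence)

    def accept(self, sentence: str) -> None:
        # Step 7: only the accepted app performs its side effect.
        self.performed.append(sentence)

    def reject(self, sentence: str) -> None:
        # A rejected app cancels whatever it had prepared; nothing to do here.
        pass


def dispatch(sentence: str, apps: List[App]) -> Optional[str]:
    """Steps 1-8: filter, evaluate, pick the most confident answer, notify apps."""
    candidates = [app for app in apps if app.matches(sentence)]       # step 2
    answers = []
    for app in candidates:                                            # step 3
        answer = app.evaluate(sentence)                               # step 4
        if answer is not None:
            answers.append((app, answer))
    if not answers:
        return None
    winner, best = max(answers, key=lambda pair: pair[1].confidence)  # step 5
    for app in candidates:                                            # step 6
        if app is winner:
            app.accept(sentence)                                      # step 7
        else:
            app.reject(sentence)
    return best.text                                                  # step 8


# Illustrative apps with made-up answers and confidence values.
weather = App("weather", {"temperature"},
              lambda s: Answer("It's 50 degrees in Berkeley today", 0.9))
thermostat = App("thermostat", {"temperature"},
                 lambda s: Answer("The thermostat is set to 68", 0.4))
lights = App("lights", {"lights"}, lambda s: Answer("OK", 1.0))

print(dispatch("What is the temperature outside?", [weather, thermostat, lights]))
```

Note that the sketch takes each app's self-reported confidence at face value, which is exactly the trust assumption the comments below question.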
This approach assumes a "perfect world" where all developers cooperate fairly and act only in the best interest of the user. Unfortunately, this is not how the world works. I would guess that the vast majority of apps will just respond with a 100 percent confidence level for all requests/utterances that come their way, resulting in less than trustworthy responses and actions taken by the wrong application. Imagine an utterance such as: "Schedule a call with the caterer for the surprise birthday party for my wife". I really want to know that this ends up in my personal calendar app, not the calendar app viewable by the whole family. I would feel much more comfortable with the utterance "Alexa open MyCalendarApp and schedule....". As a user, I would be very wary of having all of my utterances sent to all of the different apps I have registered to use. If I say "Alexa, turn on the alarm system", I really do not want this broadcast to all of the apps registered with my Echo. Ditto for "Schedule an appointment with my doctor". This is not something I want all my different apps to be parsing. Etc, etc. Your filtering idea would help, but as a user, I am not going to be confident that the filters are specific enough, and I would worry that confidential information ends up in the hands of the wrong apps. Also, it is assuming a lot to think that most users will really understand the filter concept. With such nagging concerns about privacy, I probably would not use my Echo. I think that "namespacing" of utterances is important so that the user is confident that the request is being sent only to the proper app.
A solution that combines the benefits and safety mentioned above is to let users "promote" whatever apps they want to the "global namespace". The global namespace enables a 3rd party app on a particular device to be launched like a 1st party app. When an app is promoted it doesn't require a user to use the launch phrase in order for its intents to be checked and invoked if matched. Problem solved! :) Stefan
Thanks for these comments; they've definitely given me more to think about in making Alexa more useful. I can't get away from the nagging feeling that if I had a personal assistant, which I like to think Alexa is, I would expect him to do the right thing, and not have to tell him how to do it. You highlight an obvious need to still be able to direct your speech to a specific app under some circumstances, but is that more an exception than a rule? If I had a personal assistant, I would trust them and so trust their judgement on whom they use to get my tasks done. Similarly, I want to be able to trust Alexa and the apps I've installed to do the right thing. It's a bit poor if I have to assume any app I've installed is out to get me. Which brings up the problem of how I establish that trust. Alexa's current scheme assumes no trust and requires us to direct our speech specifically to an app. To me that removes one of the advantages a personal assistant is supposed to provide. Unfortunately my scheme assumes complete trust in the apps you install. You point out how my scheme is unrealistic because of that, which is fair, but I think both are unrealistic in their own ways. So, how does one establish that an app they've installed can be trusted? Thinking on that ...
The idea of promoting an app is a nice one. If you decide to give that app your trust, then you essentially give it 1st party status. Other apps remain 2nd class, but you can still use them if you address them directly.
If Apple were selling Alexa, they'd review apps, which might provide some extra level of trust in them. I can imagine they'd look at an app's "utterances" to see if they were too general or not. Being Apple, I'm sure the definition of "too general" would be wonderfully opaque.
I really like this conversation because I too agree that specifying each app name every time is rather "clunky", and it would be nice if there were some better alternatives. The idea of promoted, or trusted, apps is a nice one. I still would be nervous that malicious apps could get information from me which I really don't want to share with them. Maybe there could be a way to easily switch into a paranoid mode which would require the namespacing, and a similarly easy way to drop back to a more trusting mode - the device LED could maybe be used to indicate the mode (red for paranoid, blue for trusting).
Thank you for your feedback. We appreciate your participation and interest in Amazon's Alexa AppKit developer program. We are always looking for new ways to improve the Echo and the Alexa AppKit. Your suggestions will be relayed to the development team, and as I am sure you can appreciate, we are not able to comment on any speculative information.
This kind of thing may happen in other ways. Right now, by launching an app you narrow down the context, which increases accuracy. However, I can envision Amazon someday triggering apps when it recognizes the context an app covers. For example, if you say "what's the stock price for appl", and Amazon offers a stock category, the query would be tied to an app registered as the stock provider. Or maybe a couple of providers. That way the clunkiness will eventually go away.