Hi, I have been working on a project to see whether it's possible to create new utterances that can trigger an intent after the application has been installed (i.e. without modifying the application code). The utterance and intent definitions currently supported by the SDK take a spoken sentence and translate it into a function call whose arguments are extracted from the sentence. The new functionality I am trying to add is to trigger learning when a sentence does not fit any of the pre-defined utterances; after some question-and-answer interaction, the sentence can be learned for future use. The value I see is that this lets different users have different utterances for the same application without involving the app developer. The prototype is not tied to the Echo yet and is purely text-based, but it is good enough as a proof of concept. Please share feedback and thoughts on the direction.
Link to blog: http://sushilks.blogspot.com/2015/09/understanding-what-is-being-spoken.html
Link to code:
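For readers who want a concrete picture of the idea in this post, here is a minimal text-only sketch. It is not the prototype's code. The template syntax, the `set_timer` intent, and the helper names are all assumptions made up for illustration, and the question-and-answer step is reduced to a single `learn` call that receives the template the user agreed on.

```python
import re

# Hypothetical intent handler; the name and signature are invented for this sketch.
def set_timer(minutes):
    return f"timer set for {minutes} minutes"

INTENTS = {"set_timer": set_timer}

# Each utterance is a template with {slot} placeholders, bound to an intent name.
utterances = [("set a timer for {minutes} minutes", "set_timer")]


def template_to_regex(template):
    """Turn 'set a timer for {minutes} minutes' into a regex with named groups."""
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        # re.split with a capturing group puts slot names at odd positions.
        pattern += f"(?P<{part}>.+?)" if i % 2 else re.escape(part)
    return re.compile(f"^{pattern}$", re.IGNORECASE)


def match(sentence):
    """Return (intent_name, slots) for the first matching template, else None."""
    for template, intent in utterances:
        m = template_to_regex(template).match(sentence.strip())
        if m:
            return intent, m.groupdict()
    return None


def learn(template, intent):
    """Store a new utterance template, e.g. after a question-and-answer exchange."""
    utterances.append((template, intent))


def handle(sentence):
    """Call the matching intent, or return None so the caller can start learning."""
    hit = match(sentence)
    if hit is None:
        return None
    intent, slots = hit
    return INTENTS[intent](**slots)
```

In this sketch, `handle("wake me in 10")` returns `None` at first; after `learn("wake me in {minutes}", "set_timer")` the same sentence is routed to `set_timer`. Storing learned templates per user would give each user their own utterances without touching the application code.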
It's a nice idea in the abstract, but it's not going to happen for Alexa. You see, the biggest problem Alexa faces is recognition. This is why Alexa requires a restricted vocabulary: the more restricted the vocabulary, the better the recognition. The scheme you propose requires both perfect recognition and an unrestricted vocabulary. The technology isn't there yet. Based on my experience, Alexa is about 90% accurate in its recognition. That sounds great, but it means it's wrong one time in ten, and it's hard to have a sustained conversation with that error rate. Your scheme is not very tolerant of errors: each error would kick off a learning cycle, and there isn't any way to distinguish between an error in recognition and the user using a new phrase. Given the current error rate, that makes it unworkable in the field. Still, if this hasn't been covered before, you might consider patenting it. For patents it doesn't matter whether the idea is currently implementable or not. :-)
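To put a rough number on the "sustained conversation" point: taking the 90% figure above (one poster's estimate, not a measured value) and assuming, purely for illustration, that recognition errors are independent from one utterance to the next, the chance of getting through a whole exchange without an error drops quickly.

```python
def chance_all_recognized(per_utterance_accuracy, turns):
    """Probability that all `turns` utterances are recognized correctly.

    Assumes each utterance is recognized independently of the others,
    which is a simplification made for this sketch.
    """
    return per_utterance_accuracy ** turns


for turns in (1, 5, 10):
    print(turns, round(chance_all_recognized(0.9, turns), 2))
```

Under those assumptions a five-turn exchange is error-free about 59% of the time and a ten-turn exchange about 35% of the time, so a scheme that starts a learning cycle on every miss would be triggered often.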
This fits directly into some of the things I've been requesting. What you, and I, and I'm sure many others seem to need is as fully dynamic an interaction model as possible. The whole thing (intent schema, slot definitions, sample utterances) should be able to be generated by the skill itself and passed to Amazon. To be truly dynamic, it should be doable not just on skill installation but on the fly (for example, via a response attribute that triggers Amazon to re-query the interaction model from the skill endpoint). This would allow for such features as user-configurable and/or adaptive intents, utterances, slots, etc. My skill endpoint currently does generate the intent schema and a lot of sample utterances. The latter are based on the user's own personal setup, so they are unique to each user. Right now this feature is there so the user can cut and paste the schema and utterances into the skill setup interface. It would be really cool if they could be read automatically. On the subject of adapting to what the user actually says rather than what they are expected to say, I have a feature in my skill that performs a basic adaptation. The skill is an interface to a home automation system called HomeSeer. There are names of devices that can be controlled and events that can be triggered, and those names are taken from the user's system. Since the names aren't always things that are natural to say, I included the ability to create aliases on the fly while interacting (and also in a config file). There's no machine learning here and the user has to trigger it manually, but it works very well for me and the handful of people using it. My code remembers the last incorrect device or event name and the last correct device or event name. When the user says, for example, "alias that", it sets a permanent alias from the former to the latter. This is how I made my system correctly turn off the den lights even when Alexa thinks I'm saying "ten lights".
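The "alias that" mechanism described above could be sketched roughly as follows. This is not the skill's actual code: the class and method names are hypothetical, and the aliases live in memory here, whereas a permanent alias would need to be written to a config file or other storage.

```python
class AliasTracker:
    """Remembers the last unrecognized and last recognized names so that
    an 'alias that' command can link them. All names are illustrative."""

    def __init__(self, known_names):
        self.known = {name.lower() for name in known_names}
        self.aliases = {}            # heard name -> real device/event name
        self.last_incorrect = None   # most recent name that did not resolve
        self.last_correct = None     # most recent name that did resolve

    def resolve(self, heard):
        """Return the real name for what was heard, or None if unknown."""
        heard = heard.lower()
        name = self.aliases.get(heard, heard)
        if name in self.known:
            self.last_correct = name
            return name
        self.last_incorrect = heard
        return None

    def alias_that(self):
        """Map the last incorrect name to the last correct one."""
        if self.last_incorrect is None or self.last_correct is None:
            return False
        self.aliases[self.last_incorrect] = self.last_correct
        self.last_incorrect = None
        return True


tracker = AliasTracker(["den lights"])
tracker.resolve("ten lights")   # misheard name, not found
tracker.resolve("den lights")   # user repeats, found
tracker.alias_that()            # "ten lights" now maps to "den lights"
```

After those three calls, `tracker.resolve("ten lights")` returns `"den lights"`. Because the user triggers the alias explicitly, a one-off recognition error does not create a mapping by accident.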
To address it from the north side of Amazon's API, access would be needed to the probabilistic data that was used to form the sentence (possibly multiple candidate sentences, each with a probability attached). All of this is handled within the Echo platform and is not exposed via the API, so I think the problem is best addressed by the Amazon platform itself. However, if Amazon were to provide some of this via an API, I would volunteer to code up this framework :-) (I am sure the community would jump in as well). Here is how I see the ideal SDK: the intent space of the application is very limited, and it's very easy to quantify it in the form of function calls and arguments. The utterance space is vast and dynamic, as it changes between cultures and languages. It also needs to evolve through interaction with users as it learns their behaviour. If the utterance-to-intent binding is done within the Amazon platform in a fluid way, it has huge potential. You would be able to write applications that are supported across multiple languages and cultures without knowing much about them. As indicated by N. Fradkin, here is one possibility I can see: the application should provide clear definitions of the intents but only some basic utterances; after that, the platform can create additional dynamic utterance schemas as users interact with it. This will translate into a much better user experience.
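If candidate sentences with probabilities were exposed, which as noted above they are not, a skill could use them to tell a likely recognition error from a genuinely new phrase. The sketch below is purely hypothetical: the `(sentence, probability)` input format, the `classify` function, and the 0.8 threshold are all assumptions, not part of any Amazon API.

```python
def classify(hypotheses, known_utterances, min_confidence=0.8):
    """Decide what to do with a list of recognition candidates.

    hypotheses: list of (sentence, probability) pairs, best first
                (a hypothetical format; no such data is exposed today).
    Returns ('match', sentence) if any candidate is a known utterance,
            ('learn', sentence) if the top candidate is confident but unknown,
            ('reprompt', None) otherwise.
    """
    if not hypotheses:
        return ("reprompt", None)
    known = {u.lower() for u in known_utterances}
    # A lower-ranked candidate that matches a known utterance suggests the
    # top candidate was a recognition error, not a new phrase.
    for sentence, _prob in hypotheses:
        if sentence.lower() in known:
            return ("match", sentence)
    top_sentence, top_prob = hypotheses[0]
    if top_prob >= min_confidence:
        return ("learn", top_sentence)
    return ("reprompt", None)
```

For example, with `"turn off the den lights"` as the only known utterance, the candidates `[("turn off the ten lights", 0.55), ("turn off the den lights", 0.40)]` resolve to a match on the second entry, while `[("kill the den lights", 0.93)]` would start a learning cycle.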