Sushil Singh asked

A developer's feature request (Purely my views on my experiences with Echo)

I posted a detailed feedback blog at http://sushilks.blogspot.com/2015/08/amazon-echo-sdk-developers-review.html. Here are the features I would like to request, starting with one that's somewhat easy to add. (These are purely my views.)

- Location Awareness

I was talking with some friends and they asked me what the difference is between Echo and Google-voice. From my interactions, the primary difference I see is that Echo is not tied to a person the way cell phones are. Instead it's tied to the home: it has a location where it exists and does not move. It's a subtle but very important distinction, and it changes the behavior of the people interacting with it. Everyone in the house knows where Echo is, and when referring to it they will look in that direction.

However, it's a pity that the current Echo itself is not aware of this. I say this because I have multiple Echo devices in my home and I am unable to give them any context about where they are. For example, if I am in my bedroom and I say "Alexa turn on the lights", it's obvious that I am referring to the lights in my bedroom and not in the living room. Instead the system requires me to name my bedroom light 'bedroom-light' and my living room lights 'living-room-light', which I find very limiting and annoying.

In the future I would want the device to be aware of the location where it's installed and to understand what's within the room and what's outside of it. There are also restrictions that need to be placed, as one room should not have control of devices in another room.

In the short run, a lot can be achieved if the different devices in the home are allowed to have their own groups. When naming devices in the room I would like to use hierarchical names, e.g. bedroom/front/light, bedroom/main/light. This way, when I am talking to the Echo in the bedroom and ask it to turn on the light ("Alexa turn on the light"), it can find the closest match within the room and turn on all the items that match (in this case, both lights). I should also be able to say "Alexa turn on main light" to turn on a specific light. When in another room I would need to say "Alexa turn on bedroom main light" to address it specifically.

- Ability to identify people

There is another limitation with this device: it's unable to determine who it is talking to. This prevents it from personalizing any of the content. Many people live in the house and they will all interact with the device; the interaction could be very rich if it were personalized to the person.

- Support for sending asynchronous events to Echo

There is no mechanism to activate Echo from code running outside, i.e. you can't write a program that makes Echo ask a question. Someone has to talk to Echo first before it can activate a program. There are security concerns with this and it's not an easy feature to design; the APIs have to be well thought out. But it would be nice to have in the future.

- Language processing capabilities in the SDK

I also spent some time writing apps using the SDK. After a few iterations of application writing, it left me feeling that something is missing in the text processing. It's very good at parsing what was said, but there is no framework to extract the meaning of what was said. Designing the application around specific "utterances" is very limiting; it makes the application rigid and fragile. Different people have different ways of saying the same thing, so it's very easy to break the application. This may be fine for very basic tasks, but as the complexity of the task increases it becomes hard to interact with the device.
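The hierarchical naming idea in the Location Awareness item could be sketched roughly as follows. This is only an illustration of the proposal, not anything the Echo supports: the `resolve` function, the device list, and the room labels are all hypothetical names made up for this example.

```python
def resolve(devices, echo_room, spoken):
    """Return the device paths that match the spoken words.

    Hypothetical sketch: devices are named room/position/type, and the
    Echo is assumed to know which room it sits in (echo_room).
    """
    words = spoken.lower().split()

    def matches(path):
        parts = path.split("/")
        # Every spoken word must appear as one component of the path.
        return all(word in parts for word in words)

    # Prefer devices in the same room as the Echo that heard the command.
    local = [d for d in devices if d.split("/")[0] == echo_room and matches(d)]
    if local:
        return local
    # Otherwise the speaker has to name the other room explicitly.
    return [d for d in devices if matches(d)]


devices = ["bedroom/front/light", "bedroom/main/light", "livingroom/main/light"]
print(resolve(devices, "bedroom", "light"))
print(resolve(devices, "bedroom", "main light"))
print(resolve(devices, "livingroom", "bedroom main light"))
```

The three calls correspond to the three spoken commands described in that item: a bare "light" in the bedroom, "main light" in the bedroom, and "bedroom main light" from another room.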
alexa skills kit

Ben answered
I agree. I have 3 Echos in my house but can only use 2 of them due to the limited number of wake words. My solution to location awareness is to use number sets for each room. My bedroom ranges from 0 - 10:

0 = All devices in the room
1 = First bed lamp
2 = Second bed lamp
3 = Lava lamp

For the living room the range is 20 - 30, so all devices are called 20, then I start going by individual numbers for each device. This room numbering structure also allows me to give common numbers, for instance the roof lights being number 5 in the bedroom and 25 in the living room. Oh, I don't actually call the device "1"; I call it "One", using the word to specify the number.
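As a rough sketch of the numbering scheme described above, each room gets a base number and each device an offset within the room, with offset 0 meaning "all devices". The names `ROOM_BASE`, `device_number`, and `room_of` are made up for illustration.

```python
# Base number for each room; each room owns the range base .. base + 10.
ROOM_BASE = {"bedroom": 0, "living room": 20}


def device_number(room, offset):
    """Return the number to speak for a device at the given offset in a room."""
    if not 0 <= offset <= 10:
        raise ValueError("offset must be between 0 and 10")
    return ROOM_BASE[room] + offset


def room_of(number):
    """Map a spoken number back to (room, offset); offset 0 means all devices."""
    for room, base in ROOM_BASE.items():
        if base <= number <= base + 10:
            return room, number - base
    raise ValueError("number is not assigned to any room")


ROOF_LIGHT = 5  # the shared offset used for the roof lights in every room
print(device_number("bedroom", ROOF_LIGHT), device_number("living room", ROOF_LIGHT))
```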

jjaquinta answered
The first three of these are some of the most requested features for The Echo. My personal opinion is that they will not do these things for a variety of technical (identity) and privacy (location, push) reasons.

Your fourth one, "Language processing capabilities in the SDK", is new. I think the technical problems with this are the same as for identity, i.e. the underlying process is inaccurate as is, so it isn't likely that even more information can be pulled from the same data. I looked around for a 3rd party toolkit to do something like that for some unrelated patent research and didn't find anything. So, even dismissing the problem with recognition accuracy, I'm not sure this is even possible.

For contrast, you might look at some of the Watson services provided on IBM Bluemix. I hooked up their Q&A service and did a video here: https://www.youtube.com/watch?v=ufrGo_JUeEg

The closest thing to what you want is the Natural Language Classifier: https://console.ng.bluemix.net/catalog/natural-language-classifier

The Dialog one also seems interesting, but would probably run afoul of utterances: https://console.ng.bluemix.net/catalog/dialog

I also don't think utterances are quite as limiting as you paint them to be. Not that there aren't many ways to improve them, but there is still a lot you can do with them. I developed what the cert team called the most complicated skill they had seen by following a methodical design practice. First I wrote down a large number of ideal conversations. These gave what I considered to be the perfect interactions between Alexa and the user. I did a pass through these to identify the slots and group them into utterances. Then another pass to group those utterances into intents. Then another pass to construct a state machine table to drive the business logic for sequencing the intents. In the end I was able to incorporate the functionality I wanted within the limitations there.

It's a process that should be able to work with just about any application. I walk through these steps in more detail in my book ( http://www.amazon.com/How-Program-Amazon-Echo-Development-ebook/dp/B011J6AP26 ), but the above is the gist of it.
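To make the last step of that process concrete, here is a minimal sketch of what a state machine table for sequencing intents might look like. It is not taken from the book or from any real skill; the states, intent names, and responses are invented for illustration.

```python
# (current state, intent) -> (next state, response template)
# All states and intent names here are hypothetical.
TRANSITIONS = {
    ("start", "TurnOnIntent"): ("await_room", "Which room?"),
    ("await_room", "RoomIntent"): ("start", "Turning on the {room} light."),
    ("await_room", "CancelIntent"): ("start", "Cancelled."),
}


def step(state, intent, slots=None):
    """Advance the conversation by one intent and return (next state, response)."""
    key = (state, intent)
    if key not in TRANSITIONS:
        # Unknown combination: stay in the same state and ask again.
        return state, "Sorry, I did not understand that."
    next_state, template = TRANSITIONS[key]
    return next_state, template.format(**(slots or {}))


state, reply = step("start", "TurnOnIntent")
print(reply)
state, reply = step(state, "RoomIntent", {"room": "bedroom"})
print(reply)
```

Keeping the sequencing in a table means a new ideal conversation mostly adds rows rather than new branching code.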

Sushil Singh answered
Hi jjaquinta,

Thanks for the feedback. I agree with you about utterances being very capable. The part that bothers me is that, as a developer, you have to think a lot and come up with all the combinations that make sense. With careful design this can be addressed, but there is a cost to it, and once the application is designed and deployed the predefined structure remains. Also, based on geography etc., the application will need different utterances, since folks in different regions say the same things in different ways.

This topic is very interesting to me, so I have been spending some time researching it. I am not an expert in this field and initially the problem seemed too difficult to address. However, after some investigation I found many tools in the open source community that address this problem. Here is what I am trying to investigate:

As the 1st step: what I would like to achieve is that when a command is sent to Alexa, it deconstructs the grammar behind the command, i.e. verb, subject, object etc. After that it should be able to extract what it is being asked to do using the verb/subject/object structure; the order of the words etc. is handled by the NLP engine. I have been playing with this concept and was able to use the Stanford NLP libraries to put a prototype together. I will post the code for it as soon as I can clean it up.

As the 2nd step: when a command is not understood by Alexa, I want it to be able to ask the user what the intent behind that command is, in the form of some questions and answers. Once the user provides the answers, it should learn the command so that next time similar commands are understood automatically.
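A very reduced sketch of the 2nd step might look like the following. It is not the prototype mentioned above and uses no NLP library: it only remembers exact phrases (after normalizing case and spacing), so it does not handle "similar" commands. The `CommandLearner` class and the `ask` callback are hypothetical names.

```python
class CommandLearner:
    """Remember which intent a phrase maps to, asking the user when unsure."""

    def __init__(self, known=None):
        # Maps a normalized phrase to an intent name.
        self.known = dict(known or {})

    @staticmethod
    def normalize(text):
        # Lower-case and collapse whitespace so trivial variations match.
        return " ".join(text.lower().split())

    def interpret(self, text, ask):
        """Return the intent for text; call ask(text) once if it is unknown."""
        key = self.normalize(text)
        if key not in self.known:
            # In a real skill this would be a spoken question to the user.
            self.known[key] = ask(text)
        return self.known[key]


learner = CommandLearner({"turn on the light": "TurnOnIntent"})
print(learner.interpret("Lights please", lambda text: "TurnOnIntent"))
print(learner.interpret("lights   PLEASE", lambda text: "never asked"))
```

The second call does not invoke the callback, because the phrase was learned from the first one.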

Sushil Singh answered
One more comment on the first item, "Location Awareness": I think there is some low-hanging fruit here. If there were a way to create groups that are local to each Echo device, a large portion of this could be achieved. I could create a group with the same name on different devices and have them take different actions, i.e. the group "light" in the bedroom would point to bedroom-light, and the group "light" in the living room would point to livingroom-light.

jjaquinta answered
>was able to use stanford NLP libraries to put a prototype together

Good luck with that. I did the same thing when I was interviewing with The Echo team, convinced there was a better way of doing it. I thought something like what was done with LambdaMOO would be useful and these libraries could help. Unfortunately, all the libraries really did was break the sentence into a tree and tag each word with a type of grammar. They didn't really do the heavy semantic lifting. So I gave up and decided to go with the limitations of the system.

The closest I've come professionally to NLP was with machine translation. One of the big issues there is that it works really great if the input is really great. But, like with so many things, garbage in -> garbage out. The quality of Alexa's STT will introduce an error rate that will tend to throw this sort of approach seriously off kilter. The utterance schema they have may not be that sophisticated but it is pretty error sensitive and isn't that bad an approach given the current quality of Alexa's STT.

>when a command is not understood by Alexa it should be able to ask the user what is the intent behind that command

I'm not even sure how you would do this, other than by magic. I can't even work out the intent behind the utterances of some people I know, and they have a full fidelity verbal interface! I can't imagine how a machine would do this.

>One more comment on the First Item "Location Awareness"

The limitations are not technical; they are with respect to privacy. It would be easy to do, but the blowback from customers wouldn't be worth it.

Sushil Singh answered
Let me give it some of my time. It's possible that it may be too complicated or too difficult, but it's worth a try to see what's possible. In either case I will report back with some findings, possibly in a week or so. I have some thoughts on the topic that I am working with, but feel free to suggest anything specific (tools or ideas) you would like me to investigate.

jjaquinta answered
If you happen to have my book ( http://www.amazon.com/How-Program-Amazon-Echo-Development-ebook/dp/B011J6AP26 ), I'd love to see a tool that implements/supports the design process outlined in the Design section. I.e. it can take in a bunch of "ideal conversations" and then (possibly with user guidance) output the intents and utterances. Possibly even a few EchoSim compatible test scripts for good measure. It could even stub out the code for the state machine needed to run the given conversations. That would be pretty cool and really accelerate skill development.
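A tiny starting point for such a tool might look like this. It is only a sketch under strong assumptions: the "user guidance" is given up front as an intent label per user sentence plus a list of known slot values, and the `{SlotName}` placeholder style and all function names are chosen for illustration rather than taken from the book or the Alexa tooling.

```python
import re


def to_utterance(sentence, slot_values):
    """Replace known slot values in a user sentence with {SlotName} placeholders."""
    result = sentence.lower()
    for slot, values in slot_values.items():
        # Try longer values first so a short value does not split a longer one.
        for value in sorted(values, key=len, reverse=True):
            pattern = r"\b" + re.escape(value.lower()) + r"\b"
            result = re.sub(pattern, "{" + slot + "}", result)
    return result


def build_model(conversations, slot_values):
    """Group de-duplicated utterance templates under their intent names."""
    model = {}
    for intent, sentence in conversations:
        template = to_utterance(sentence, slot_values)
        templates = model.setdefault(intent, [])
        if template not in templates:
            templates.append(template)
    return model


# Hypothetical user lines from a few "ideal conversations", labelled by intent.
conversations = [
    ("TurnOnIntent", "turn on the bedroom light"),
    ("TurnOnIntent", "turn on the kitchen light"),
    ("TurnOffIntent", "switch off the bedroom lamp"),
]
slot_values = {"Room": ["bedroom", "kitchen"], "Device": ["light", "lamp"]}
for intent, templates in build_model(conversations, slot_values).items():
    for template in templates:
        print(intent, template)
```

The two "turn on" sentences collapse into one template because they differ only in slot values, which is the grouping pass described in the design process.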

Sushil Singh answered
Thanks, I will give it a read and will ping you back if I have any questions.

Sushil Singh answered
Hi, I created a working prototype as a proof of concept for parsing sentences and for learning. The code is at https://github.com/sushilks/nlpUnitConverter, and the associated blog post is at http://sushilks.blogspot.com/2015/09/understanding-what-is-being-spoken.html. This is a work in progress, so it's a bit hacky, but it captures the concepts. Please have a look and provide thoughts/feedback.