question

Rand M asked

Is there a performance impact as the number of utterance chars grows?

I've read that the number of sample utterance characters can be up to 200,000. With respect only to the text-to-speech piece (which Amazon owns):

- Does performance decrease as the number of sample utterance chars goes up? It seems like it would, but I haven't seen this addressed.
- Does performance decrease as the number of concurrent users of a single skill increases? I would think not, but have to ask...
- Does performance decrease as the number of concurrent users across *all* skills increases (i.e. the global number of concurrent Echo users)? I hope not, but have to ask...

Any thoughts on how to stress test the TTS piece? For example, using the pizza example (not my skill, btw): what happens if tens or hundreds of thousands of people order pizza all day long on Labor Day from different parts of the country using different utterances? Testing the web piece of this is commonplace, but what about the voice piece? It's particularly concerning since there is an 8-second timeout.
alexa skills kit, submission testing certification

jjaquinta answered
There are skills in the proof of concept samples on GitHub (https://github.com/jjaquinta/EchoProofOfConcepts) with small utterance files (ups, watson) and large utterance files (kk, cc). I haven't noticed any performance difference between them.

In any event, how well the system performs from the moment the word leaves someone's mouth to the moment your skill is called is Amazon's responsibility. It isn't your code. You can't affect it. You can't change it. So I wouldn't put it on your test plan. (And Amazon is Queen of Scalable services. If they've implemented this with their own services, I doubt anything could stress them!)

If you are worried about stress, the point at which to worry is your own application.

If you are implementing a Lambda function, well, those are supposed to be scalable. You do need to make sure you've designed them right. Sometimes you get a new JVM, sometimes you don't, so global variable caches can appear to work some of the time, but you can't rely on them. If you want to persist or cache something with a Lambda function, you need to use a DB service. Read about them carefully. DynamoDB reads, for example, are "eventually consistent" by default, which means that if you do a read right after a write, you may not get what you just wrote. Design accordingly.

If you are doing a Web Service endpoint, it is up to you to make sure that your business logic is thread safe. You can implement memory caches here, just be sure that you synchronize your access to them. And if you have to synchronize other areas, make sure you can't run into deadlock.

So there are a lot of things to consider for the performance of your skill, but the TTS part is not one that I would spend any time on.

Rand M answered
Thanks a lot, Joseph. I thought about this after posting and realized there was only one question I could do anything about. Since invocation, whether for single skills or all skills (as well as single users or all users), would be managed by Amazon with appropriate scalability, the concern about performance effects under load can be set aside (for now)...

The only real question is whether the size of the utterance file matters with respect to performance, and whether an effort should be made to limit this size. I have to assume Amazon came up with the 200,000 char figure based on acceptable performance, so this is mostly technical curiosity.

I'm not familiar with the "utterance matching" technology used (or the caching, pruning, balancing, etc. that might be employed), but in my mind there is some kind of searching/matching algorithm between what Alexa heard and the universe of sample utterances, leading me to believe that the fewer utterances there are, the better performance would be. Of course, this could work in reverse if Alexa "missed" and then had to make assumptions, run algorithms, make service calls, etc., based on a lack of utterances.

When your Knock-Knock skill goes viral, hopefully you can let us know the performance metrics for a large utterance file in prod ;)

jjaquinta answered
I'm not privy to how Amazon have written the text matching for their Alexa service, but I've worked with STT off and on over the years and have some guesses.

Firstly, they are going to take your utterance file and parse out all values for all slots. Then they are going to take the phrases and arrange them into a decision tree, kind of like you would for a spell checker. I.e. phrases that start with overlapping sub-phrases share a parent node. Traversing the tree from the root to any given node will result in a specific intent + slot values.

Then, when the speech comes in, they will convert it first to phonemes, and then into hypothetical words. Each hypothesis has a confidence value. Most people think that speech recognition just magically works out what words to use. Engines that work that way are the dumb ones. The smart ones (e.g. Watson Speech To Text on Bluemix) will return to you possible interpretations of the speech, each with a confidence value attached.

So, you have a tree of possible phrases for the speech. You have a tree of valid things that could be said. Alexa goes through the trees and finds the highest-confidence rendering that matches with the highest-precision possibility. Just how you balance the two is magic, and is where they are going to be tweaking things.

The general approach is pretty standard in the industry. The utterance preparation can be done at submit time. The speech analysis has to be done at run time. The resulting trees are going to be finite and will not take consequential time to match. So the bottleneck is going to be at the recognition level. That's up to Amazon to solve.

The size of the utterance file isn't going to matter; what matters is the depth and number of branches of the tree it results in. And that's never likely to be that large. Assuming an even distribution, each doubling in size of your file only adds a single level to the tree. It's more complicated than that, but, really, I don't think it is going to be where the bottleneck is.
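To make the guess above concrete, here is a rough sketch in Python of a phrase tree plus confidence-ranked hypotheses. It is purely illustrative and is not how Amazon actually implements matching. Slot handling is left out, and the intent names, sample phrases, and confidence numbers are all invented for the example.

```python
class Node:
    def __init__(self):
        self.children = {}  # next word -> Node
        self.intent = None  # set when a sample utterance ends at this node


def build_tree(samples):
    """Build a word-level prefix tree from (intent, phrase) pairs."""
    root = Node()
    for intent, phrase in samples:
        node = root
        for word in phrase.split():
            # Phrases that start with the same words share the same nodes.
            node = node.children.setdefault(word, Node())
        node.intent = intent
    return root


def lookup(root, words):
    """Walk the tree one word at a time; cost depends on phrase length."""
    node = root
    for word in words:
        node = node.children.get(word)
        if node is None:
            return None
    return node.intent


def best_match(root, hypotheses):
    """Return (intent, confidence, text) for the most confident hypothesis
    that is also a valid sample utterance, or None if none of them match."""
    for confidence, text in sorted(hypotheses, reverse=True):
        intent = lookup(root, text.split())
        if intent is not None:
            return intent, confidence, text
    return None


# Invented sample utterances (this part could be prepared at submit time).
samples = [
    ("OrderPizza", "order a pizza"),
    ("OrderPizza", "order a large pizza"),
    ("TrackOrder", "where is my pizza"),
]
tree = build_tree(samples)

# Invented recognizer output: alternative transcripts with confidences.
hypotheses = [
    (0.61, "order a pisa"),
    (0.58, "order a pizza"),
    (0.20, "odor a pizza"),
]
match = best_match(tree, hypotheses)
```

In this toy version the top hypothesis ("order a pisa") is not a valid phrase, so the second one wins. Each lookup walks at most one node per spoken word, however many sample utterances went into the tree, which is the reason the file size itself is unlikely to be the bottleneck.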

justin answered
Hey guys, a quick note on the previous question:

> Does performance decrease as the number of sample utterance chars goes up

No, a larger number of sample utterances does not have a negative performance impact. On the contrary, it is recommended that developers provide a large number of possible utterance combinations. Quoting from the link below:

"Try to provide several hundred samples or more to address all the variations in slot value words as noted above"

https://developer.amazon.com/public/solutions/alexa/alexa-skills-kit/docs/defining-the-voice-interface#Recommendations for Defining the Sample Utterances

Developers should also balance out their sample utterances across intents.