question

jjaquinta asked

Ambiguous matching

I've found something that has helped get over the trouble I've had with Alexa matching the phrases I need to run my app. The first time through, I had a lot of utterances that specified each and every input that was valid. The trouble was that after a certain number, Alexa just takes them as general guidance: it would pass in anything that vaguely related to what was said. The second time was right after they introduced custom slots. Now I could specify the exact values that were valid. Unfortunately, Alexa will still pass in values that are not on your list. >_< Crazy, but there you go.

To get around this, I had a multi-pass matching system. First I would look for an exact match between what the user said and what came in. Then I would look for an "ordinal match", i.e. if the user said "first", "second", etc. The third, and last, resort was to just compare the first letter.

But today, while researching IPA and phoneme generation, I discovered specific algorithms that were created for matching things phonetically. Note: not converting them into phonemes, but doing a phonetic match. The oldest is Soundex, which goes back to 1911. There are also a bunch of more recent ones, and they are pretty common in the industry. The older ones were pretty simple to implement, but I thought I would look to see if someone else had done it first. Not only did I find an Apache library with many of the more modern ones done, I even discovered I [i]already had the library[/i] in my skill!

So, for city matching, I test the word the user said against each valid option phonetically. Here is a snippet using Double Metaphone:

[code]
public static DoubleMetaphone mDoubleMetaphone = new DoubleMetaphone();

public static String doubleMetaphoneMatch(String word, List<String> vocab) {
    for (String w : vocab) {
        // isDoubleMetaphoneEqual encodes both raw strings itself;
        // check the primary encodings first, then the alternate encodings.
        if (mDoubleMetaphone.isDoubleMetaphoneEqual(word, w, false))
            return w;
        if (mDoubleMetaphone.isDoubleMetaphoneEqual(word, w, true))
            return w;
    }
    return null;
}
[/code]

These algorithms work by digesting the word down into a couple of characters. Consonants that sound alike (e.g. b, f, p, v) are collapsed together.

In my second use case, I had to work out what type of drone the user said. Since "attack drone", "attacker drone", "attacker", etc. all get encoded the same, it was simpler to just compare against a pre-encoded list.

[code]
private static final String[][] DRONES_METAPHONE = {
    {"ATKT", "ATKR"},
    {"TFNS", "TFNT"},
    {"FKTR"},
};

private static Integer parseDrone(Slot d) {
    String drone = d.getValue();
    // Encode what Alexa heard and look it up in the pre-encoded table.
    String txt = PhoneticMatchLogic.mDoubleMetaphone.encode(drone);
    for (int i = 0; i < DRONES_METAPHONE.length; i++)
        for (int j = 0; j < DRONES_METAPHONE[i].length; j++)
            if (DRONES_METAPHONE[i][j].equals(txt))
                return i;
    return null;
}
[/code]

So now, when Alexa throws something boneheaded at me like "attack drums" instead of "attack drones", or "donkey" instead of "Doncaster", it matches right.

[b]References:[/b]
Soundex: https://en.wikipedia.org/wiki/Soundex
Apache Commons Codec: http://commons.apache.org/proper/commons-codec/archives/1.7/apidocs/index.html
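To get a feel for what those encodings actually look like, here is a minimal, stand-alone sketch (assuming Apache Commons Codec is on the classpath, as above; the class name is made up for illustration) that just prints the primary and alternate Double Metaphone codes for a few of the phrases mentioned in this thread:

[code]
import org.apache.commons.codec.language.DoubleMetaphone;

public class EncodeDemo {
    public static void main(String[] args) {
        DoubleMetaphone dm = new DoubleMetaphone();
        String[] phrases = { "attack drone", "attack drums", "Doncaster", "donkey" };
        for (String phrase : phrases) {
            // encode() returns the primary code; doubleMetaphone(..., true) the alternate.
            System.out.println(phrase + " -> "
                    + dm.encode(phrase) + " / "
                    + dm.doubleMetaphone(phrase, true));
        }
    }
}
[/code]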
alexa skills kit

Steve A answered
That's cool. Thanks for passing it along. Did you happen to find anything that *did* do phoneme generation? I'd love that! I've been interested in using SSML for pronunciation of foreign words, but haven't been able to find a resource that would give me an IPA transcription of foreign words. (There seem to be a couple of resources for a very select few languages -- English, in particular -- but nothing that would work with an Alexa skill.) So, while the implementation of SSML seems to have huge potential, and is helpful for fixing up the pronunciation of a few discrete words, I don't see how to use it to, e.g., allow Alexa to speak passable French, Spanish (or Telugu, etc.).
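For what it's worth, the SSML route boils down to Alexa's <phoneme> tag. Here is a minimal sketch of a helper (the ipaPhoneme name is just for illustration, it isn't from any SDK) that wraps a word with an IPA transcription you supply yourself:

[code]
// Illustrative helper: wraps a word in an SSML <phoneme> tag so the TTS
// engine speaks the supplied IPA transcription instead of its default
// pronunciation. Finding a good IPA string is still up to you.
public static String ipaPhoneme(String word, String ipa) {
    return "<phoneme alphabet=\"ipa\" ph=\"" + ipa + "\">" + word + "</phoneme>";
}
[/code]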

jjaquinta answered
That's what I was researching last week. :-) I found a number of web sites that do Text->IPA in a number of languages. I wrote a playback skill for Alexa where I could enter text (including SSML) into a web page and it would read it back to me when I invoked the skill. In short: it was terrible.

I'm sure you've pasted French into an English TTS engine. It's [i]almost[/i] like that. It sounds like an American speaking high-school-level French. My wife said it was... understandable, but not really the quality you would want in a skill. The Italian and German sounded just as bad. Chinese was unintelligible, even with the IPA symbols for tones. I would have thought it would handle Japanese much more easily, since the pronunciation is so simple, but it was still pretty cringeworthy.

Tellingly, when I converted English into IPA and got Alexa to pronounce that, it also sounded bad. Clearly Alexa is doing a [i]lot[/i] of post-processing to make the raw phonemes sound like a human speaking. When you go down to IPA you lose all of that. This makes the feature almost useless. I can go find the links if you want, but I really don't recommend it.

Right now my ambition is to create ONE WORD (Captain) that sounds like it's being pronounced in a French accent. Figured it would add color to the narrator in Starlanes. But I'm going to have to learn IPA and piece it together by hand.

This week's research turned up "espeak". (Google it for a link.) That's a stand-alone program to read text, but there are some command line switches you can use to make it convert to IPA. I'm pretty sure it has the same faults, though. Let me know if you have more success than me.
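For anyone who wants to try the espeak route from a skill backend, here is a rough sketch of shelling out to it from Java and capturing the IPA output. It assumes a build of espeak/espeak-ng that supports the -q, -v, and --ipa switches (check espeak --help on your install); the class and method names are made up:

[code]
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class EspeakIpa {
    // Runs espeak quietly (-q, no audio) with the given voice/language (-v)
    // and captures its IPA rendering of the text from stdout.
    public static String toIpa(String text, String voice) throws Exception {
        Process p = new ProcessBuilder("espeak", "-q", "-v", voice, "--ipa", text)
                .redirectErrorStream(true)
                .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null)
                out.append(line.trim()).append(' ');
        }
        p.waitFor();
        return out.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // e.g. get a French rendering of "capitaine" to paste into a <phoneme> tag
        System.out.println(toIpa("capitaine", "fr"));
    }
}
[/code]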

Steve A answered
Thanks. That's helpful. I'm going to set up an espeak server for Alexa to query to get the IPA for utterances. I'll let you know if the results are significantly improved. Steve

jjaquinta answered
Oh, if it's at all tolerable... IBM have made some of their language translation engines available as Bluemix services. You could wire them up and create an audio translation skill. I did this many years ago when I was working with machine translation. The cumulative error rate between TTS, translation, and STT was horrendous. I'm not sure it would be much better now, but it might be good for a laugh.

Galactoise answered
My team did a hack week project about a month ago that did something really similar. Our idea was to build a skill that could query various Wikia wikis for information, and our primary PoC was on the Star Wars wiki (Wookieepedia). The interesting thing about the Star Wars space is that a huge percentage of their words are made up, so we had to figure out a way to take the English words that Alexa was hearing and convert them to our actual proper nouns from the SW universe.

We're planning on doing an in-depth writeup at some point, but the short version is that we had a super awesome disambiguation step that used double metaphone, like yours, and then did a Levenshtein distance calculation against the known space of article names indexed by their double metaphone values. If we came up with a single match that stood out, we took that one. Otherwise, we fell back on some other disambiguation techniques.

One thing that we noticed during this project was that it would've been super helpful if we could've added extra attributes to the slots we defined. We had certain slots that we knew were going to be ambiguous, and others that would never be ambiguous. We didn't want to have to do a hard mapping of slot names requiring post-processing in our service, so instead we ended up prefixing our slots with something like "resolveMe_". Then whenever we saw that string on a slot name we'd send it through our phonetic stuff and rename it to strip the prefix. It would've been much easier if we could have just had a boolean flag passed with the slot, though...
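Not their actual code, but roughly how that disambiguation step could look: a sketch that buckets known article names by their Double Metaphone code and then picks the bucketed name with the smallest edit distance to what Alexa heard (the "does one match stand out" check and the fallback techniques Galactoise mentions are omitted; all names here are illustrative):

[code]
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.codec.language.DoubleMetaphone;

public class PhoneticIndex {
    private final DoubleMetaphone dm = new DoubleMetaphone();
    // Known article names bucketed by their Double Metaphone code.
    private final Map<String, List<String>> index = new HashMap<>();

    public void add(String articleName) {
        index.computeIfAbsent(dm.encode(articleName), k -> new ArrayList<>()).add(articleName);
    }

    // Returns the bucketed name closest to what Alexa heard, or null if
    // nothing shares the heard phrase's phonetic code.
    public String resolve(String heard) {
        List<String> candidates = index.get(dm.encode(heard));
        if (candidates == null)
            return null;
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int d = levenshtein(heard.toLowerCase(), candidate.toLowerCase());
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    // Plain dynamic-programming edit distance between two strings.
    private static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++)
            prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
[/code]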