Computer voices

If you are a geek like me, and you hear voices in your head, are they computer-generated voices?  No need to answer, but I have been thinking about computer-generated speech recently.

A few weeks ago I remembered that my dad had brought home a 45 rpm record when I was a kid and he was very excited to say that it was the first recording of a computer singing a song.  I may actually have the record, but the recording is still available on the web and it has an interesting history.  It turns out that it wasn’t IBM who made the first computer sing, but rather a researcher named Max Mathews who worked at . . . AT&T Bell Labs, which is where I worked as a contractor before leaving to start Gold Systems.  A shiver just went down my spine to think how my dad handed me that record so many years ago, and now I’ve worked at Bell Labs and continue today to be involved with speech recognition and applications that use speech synthesis.  And to this day, my company works with Avaya, the grandchild of AT&T.

But the coincidences continue – my friend Verna sent me a link to what may be the first scanner that plays music.  Mostly because she knows I’m a geek and a musician, so what could be better?  Why do people spend time doing things like this?  Because they can and who knows what it will lead to.  I’m sure people asked Mr. Mathews why he was wasting his time and what must have been very expensive computer resources to get a computer to sing a song.

And now for the best coincidence of all.  A few weeks ago my sister and I were trading song names and memories via email.  She remembered an old favorite song book, which I found used on amazon.com and ordered just a minute or two before she was ready to send it to me as a gift.  This book was published in 1952 and it has a great foreword that talks about the songs that were sung in America around the Civil War and the turn of the century.  (And to me, the turn of the century still means going from the 1800s to the 1900s.)  I received the book a week later (thanks amazon.com and an independent book seller somewhere!) and started to flip through it.  Guess what the VERY FIRST SONG was in the book?  Daisy Bell (A bicycle built for two) by Harry Dacre.  That’s the song that the computer sang, thanks to Bell Labs, that I listened to as a kid and wondered about what other amazing things a computer could do.

Here’s a site with more information on the song. According to that page, Alexander Graham Bell also used the song in a demo, so that’s probably where Mr. Mathews got the idea to use it in his research.  Here’s more on Mr. Mathews – he looks like a fun guy!  And I just found a link to the very same recording I listened to as a kid.  Scroll down until you see "Daisy".  The singing only happens after a nice long intro by a second computer that was generating the music.  You see, it does matter what music your kid listens to.

Tech-time flies

My friend and Editor-at-Large, Verna Wilder, writes about her first job as an operator at The Phone Company in San Francisco.  Verna always tells a great story.  It wasn’t that long ago that long distance calls were exotic and expensive.  Another friend of mine was telling me about being in college and how his uncle would allow my friend’s girlfriend to use his phone on Sundays so they could keep in touch.  He said the thought of it would help get him through the week, and he wonders if college students today are missing something with their instant communications.

I’m still fascinated by the idea that by dialing a series of numbers, I can make a bell ring almost anywhere in the world.  Yes, I’m easily amused.  Of course now I don’t have to "dial" or punch buttons – I can just say the name of the person I want to talk to (assuming they are in my contacts) and I’ll be connected.  Hmm, I think we’ve just come full circle, haven’t we? 

[Image: crank telephone]

First post for the Kauffman Foundation and a quick update

I just made my first post to the Kauffman Foundation’s eVenturing blog.  If you are interested in entrepreneurship, I hope you’ll subscribe to their feed.  Ken Berlack does a daily post of interesting articles, blogs and news about entrepreneurship, and he just did a series of posts about his time last week at DEMOfall ’06.

I’ve made some more progress on the FJ Car Computer project.  Streaming Internet radio (look out, XM!) and Virtual Earth satellite overlays on the navigation screen.  More details later.

Last week I went three days straight without touching my keyboard, except to log in first thing in the morning.  The rest of the time I used the new speech recognition capabilities built into Microsoft Vista, the next version of Windows.  I answered all of my emails, wrote a couple of documents, installed software and (get this) even edited my computer’s registry – all with my voice.  I can still probably type faster, but it would be a close race.

Finally, this weekend I wrote some code for the first time in a long time.  It was a lot of fun and it might even be useful.  More on that in another post.

Funny school automated system

Boulder’s local radio station, KBCO, played a great spoof of an automated school information line.  This one clearly works better as a touch-tone system.  If you know a teacher, you’d better send them this link quick because it is sure to make the rounds.

I don’t know how long KBCO maintains their archives but it’s working right now:  http://www.kbco.com/pages/bcomorningshow-what.html?feed=105556&article=687176

Some Gold Systems trivia:  We once leased space in a couple of buildings on Riverbend Road, and actually had the old "Studio C" space for a while.  Lots of great musicians have played there and KBCO has used the proceeds of their annual Studio C CD sales to donate over $500,000 to the Boulder County AIDS Project.

Gold Systems’ customer mentioned

Gold Systems’ very first customer on the Microsoft Speech Server was mentioned on Ken Circeo’s blog, and I’ve been slow to acknowledge the story.  He tells the story better than I could, but he doesn’t mention the customer’s name.  Since they allowed us to write a case study about their experience, I think it is OK for me to say who it is, plus I’m happy to give the company some publicity.  ServiceMagic is the customer, and they have a great business.  If you are looking for a contractor to do some work around the house, and you would like to actually get a call back, try ServiceMagic.  (A contractor came to my house today, and not only did he return my call but he WROTE STUFF DOWN.  He’s destined for success.)

If you want to read more about what we built and why it worked so well, our case study is here in a large PDF.  To everyone at ServiceMagic, thank you for being a great customer and I wish you continued great success!

Thank you, Ken, for mentioning this story, and I believe you SHOULD get that new Tablet PC.  The future of Speech Recognition technology practically depends on it!

Do we really need more best practices?

SpeechTEK 2006 started out on an interesting note this week, since the first keynote speaker was Paul English, the guy behind www.gethuman.com.  You probably have heard of him even if you aren’t in the industry – his website lists all the automated telephone systems that he can find and tells you what to press or say to get to a live person.

Surely this is the number one frustration for most people who encounter an automated system.  I’ve always advocated to our customers that they should make it easy for people to get out to a live person.  I’ve stood up on soapboxes at conferences for years and begged the people making these decisions to not assume their customers are stupid, because even before www.gethuman.com most people could figure out how to get to a human.  The easiest way, unless you were tied tightly into a relationship, was to just hang up and find a different company to do business with.

It turns out, thanks to Paul and others, that the industry is taking notice.  We’ll see if it takes, but Paul has suggested that there be a standard for letting people know how to escape out to a live human.  Microsoft and Nuance have already pledged support and I expect we will too very shortly.

I was invited by Opus Research to be on a panel discussion called CEOs Survey: No Smooth Seas Here at SpeechTEK 2006, the speech recognition industry’s event where vendors and customers come together to talk about and learn about speech recognition.  Sure enough, one of the discussions was about how the industry needs to collaborate and promote Best Practices.  My comment was something like, "I think we all know how to build great, or at least very good, speech recognition applications.  The problem is we’re not doing a good enough job convincing our customers to always implement the best practices."  You see, as developers of these systems we hate it when a system is held up in public as being unfriendly or hard to use.

I’ve written about this before, but I have renewed hope that we can keep making the caller’s experience better.  I encourage the analysts and people like Paul to keep giving our customers concrete evidence that it is better for business in the long term if they let us build easy-to-use systems for them.  I encourage the people who are buying these systems to add a little CST (a term I heard from Tim Walsh at Walsh Media).  CST stands for Common Sense Technology. Ask yourself, would you want to use your system?  If the answer is no, then your customers won’t either.  If you’re not sure how to fix it, [shameless plug for Gold Systems here] then call me.

Microsoft Speech Recognition and Unified Messaging

This is a longer post than usual – it’s about Microsoft’s latest speech recognition demo of Vista, Exchange 2007 Unified Messaging and my experience as a surprise guest in Microsoft’s keynote address at SpeechTEK 2006 in New York this week.

I’ve been using Vista, Microsoft’s next operating system to be released in 2007, for about four months.  I immediately tried the built-in dictation software and was blown away by how well it worked.  Out of the box, with NO training, it performed better than anything I’d ever experienced and the editing capabilities for the first time (for me at least) made voice control of the PC intuitive and workable.

So . . . I was surprised and disappointed for my industry when I saw the video that circulated last week of the demo crashing and burning right before the eyes of the financial community.  If you haven’t seen it, I’ll spare you the pain by not linking to it, but it was clear that something was very wrong.  My Vista builds were much older and I had experienced for myself recognition that was very different from what I saw in the video.

It turns out there was a bug in the audio subsystem that was introduced at the last minute, and killed just as quickly, but it did its damage by once again making people think that speech recognition is never going to work.

Now . . . what a difference a week makes! I was at SpeechTEK 2006 in New York this week and witnessed for myself the very same demo, and it worked PERFECTLY!  Microsoft even had the guts to joke about the previous failure, “taunting the demo gods” as one journalist put it, and still I expect they made a bunch of people (albeit industry people) believe that we have entered a new era for a technology that has been a long time coming.

I was not an uninterested bystander.   Richard Bray, who gave the keynote address on behalf of Microsoft and who runs their Speech Server group, invited me to demo Exchange 2007 Unified Messaging for Microsoft during his keynote address.  I’m pretty comfortable speaking to groups of people, but this was practically my entire industry and we were going to use live systems to do a live demo in real time.  No recorded demos – no net, just a telephone and a chance to either make a good impression or look like an idiot if I screwed up.  I knew the technology worked, because I’ve personally been live on Microsoft’s Exchange 2007 Unified Messaging product for about four months, but I also knew from experience that it would be easy to misspeak or have an AV problem that could hose up everything.  I had also heard that others had not had the best experience in the same room earlier in the event.

The keynote started with Rich talking about Microsoft’s investment and long history in speech recognition.  He then introduced Rob Chambers to do the Vista demo.  I admire Rob – he looked really cool and confident as he walked up to the stage. The dictation recognition was perfect and it understood everything Rob said.  He showed how easy it is to change and edit a document and then moved on to controlling the PC with just his voice.  He received several rounds of applause, especially when he changed his wallpaper from the standard Vista wallpaper to a photo of his young son without ever touching the keyboard.  (I’ll bet you’d have to think about how to do it even with the keyboard, which is what was cool about that part of the demo.  He said something like “How do I change the wallpaper?” and Vista walked him through it, all with only his voice.  You’ve got to see it to believe it.  I hope when THIS video makes the rounds that it is half as popular.) When he finished his demo of Vista speech recognition, my first thought was to high-five him for doing such a great demo.

My second thought, as Rob walked off the stage and Rich began to introduce me, was “Holy crap, I’m up and if I screw up, I’M going to be the guy in next week’s video making the rounds.”  Rich put me at ease with a surprise: he started this part of the presentation with a photo of my FJ Cruiser project and asked if I really had installed an Xbox 360 in it.  I said something like “I’m not sure which is more embarrassing, that I have installed an Xbox in the FJ or that I’ve had to admit that I’m from Boulder, Colorado and I own an SUV.”

I then jumped into the demo and showed how, in addition to email, I can now access my voice mail and faxes via Outlook and Exchange Server 2007 with Unified Messaging.  For the demo, I used Outlook Web Access, which allows me to access email via a web browser.  We listened to a voice mail from Clint Patterson (another jokester) who suggested that “I needed more cowbell” in the demo.

By the way, this was a demo, but it was my real inbox and our live Exchange Server 2007 back in Boulder.  No one behind a curtain and nothing faked and it will be delivered as part of Exchange 2007.  The idea of Unified Messaging has been around for years but it has typically been an integration of a legacy voicemail system, an email server and an add-on to an email client.  With Microsoft’s approach, Exchange 2007 IS your voice mail and while pricing has not been publicly announced, the phrase “Radical Economics” has been tossed around by the analysts.  It means that I now have only one place to go for my office voice mail, my cell phone voice mail, my email, my faxes (I do still get a few faxes and they are usually something I don’t want sitting out on a public fax machine) AND I have only one login.  My IT people love it because they don’t have to manage separate systems and separate directories.  I know some of our customers are justifying the upgrade to Exchange 12 with the savings in maintenance charges on their legacy voice mail systems.

Back to the demo – After listening to the voice mail, I walked across the stage to a plain old telephone and dialed my number at Gold Systems.  I logged into Outlook Voice Access and was given the options of listening to my voicemail, listening to my email (and I can respond by voice to anyone but Brad, who hates voicemail), calling anyone at Gold Systems or (I really like this) anyone in my personal contacts.  Finally, I can do some very interesting things with my calendar which is what I showed next.

I said “Calendar” into the receiver and the system replied with something like “You have a meeting in progress entitled SpeechTEK Keynote address with Richard Bray from 8:30 AM to 10:00 AM.”  It offered some options but I interrupted and said “Next.”  It said, “Your next meeting is entitled ‘Breakfast with Clint Patterson’” and again I interrupted and said “Cancel the meeting.”  I was asked to confirm that I really wanted to ditch Clint for breakfast, which I did, and I accepted the offer to send a voice message along with the cancellation notice.  I responded “Clint, I have no idea what More Cowbell means and I told you they wouldn’t get the joke.  I’m going to have to skip breakfast as I’m still on stage here with Rich.”  I pressed a button on the phone to indicate I was done talking (maybe I could have just stopped talking?  I don’t know, I haven’t tried it) and I said “Send it with Priority”.  The meeting was canceled, Clint was sent an email with my voice note and my calendar in Outlook was updated.

For the next meeting on my schedule, I interrupted the system again and said into the phone, “I’ll be 30 minutes late”.  I know a LOT of people who could use this feature!  (I’m fine with you being late occasionally but let me know, OK?  Now you have no excuse if you are on Exchange 12 with UM.)  This time the system sent out meeting notifications saying that I was running 30 minutes late.  I hung up the phone, showed everyone how my calendar had updated automatically, made a few last points and I was done.  Whew!  I was the only non-Microsoftee on stage and I was grateful for the chance to be a part of the keynote.  The other demos went perfectly too and for the rest of the show I felt elated to have been a part of it all.

The big announcement at the keynote is that Microsoft is merging what has been known as Speech Server 2007 into what was formerly known as Live Communications Server to create the Microsoft Office Communications Server 2007.  I’ll write about THAT in another post, but it’s big news and is going to create a lot of opportunity in the industry.

I know I sound like I’m drunk on Microsoft Kool-Aid, and I am a little tipsy from it, but this really is big news and I think it will be good for the industry.

To everyone but my competitors, even most of Microsoft’s competitors, this is going to be good for business because it is going to extend speech recognition throughout the enterprise.  The world of communications in general is going to grow and change in fundamental ways, and a lot of people will benefit from Microsoft’s massive investment in this world. 

To Gold Systems’ competitors specifically:  Pay no attention to this and keep doing what you are doing. No one is ever going to trust their voice mail, phone calls or important business to a PC.  After all, when was the last time you rebooted your mainframe?  Just keep repeating that and maybe this will all go away.  But I doubt it.

Shouting doesn’t help

Stephen Potter tells a funny story about a person shouting at a speech recognition system, and why that generally doesn’t help with recognition issues.  For the people reading this who are not in the business, I’ll explain a bit.

When you encounter a telephone speech recognition system and it doesn’t understand you, there is a good chance that what you said was recorded and saved in a database.  Reports are generated that (if looked at) will tell the designer where people seem to be having problems, and then the designer can go back and listen to individual recordings of people actually trying and failing to be understood.  Often we hear the person coughing, or talking to the dog or something like that.  Sometimes they just misunderstood the prompt and said something that we weren’t expecting, and that can be a help in improving the prompt or changing the system to recognize what was actually said.
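To make that a little more concrete, here is a minimal Python sketch of the kind of report I’m describing.  The log format, the prompt names and the function name are all made up for this example; a real system stores a lot more, including the recordings themselves.

```python
from collections import defaultdict

def failure_rates(log):
    """Return {prompt name: fraction of attempts that were not recognized}."""
    attempts = defaultdict(int)
    failures = defaultdict(int)
    for prompt, recognized in log:
        attempts[prompt] += 1
        if not recognized:
            failures[prompt] += 1
    return {prompt: failures[prompt] / attempts[prompt] for prompt in attempts}

# Hypothetical log entries: (prompt name, was the caller understood?)
sample_log = [
    ("main_menu", True),
    ("main_menu", False),
    ("main_menu", False),
    ("account_number", True),
    ("account_number", True),
    ("account_number", False),
    ("account_number", True),
]

rates = failure_rates(sample_log)

# List the worst prompts first, so the designer knows where to start listening.
for prompt, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{prompt}: {rate:.0%} not recognized")
```

The prompts at the top of a list like this are the ones whose recordings are worth listening to first.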

When I started in this business, everything was touch-tone, but you could still get clues about the application by "service observing the call".  People often talked to the touch-tone systems but by that point they usually were not being very nice.  I DO NOT WANT TO PRESS STAR TO HEAR THE MENU AGAIN, I WANT TO TALK TO A #@!@ PERSON YOU STUPID MACHINE! was a pretty common phrase.

When speech recognition capabilities first appeared commercially, their recognition ability was limited to recognizing the numbers zero through nine and the words yes, no, and oh.  A few years later we had the ability to build very limited vocabularies ourselves of words like Sales, Service and Operator.  We soon learned to include synonyms for words like Operator, so that if a person said agent or human we could still understand them and transfer them to a person.
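For the curious, the synonym idea can be sketched in a few lines of Python.  This is only an illustration: the table and the function name are invented for this post, and a real recognizer works from a grammar rather than a simple lookup on text.

```python
# Hypothetical vocabulary: each word a caller might say maps to one destination.
SYNONYMS = {
    "operator": "operator",
    "agent": "operator",
    "human": "operator",
    "sales": "sales",
    "service": "service",
}

def route(utterance):
    """Return the destination for a recognized word, or None if it is unknown."""
    word = utterance.strip().lower()
    return SYNONYMS.get(word)

print(route("Agent"))   # routed the same way as "operator"
print(route("Sales"))
print(route("banana"))  # not in the vocabulary, so nothing is returned
```

Adding a new synonym is just one more line in the table, which is why it pays to listen to what callers actually say.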

I believe there was a market for a vocabulary of all the words and phrases that a person might say when they are frustrated with talking to a machine, but sadly we never built that product.  So to paraphrase Stephen, yelling and cursing at the machine may make you feel better but it won’t help get you to a person any faster.

The world dealing with Speech

We’ve had a speech recognition auto-attendant at Gold Systems for about five years, I guess.  While you can still call people directly and you can even talk to a very nice person if you prefer, a lot of people use our V-Dialer to reach us.  Most of us have put the phone number in our email signatures, so my signature says something like Call V-Dialer and say "Terry Gold."

Yesterday I was walking by the area where FedEx drops off packages, and since I’m always on the lookout for new gadgets showing up, I was particularly interested in a server-size box.  The label made me laugh out loud, because the package was addressed to "Say Jerry Loui".  I guess a human somewhere had a recognition problem.  I’ll be curious to see if Jerry now starts getting junk mail addressed to "Say Jerry Loui".

[Image: package label addressed to "Say Jerry Loui"]

Who owns good design?

Marshall Harrison, one of the guys behind gotspeech.net, commented on my Good Speech – Bad Speech post last night.  In my post I described a particularly ugly prompt I’d seen in a speech recognition application and called on people to write better applications that keep the customer’s needs in mind at all times.

Marshall has a soapbox similar to mine and one day we’re going to drag them down to the Pearl Street Mall and preach together about the need to create applications that are easier to use and that (dare I say it) People Will Love.  Marshall points out (correctly) that bad application design is often not the developer’s fault.  The reason I’m doing a whole post on a comment is that a light bulb just went on for me.  The design of speech recognition telephony applications is often widely separated from the development of the application.

By the time the developer is assigned to a project, the user interface design has often been fought over for months by people who don’t have user interface design experience.  They know what they want (usually) but then they make the mistake of drawing out the design on the whiteboard, a napkin or in Visio.  Visio is the worst, because it makes the design look official and unchangeable.  For some reason, people who wouldn’t think of trying to tell you how to design a complex database application or desktop user interface feel very comfortable specifying the voice user interface for you.

So what do we do?

  • I think that regardless of what role you play in speech application development, you have to realize that you are signing your name and staking your reputation on how well the application ultimately works.  If you see bad design, speak up and propose a better alternative.
  • Try to get invited to the party earlier in the process.  What starts out as "we need to figure out what we want before we get the developers involved" doesn’t work here so well.  The sooner we get involved the better the application will be.  Yes, it means you may have to attend some business meetings, but trust me, learning about business can be good for a techie’s career.
  • Find some good examples that can be backed up with concrete numbers to help make your case for good design.  Business people respond to hard numbers about saving money and improving customer satisfaction.  We built an application a few years ago for a customer who had a goal of automating 10% of the calls.  The application ended up automating 51% of the calls and customer satisfaction and CSR satisfaction actually improved.  Do you think the people funding the project listened a little closer to our silly design recommendations once they realized the impact they were having?  You bet they did!

To sum it up, we’re all responsible for good design, and if we can do a better job of it this industry will grow even faster. 

Good Speech – Bad Speech

In January I said that Speech Recognition was in the trough of disillusionment.  I predict that by the middle of next year, though, we’ll be viewing the technology very differently.  Moore’s Law and clever programming are advancing the science of speech recognition very quickly, I believe.  People are going to be surprised at how fast this technology is going to improve.

However – no amount of technology will make up for bad application design.  I saw what may be the ugliest prompt ever a few weeks ago.  "Press or say eleven . . ."  I assume there were ten other choices before this prompt and I’m not positive that this was the last prompt.

Please, people – I know it is fun to create these applications, but take some time to think about the poor user.  They don’t want more choices, they just want to get the information they need and then get off of the phone as quickly as possible.  And for goodness sake, give them the option to speak to a person if they can’t figure out how to get what they need from your application.  They’re going to find a way out eventually and there is no point in making them mad as they try to figure out the secret back door.

I’ve been on this soapbox before, but I’m not putting it away until the customer is treated like the important person they are instead of like some rat to be herded through a maze.