Earlier this afternoon, I participated in a connected car panel at SpeechTEK 2012, hosted by our friend Mazin Gilbert from AT&T. The other panelists included Greg Bielby of VoltDelta, Thomas Schalk of Agero, and Hakan Kostepen of Panasonic.
Even though Mazin did a fantastic job, not every panelist had a chance to answer every question. I was itching to answer some, so here are my responses to the questions that I didn't get to answer, or where I feel I could have provided a more complete response.
Have speech technologies matured to the point where they can be used robustly in the car? The general answer to this question from the panel was yes, but I think the real answer is a qualified yes. The technologies exist, but they often aren't applied, or they need auto-specific adaptations to handle in-cabin noise and other issues. Natural language recognition was often cited as a driving technology, but a missing piece of the puzzle is hybrid recognition. I don't mean pushing recognition wholesale to the cloud, like Siri does. I mean a true split of the recognition effort, where each part does what it's best at. Put the front half of acoustic processing in the vehicle to clean up the audio and convert the waveform to frequency-domain data, then send that data to the cloud-based server. The cloud server can then parse and interpret the data, and send back the result.
Hybrid speech rec solves three problems at once: better audio signals (the car can clean up the audio in ways specific to the in-cabin environment), lower cost (frequency-domain data can be far more compact than raw audio, so you pay less for data transfer), and better responsiveness (the server can start working on the utterance while it is still coming in, instead of waiting for the whole utterance to finish before starting).
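To make the in-vehicle half of that split a little more concrete, here is a minimal sketch of a front end that chops audio into short frames and converts each one to frequency-domain data as it arrives. Everything in it is an assumption for illustration: the function names, the frame sizes, and the choice of a Hann window with a plain DFT are mine, not taken from any shipping recognizer, and the noise cleanup and network transport are left out entirely.

```python
import cmath
import math
from typing import Iterator, List


def split_into_frames(samples: List[float], frame_len: int, hop: int) -> Iterator[List[float]]:
    """Yield overlapping fixed-length frames from the sample buffer."""
    for start in range(0, len(samples) - frame_len + 1, hop):
        yield samples[start:start + frame_len]


def magnitude_spectrum(frame: List[float]) -> List[float]:
    """Hann-window one frame and return DFT magnitudes for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    # Naive O(N^2) DFT, fine for a sketch; a real front end would use an FFT.
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def in_car_front_end(samples: List[float], frame_len: int = 64,
                     hop: int = 32) -> Iterator[List[float]]:
    """Hypothetical in-vehicle stage: emit one spectrum per frame.

    Because this is a generator, each spectrum could be sent to the
    server as soon as it is computed, before the utterance has ended.
    """
    for frame in split_into_frames(samples, frame_len, hop):
        yield magnitude_spectrum(frame)


# Usage: a 1 kHz test tone sampled at 8 kHz stands in for microphone audio.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(256)]
for spectrum in in_car_front_end(tone):
    loudest_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

The generator shape is the point here: the car hands over one frame of frequency data at a time, which is what would let a server begin work while the driver is still talking.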
Is driver distraction a major business driver, or is it the "Siri effect"? Currently, the car industry seems to use driver distraction as a reason to push a lot of features into speech. Many of those uses are gimmicky. Personally, I don't care if I can set my climate control system with voice — why would I when I can simply turn a dial? I once had someone ask me about the feasibility of adding voice recognition commands for rolling down the windows. I asked him, "Yes, but wouldn't people just push the window button?"
We shouldn’t implement speech commands just because we can. They may have contributed to excitement in the early adopter crowd, but we're beyond that now. Mind you, there are some seriously useful ways to use voice. For instance, any time you need to pick from a huge number of choices, voice recognition is the natural way to go. Calling contacts ("Call Sarah Potter"), entering destinations ("Go to 3121 South Park Street"), or picking music ("Play Audioslave") are all much easier than using an HMI to enter the same information, and safer to boot. It just has to work consistently and accurately.
Will car makers see more speech moving to the cloud, or will it be a hybrid of cloud and embedded? I disagree with the majority of the panel on this one, and, I think, the majority of people in the industry. Most auto people believe a hybrid between embedded and cloud allows the best of both worlds — good recognition and updatability when connected, and consistent reliability when not. My colleague Andrew Poliak also champions this view with a memorable catch phrase: Zombie Apocalypse. That is, you still want the system to work, albeit partially, when the infrastructure isn't available.
But if you ask me, everyone is missing the point: theirs is a technology-centric point of view. Everyday customers are notoriously harsh judges of a technology: if it doesn't work well, it gets rejected out of hand. Good cloud solutions beat an embedded solution hands-down; they just need some improvements (see my point about hybrid recognition above). Once a customer experiences a good solution, they will become frustrated with one that performs poorly. In my opinion, it's better not to offer the service at all than to try a graceful degradation of capability, because most customers won't understand or care. Spend the effort instead on making sure you always have an acceptable cloud connection, either through multiple redundant mechanisms or a powerful car-mounted antenna, and you'll be better off. Even when the car knows some data that the cloud doesn't (like a mobile's contact list or music selection), there's no need to handle that on the embedded side. The cloud recognition server is powerful enough to not require the data set a priori. And I think we can predict an eventual migration of phone data to cloud-based data (or cloud-synchronized data) that makes the car's knowledge either easily transferable or less relevant.
Who makes money, and how, from voice-enabled agents or voice services? This was one of the best questions of the panel, because nobody really knows the exact model, but everybody agreed that customer tolerance is very low. The most likely candidate is ad-based revenue. This doesn't mean reading ads aloud to the driver, but rather, positively influencing search results for either active or temporary situation-based points of interest (POIs). Depending on how valuable the service is to the driver, there will still be an option for service-based payments and high-value apps.
Standards and building mobile apps — will it come? You need standards if you want to build an app platform that will promote application creation and adoption. That's what we're doing with the QNX CAR 2 application platform — creating a way for someone other than the car companies to join the ecosystem and to deploy their apps to the car in a controlled way. But don't forget, you need a standard way to deploy apps for the cloud half of the recognition, too.
To close, let me share two photos. One was taken outside the Marriott Marquis, the hotel hosting the conference just off of Times Square in NYC. The other is from our PR agency, Breakaway Communications. What do they have in common? Wooden water towers. Sorry, I couldn't help myself; I just love those things. They just look so quaint in a city full of glass and brick.