Voicesense makes an intriguing promise to its clients: give us someone’s voice, and we’ll tell you what they will do. The Israeli company uses real-time voice analysis during calls to evaluate whether someone is likely to default on a bank loan, buy a more expensive product, or be the best candidate for a job.
It’s one of a crop of companies looking for the personal insights contained in our speech. In recent years, researchers and startups have taken note of the rich trove of information that can be mined from voice, especially as the popularity of home assistants like Amazon’s Alexa makes consumers increasingly comfortable talking to their devices. The voice technology market is growing and is expected to reach $15.5 billion by 2029, according to a report by business analytics firm IdTechEx. “Almost everyone talks and there’s a plethora of devices that capture voice, whether it’s your phone or things like Alexa and Google Home,” says Satrajit Ghosh, a research scientist at MIT’s McGovern Institute for Brain Research who is interested in developing voice analysis for mental health purposes. “Voice has become a fairly ubiquitous stream across life.”
Voice is not only ubiquitous; it’s highly personal, hard to fake — think about the incredulity surrounding the falsely deep voice of former Theranos CEO Elizabeth Holmes — and present in some of our most intimate environments. People speak to Alexa (which has erroneously recorded conversations) in their homes, and digital voice assistants are increasingly used in hospitals. Voice journal apps like Maslo rely on the user speaking frankly about their issues. By now, many people know that tweets and Instagram posts are going to be monitored, but fewer think about our voices as yet another form of data that can tell us about ourselves and also give us away to others. All of this has led to exciting research about how this information can enrich our lives, as well as privacy concerns about how accurate such insights are and how they will be used.
The key to voice analysis research is not what someone says, but how they say it: the tones, the speed, the emphases, the pauses. The trick is machine learning. Take labeled samples from two groups — say, people with anxiety versus people without — and feed that data to an algorithm. The algorithm then learns to pick up the subtle speaking signs that might indicate whether someone is part of Group A or Group B, and it can do the same on new samples in the future.
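The two-group training loop described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the three features (pitch variability, speaking rate, pause length), their values, and the nearest-centroid method are all made up for the example, and none of the companies or researchers mentioned here has said this is how their systems work. Real systems extract far more features from actual audio and use more capable models.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the synthetic data is reproducible

def synth_samples(center, n):
    """Draw n noisy feature vectors around a group's (made-up) typical values."""
    return [[random.gauss(c, 0.1 * c) for c in center] for _ in range(n)]

# Hypothetical features per recording:
# [pitch variability, words per second, mean pause length in seconds]
group_a = synth_samples([20.0, 2.0, 0.9], 50)  # labeled samples from Group A
group_b = synth_samples([35.0, 3.0, 0.4], 50)  # labeled samples from Group B

def train(labeled):
    """Learn one average feature vector (centroid) per label, plus a
    per-feature scale so that no single feature dominates the distance."""
    all_rows = [row for rows in labeled.values() for row in rows]
    scale = [pstdev(col) for col in zip(*all_rows)]
    centroids = {
        label: [mean(col) for col in zip(*rows)]
        for label, rows in labeled.items()
    }
    return scale, centroids

def predict(model, features):
    """Assign a new sample to the label whose centroid is closest."""
    scale, centroids = model

    def distance(centroid):
        return sum(((f - c) / s) ** 2
                   for f, c, s in zip(features, centroid, scale))

    return min(centroids, key=lambda label: distance(centroids[label]))

model = train({"A": group_a, "B": group_b})
print(predict(model, [21.0, 2.1, 0.85]))  # a new, unlabeled sample
print(predict(model, [34.0, 2.9, 0.45]))  # another new sample
```

The important part is the shape of the process: labeled examples go in, a model comes out, and that model is then applied to voices it has never heard.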
The results can sometimes be counterintuitive, says Louis-Philippe Morency, a computer scientist at Carnegie Mellon University who built a project called SimSensei that can help detect depression using voice. In some early research that tried to match vocal features with the likelihood of attempting suicide again, Morency’s team found that people with a soft, breathy voice, not those with tense or angry voices, were more likely to reattempt, he says. That research is preliminary, though, and the links are usually not so simple. Typically, the giveaway is a complex set of features and speaking patterns that only algorithms can pick up on.
Still, researchers have already built algorithms that use the voice to help identify everything from Parkinson’s disease to post-traumatic stress disorder. For many, the greatest promise of this technology sits at the intersection of voice analysis and mental health and the hope of creating an easy way to monitor and help those at risk of relapse.
People with mental health conditions are monitored closely when they’re in the hospital, but “a lot of what happens with mental health conditions happens in one’s daily life,” says David Ahern, who directs the Digital Behavioral Health program at Brigham and Women’s Hospital. He says that outside of a supervised setting, daily life can wear people down slowly and subtly. In that kind of situation, a person once diagnosed with depression may not even realize that they’ve become depressed again. “These events occur when people are unconnected to any kind of health system. And if a condition gets worse to the point that somebody seeks care in an emergency room, to use a Midwest expression, the pony is already out of the barn,” Ahern says. “The idea of having a sensor in your pocket that could monitor relevant behavioral activities is pretty powerful conceptually. It could be an early-warning system.”
Ahern is a principal investigator on a clinical trial of CompanionMx, a mental health-monitoring system that launched back in December. (CompanionMx is currently only available for physicians and patients. Other startups, like Sonde Health and Ellipsis Health, have similar goals.) Patients record audio diaries using the app. The program analyzes those diaries along with metadata like call logs and location to determine how the patient scores on four factors — depressed mood, diminished interest, avoidance, and fatigue — and tracks change over time. This information, which is protected by the federal privacy law HIPAA, is shared with the patient and also presented in a dashboard to a doctor who wants to keep an eye on how their patient is doing.
The company has tested the product for seven years with more than 1,500 patients, according to CompanionMx chief executive Sub Datta. The product, which spun out of another voice analysis company called Cogito, has received funding from DARPA and the National Institute of Mental Health. Results published in the Journal of Medical Internet Research suggest that the technology can predict symptoms of depression and PTSD, though further validation is needed.
In pilot studies, 95 percent of patients have left audio diaries at least once a week, and the clinicians access the dashboard at least once a day, according to Datta. These numbers are promising, though Ahern points out that plenty of questions remain about which component is the most helpful. Is it the app itself? The feedback? The dashboard? A combination? Research is continuing, and other results haven’t been made public yet. CompanionMx plans to partner with health care organizations and is looking at opportunities with the Department of Veterans Affairs.
Meanwhile, services like Voicesense, CallMiner, RankMiner, and CompanionMx’s one-time parent company Cogito promise to use voice analytics in the business context. Most of the time, this means improving customer service engagement at call centers, but Voicesense has bigger dreams. “Today we’re able to generate a complete personality profile,” claims CEO Yoav Degani. His plans go beyond appeasing disgruntled customers. His company is interested in everything: loan default predictions, insurance claims predictions, revealing the investment style of customers, in-house candidate assessment for HR, assessing whether employees are likely to leave. “We’re not correct 100 percent of the time, but we are correct a very impressive percent of the time,” Degani says. “We can provide predictions about health behavior, working behavior, entertainment, so on and so forth.”
In one case study Degani shared, Voicesense tested its technology with a large European bank. The bank provided voice samples from a few thousand debtors. (The bank already knew who had and hadn’t defaulted on their loans.) Voicesense ran its algorithm on these samples and classified the recordings into low, medium, and high risk. In one such analysis, 6 percent of the predicted “low-risk” people defaulted, compared to 27 percent of the group Voicesense had deemed high risk. In another evaluation looking at the probability that temporary employees would leave, 13 percent of those the algorithm classified as “low-risk” left, compared to 39 percent of the high-risk group.
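One way to read those figures is as a ratio between the high-risk and low-risk groups. The short calculation below uses only the percentages quoted above; the function name is ours, and because the case study as described does not give group sizes or results for the medium-risk group, it says nothing about overall accuracy.

```python
def risk_ratio(high_pct, low_pct):
    """How many times more often the outcome occurred in the
    high-risk group than in the low-risk group."""
    return high_pct / low_pct

# Percentages as reported in the Voicesense case study
loan_default = risk_ratio(27, 6)      # defaulted on a loan
temp_attrition = risk_ratio(39, 13)   # temporary employee left

print(f"Loan default: {loan_default:.1f}x")
print(f"Temp attrition: {temp_attrition:.1f}x")
```

By this measure, people flagged as high risk defaulted 4.5 times as often as those flagged low risk, and temporary employees flagged as high risk left three times as often.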
These are all plausible applications, says Ghosh, the MIT researcher. Nothing jumps out as a red flag for him. But as with any predictive technology, it’s easy to overgeneralize if the analysis is not done well. “In general, until I see proof that something was validated on X number of people and this diversity of population, I would be very hard-pressed to take somebody’s claim for granted,” he says. “Voice characteristics can vary quite a bit unless you’ve sampled enough, which is why we stay away from making very strong claims.”
For his part, Degani says that the Voicesense speech-processing algorithm measures over 200 parameters every second and can be accurate in many different languages, including tonal languages like Mandarin. The program is still in the pilot stage, but the company is in touch with large banks and other investors, he says. “Everybody is fascinated by the potential of such technology.”
Customer service is one thing, but Robert D’Ovidio, a criminology professor at Drexel University, is concerned that some of the applications that Voicesense envisions could be discriminatory. Imagine calling up a mortgage company, he says, and they use your voice to determine that you’re at higher risk for heart disease, and then you’re deemed a higher risk because you might not be around for a long time. “I really think we’re going to have consumer protection legislation created to protect against the collection of these,” D’Ovidio adds.
Some consumer protections like this exist, points out Ryan Calo, a professor at the University of Washington School of Law. Voice is considered a biometric measure, and a few states, like Illinois, already have laws that guarantee biometric security. Calo adds that the problem of biases that correlate with sensitive categories like race or gender is endemic to machine learning techniques, whether those techniques are used for voice analysis or for screening resumes. But people feel viscerally upset when those machine learning methods are used for facial or voice recognition, in part because those characteristics are so personal. And while anti-discrimination laws do exist, many of the issues surrounding voice analysis run into broader questions of when it’s okay to use information and what constitutes discrimination, concepts that we as a society have not adequately grappled with.
“I hope that as we move forward, we recognize that this is just data, no matter what form it’s in, like a bunch of numbers typed in a spreadsheet or voiceprint that’s captured,” says D’Ovidio. At a minimum, he adds, we should be demanding that we be told when something like this is being used. “And I’d want to see movement afoot to regulation in terms of protecting consumers,” he says. “What happens when the algorithms are wrong?”