Meet a Data Scientist: Dr. Joyce Cahoon
WiDS Puget Sound and Data Circles are excited to present the next entry in our series, “Meet a Data Scientist!”
“Meet a Data Scientist” is dedicated to recognizing the amazing women powering the Puget Sound area’s data science community, spotlighting their journey into the field, their incredible accomplishments, and the weighty challenges that they faced along the way. This lies at the heart of WiDS Puget Sound and Data Circles’ mission of inspiring women to enter the data science field by showcasing its many incredible role models.
Do you know any marvelous women in data science? Send us a tip here!
“I’ve always been a data scientist,” says Joyce Cahoon, PhD, a Senior Data Scientist at Microsoft. “At the end of the day, I wanted to make decisions based on evidence, not just based on the last anecdote I’d heard.” Dr. Cahoon remembers, throughout her life, always wanting to see the larger data. After graduating from Duke University with bachelor’s degrees in biomedical engineering and economics, she worked for a year on Wall Street as an investment banker. The lifestyle on Wall Street was not a great fit for her and she elected to pursue other roles.
Entrepreneurship, a passion of hers from earlier in life when she’d participated in startup competitions, is where she turned her attention next. She joined a startup called SMSmart, which made an app that allowed users to access sites like Yelp, Yahoo Finance, and Google Maps using just text messages—without internet access. Her team applied to Y Combinator (the company that coaches and funds early-stage tech startups). They advanced quite far in the competition, and found they had a steady user base (surprisingly in the trucking and transportation industry), but ultimately they disbanded when they were not selected in the final rounds for funding. After that, she tried her hand as a bartender and other roles.
“When you’re a college student and high achiever, you want a straight path to success, but this was a period of soul searching. At some point, though, you look back and the dots connect—they make sense.”
She says that at the time, these different experiences felt like getting outside of herself. Then, in 2015 she applied to the Statistics PhD program at North Carolina State University, was accepted, and things began to coalesce.
As a graduate student, Dr. Cahoon held multiple internships. One particularly formative experience was her internship as a software developer at RStudio, which makes the open-source IDE for R. The experience was impactful because of the contributions she made to the open-source community. “I was able to give back,” she says. R is used heavily in academia, especially in the field of Statistics. To contribute to the development and evolution of the open-source tool felt meaningful to Dr. Cahoon. A mentor there, Max Kuhn, a PhD statistician and software engineer, inspired her with his approach to sharing his life’s work at no cost. “It’s great because it’s like accessing the contents of his mind,” Dr. Cahoon said in reference to his insightful and widely read book, Applied Predictive Modeling, which (along with other books of his) is available online for free. She asked him why he made it free, and his answer still resonates with her. “That’s one of the great things about life,” said Kuhn. “Give people a platform and see what they do with it.” After she left RStudio, Dr. Cahoon completed her PhD and started at Microsoft shortly after.
Today Dr. Cahoon works at Microsoft’s Gray Systems Lab (GSL), and is proud to be on a team that makes many of its products and materials available open-source too. GSL is an R&D team that develops and evaluates database systems technologies before bringing them to the Azure product line. She and her team currently work with large language models, figuring out how they can be applied across Microsoft. She says one of the data science problems her team is trying to solve stems from the fact that engineers approach their problems with a lot of experience and knowledge, which gives them intuition on, for example, how to configure a parameter.
“Because they have seen so much, they know intuitively how to tune it. But all of that knowledge and intuition is localized to one human.”
GSL, under the guidance of Azure CTO Raghu Ramakrishnan, believes there is a more robust, efficient way to operate, and her team is trying to make new models that can turn that individual intuition into organizational knowledge—a collective mindshare.
What excites Dr. Cahoon about data science now are the advances in natural language processing, especially large language models and prompt engineering.
“We’re close to entering a new era in data science.”
She says programs like OpenAI Playground, GPT-3, and GitHub Copilot, which take in a docstring or natural language command and convert it to code, are able to increase the productivity of desk workers and tech workers by 4x-5x. “If we can take what GitHub Copilot has done and get a co-pilot to apply to other things, that will allow us to augment ourselves so we can have more of our lives dedicated outside of work.” She is excited that the applications of this technology are far-reaching. Just recently, people have been able to type up research notes and use GPT-3 to write abstracts that they’ve submitted and had accepted by peer-reviewed journals. Assuaging the concerns of anyone who may worry about increased productivity leading to job loss in the tech industry, Dr. Cahoon says, “Think of it like a cell phone. Yes, it can do a lot, but it still needs a human to operate it.” She says that, with the infrastructure and tools becoming available to us, you don’t have to be amazing at code; now, the focus of a career can be on the bleeding-edge science.
And how does Dr. Cahoon stay up to date on the newest discoveries in data science? She says she uses a customized RSS feed, which includes news sites that interest her and Google Scholar links to researchers she follows. This way she is able to read the articles her favorite scholars are publishing right as they come out. “It’s my own curated news feed.” It’s also her way of seeing which startups are getting funded and where venture capitalists are putting their money. One other factor in her life that helps her keep up to speed is instructing others. For RStudio, she took rigorous instructor tests to be able to teach in the R community. Now at Microsoft, there is a conscious effort in her group to make sure people are teaching. Recently Dr. Cahoon taught Introductions to Statistics to the solutions architects at Microsoft who communicate with customers making major decisions that have a real impact on the day-to-day lives of millions of people across the globe.
“It’s great to give them statistical approaches so they don’t have to just go with their gut feeling,”
she says of the experience. She was also very active in the R-Ladies meetup group for about five years, but has since turned her attention to another great shift in her life: she’s become a mother. She says she loves opportunities to teach, ad hoc, because they keep her mind fresh and connected to important data science topics.
But the fact that data science is both a far-reaching field and an elusive term is not lost on Dr. Cahoon. In reality, the ambiguity of its origins and evolution is something she values. She says,
“Data science is sprinkled into everything we do, especially today where so much data is available. Everyone is a data scientist—if they’re thinking critically.”
In describing what she means by “data science,” she points to a 2015 article called “50 Years of Data Science” by Dr. David Donoho that was based on a presentation he gave at a workshop in Princeton, NJ that year. “I’ve read that article every year since it came out,” she says, “because it is timeless. Every time you read it, you get something new out of it.” What stands out to her is that the article describes three roles in academia where data science has its empirical roots: Statistics, Computer Science, and Mathematics. “Their claims to data science are all true; they are all applying their own rigorous proof.” She says she has yet to see one definition of data science that has stood the test of time. And like the various experiences in Dr. Cahoon’s own eclectic journey to her position as Senior Data Scientist at Microsoft’s GSL, the world of data science would not be what it is today without the diverse contributors who lay claim to it.
“Every field should claim part of it because we need these different areas and domain experts working together. We need this interdisciplinary approach to make it what it is. All of these actors and agents working together—it is because of them that we have a richer ecosystem in data science than just one field claiming it. Everyone deserves a seat at the table.”