Keynote Speaker
Jaime Teevan, PhD
Chief Scientist and Technical Fellow - Microsoft
Jaime Teevan is Chief Scientist and Technical Fellow at Microsoft, where she is responsible for driving research-backed innovation in the company's core products. Jaime is an advocate for finding smarter ways for people to make the most of their time and believes in the positive impact that breaks and recovery have on productivity. She leads Microsoft's future of work initiative, which brings researchers from Microsoft, LinkedIn, and GitHub together to study how the pandemic has changed the way people work. Previously she was Technical Advisor to CEO Satya Nadella and led the Productivity team at Microsoft Research. Jaime has received numerous awards for her research, including the Technology Review TR35, Borg Early Career, Karen Sparck Jones, and SIGIR Test of Time awards. She holds a Ph.D. in AI from MIT, a B.S. from Yale, and is affiliate faculty at the University of Washington.
How Work Changing Changed My Work
We’re in the middle of the most significant change in work practices that we’re likely to see in any of our lifetimes. For the past several millennia, space has been the primary technology people have used to get things done. The coming Hybrid Work Era, however, will be shaped by digital technology. In this talk, I will give an overview of what research tells us about emerging work practices following the rapid move to remote work in March 2020, and I’ll talk about how my role in capturing this research caused me to rethink the work I do. http://aka.ms/nfw
10:30 - 11:55 Breakout Sessions
Cristy Stone
Research Lead, Emerging Technologies - Avanade
Cristy leads research for the Emerging Technologies team at Avanade, with a focus on people systems and how they interact with technology. She has expertise across innovation and several areas of new technology including immersive experiences and conversational AI.
Diana Wolfe
Principal Applied Researcher for Emerging Technologies - Avanade
Diana Wolfe is a doctoral candidate at Seattle Pacific University for Industrial-Organizational Psychology. She has leveraged her understanding of psychology and data sciences to inform her research with Avanade on the subject of emerging technologies. She is the founding member of several social justice-based research collectives: ethicaXmachina and The Social Justice League. Her areas of interest are psychological safety, digital ethics, decolonizing data sciences, and transformational leadership.
The play’s the thing: Using performing arts to create compelling data visualizations
Session - Cristy Stone & Diana Wolfe
When presenting findings, data scientists engage in a creative process of converting statistical outcomes into a cohesive story. A narrative-based representation of the data can contextualize information, simplify complex messages, and leave a lasting impression on audiences. To accomplish this, data scientists should look to the performing arts to improve data visualization processes. In this discussion, we will describe how play structure and improvisation methodologies can be applied when deciding how to present data. Play structure prompts considerations such as: (1) what should be different for the reader after seeing your visualization, (2) what information do you want them to know, and (3) what action do you want them to take? This is done through the process of building tension, achieving a narrative climax, giving the audience catharsis, and finally reaching the denouement. Improvisation builds on that structure and asks: (1) would more detail make the picture/message clearer, OR (2) do I need to give more information to progress the reader toward the outcome? These two states of improv, called advance and color, balance description with action in a narrative. Through examples of narrative storytelling, audience members will develop a foundational knowledge of how to apply these processes in their own data visualizations.
Candice Bradley
Instructor & Lead - Northwestern University
Candice Bradley is a quantitative anthropologist teaching data science at Northwestern University’s School of Professional Studies. She received her PhD from UC Irvine in social science, followed by a Fulbright fellowship to study fertility decline in Kenya. Her research interests include gender, population, social networks, and the history of science. She lives in Redmond, WA.
Visualizing Diversity
Session - Candice Bradley
Face it: When we learned to visualize data, we turned to such eminent visualizers as Edward Tufte, Stephen Few, and Andy Kirk. And so we should, because they are among the most important contemporary scholars of data visualization. Nevertheless, their perspectives are limited. Where are the data images of people of color, women, indigenous peoples, or any other underrepresented group? Does data visualization, as we know it, focus primarily on the interests of business and government, often failing to touch on the trio of diversity: race, class, and gender? This talk introduces stellar examples of data visualization from past and contemporary scholars of diverse backgrounds and discusses how we can diversify and thus re-imagine our own data and data visualizations going forward.
Carolina Johnson
Systems Impact Data Scientist - King County Dept of Community and Human Services
Carolina is a data scientist committed to developing ethical public sector data capacity. She has been with King County for over four years, working to support data integration, creative problem-focused data uses, and skill development in a growing team of analysts and evaluators. Before joining King County, she completed a PhD in Political Science at UW.
Integrating Administrative Data to Better Measure Homelessness
Session - Carolina Johnson
The Puget Sound region has been experiencing an ongoing homelessness crisis for years. Developing effective solutions and understanding how well they are working requires high-quality and accurate data describing the scale of the problem in our region. However, commonly used methods to estimate the size of the population of people experiencing homelessness in our local region are known to underestimate the number of unhoused people. Complex challenges such as resolving homelessness regionally require the ability to understand the interdependencies between often siloed service systems. Since 2018, the King County Department of Community and Human Services, Public Health - Seattle & King County, and King County Information Technology have invested in building out data infrastructure to link records from across multiple client-level administrative data systems in order to provide cross-system views of programs, services, and community needs. This new data resource became available as the COVID-19 pandemic began in 2020 and has already helped the County to better prioritize scarce resources, plan for pandemic response, and identify gaps between systems serving our most vulnerable neighbors. This new integrated data resource has allowed for a substantial improvement in our ability to better estimate the number of unhoused members of our community. Looking across systems, King County has estimated that there are over 7,000 people being served by social service systems in our region who experienced homelessness at some point in 2020 but who were not identified or counted in the data systems usually used to estimate the number of unhoused people in our area.
This talk will combine a discussion of the real-world data governance challenges of integrating complex government administrative data, lessons learned from multi-year efforts to bring data analysts, policymakers, and IT professionals together in ongoing collaboration, and an outline of the pragmatic technical approach implemented to build a sustainable integration solution within government. In addition, attendees may gain some new knowledge about an important policy issue impacting the Puget Sound region.
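The core of the cross-system counting idea can be illustrated with a rough pandas sketch. All table names, columns, and counts below are invented for illustration and are not King County's actual schema; the sketch assumes an upstream record-linkage step has already resolved each person to a common identifier.

```python
import pandas as pd

# Hypothetical extracts from two siloed service systems, already linked
# to a shared person-level identifier by an upstream linkage step.
shelter_system = pd.DataFrame({"person_id": [1, 2, 3], "service": "shelter"})
health_system = pd.DataFrame({"person_id": [3, 4, 5], "service": "crisis care"})

# Union the systems and count distinct people: the cross-system view
# surfaces individuals that any single system would miss.
combined = pd.concat([shelter_system, health_system], ignore_index=True)
n_people = combined["person_id"].nunique()
print(n_people)  # 5 distinct people, vs. 3 counted in either system alone
```

The deduplicated union is what lets an integrated resource count people served anywhere in the network, rather than summing per-system totals that double-count shared clients.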
Christina Heaton
GIS/Data Analyst - Floyd|Snider
Christy Heaton is a GIS/Data Analyst at Floyd|Snider, an environmental consulting firm in Seattle, Washington. She has been a mapping professional and Python advocate for over 10 years with experience working in the public and private sectors and education. As a co-organizer of PyLadies Seattle, she strives to assist her local Python community and support underrepresented groups in technology.
From Dirty to Clean: Addressing Environmental Contamination with Python
Session - Christina Heaton
Environmental consultants work with a variety of industries to provide insight into environmental concerns. Cleanup levels for chemicals, set by local and national agencies, must be achieved to protect the health of humans and the environment. Data management and visualization are key to helping scientists and engineers make informed decisions. As a widely used language, Python is a great resource with libraries such as pandas to process tabular data, Matplotlib and Seaborn to create charts and graphs, and GeoPandas to create geographic charts and process geospatial data. Jupyter Notebooks provide the perfect environment to write scripts, view outputs, and share methods. Case studies will be presented using these tools to demonstrate the life cycle of cleanup projects, discuss data gaps, identify areas that have chemical concentrations exceeding cleanup levels, apply remedies, and track the efficacy of remediations. Attendees will leave the talk with an understanding of environmental consulting, environmental cleanup levels, and how Python can be used to visualize environmental data to support informed decision-making.
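As a small taste of the pandas portion of that workflow, the sketch below flags sample locations exceeding cleanup levels. The chemicals, concentrations, and cleanup levels are invented for illustration, not taken from any real project.

```python
import pandas as pd

# Hypothetical field samples and agency cleanup levels (values invented).
samples = pd.DataFrame({
    "location": ["MW-1", "MW-2", "MW-3"],
    "chemical": ["arsenic", "arsenic", "lead"],
    "concentration_ug_L": [12.0, 4.0, 20.0],
})
cleanup_levels = pd.DataFrame({
    "chemical": ["arsenic", "lead"],
    "cleanup_level_ug_L": [5.0, 15.0],
})

# Join each sample to its cleanup level and flag exceedances --
# the kind of tabular step a Jupyter notebook makes easy to share.
merged = samples.merge(cleanup_levels, on="chemical")
merged["exceeds"] = merged["concentration_ug_L"] > merged["cleanup_level_ug_L"]
print(merged.loc[merged["exceeds"], ["location", "chemical"]])
```

From here, the exceedance table could feed Matplotlib/Seaborn charts or, with coordinates attached, a GeoPandas map of the affected area.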
Frederike Dubeau
Manager, Advanced Analytics - Logic20/20
Frederike is a Manager of Advanced Analytics at Logic20/20, a business technology consulting firm. Prior to joining Logic20/20 in 2021, Frederike spent 8 years at PACCAR where she helped grow their dedicated Data Science teams across multiple divisions of the company. Frederike has her bachelor’s in Business Administration with a focus in Supply Chain Management as well as her master’s degree in Predictive Analytics from Northwestern University. In her spare time, Frederike enjoys hiking and endurance sports such as running and participating in triathlons.
Harnessing the power of data science to reduce wildfire risk
Session - Frederike Dubeau
In 2020, California wildfires burned more than 4 million acres of land, destroyed more than 10,000 structures, and caused over $10 billion in damage. These record-setting numbers emphasize the need for risk mitigation—and utility companies are looking to data scientists for help. Wondering how data science can fight fires? Teams can analyze field data to identify risks and initiate maintenance before disaster strikes. By improving data quality and building cutting-edge machine learning models, our team has created a reusable structure for risk reduction and cost optimization. In this talk, we'll discuss our work with a large utility company, including the statistical analysis of their current risk mitigation initiatives. We'll also outline how the team built a machine learning model to prioritize the company’s future efforts. Attendees will learn:
Primary contributors to wildfire risk
Details from the statistical analysis used
How the machine learning model was built
Outcomes of the project
How to scale this solution for the future
Victoria Hunt
Data Scientist - Breakthrough Energy
Victoria Hunt, PhD is a data scientist for Breakthrough Energy. In this role, she currently supports the Catalyst program, which aims to accelerate the deployment of clean energy technologies. She is keenly interested in climate policy and in supporting climate action through data visualization and data storytelling. Victoria’s passion for policy is also reflected in her pursuits outside of her role at Breakthrough Energy; she is currently a city council member for the city of Issaquah and, in this role, serves on several regional boards and commissions.
Modeling Impacts of Investments in Emerging Climate Technologies
Session - Victoria Hunt
The Emerging Climate Technologies (ECT) Framework is an analytical framework that estimates the forward-looking impacts of large-scale investments in emerging climate technologies. The framework can be used to visualize and understand the role of investment in ECT in achieving a zero-carbon economy. Ultimately, by quantifying the beneficial impacts of investments in terms of emissions avoided, the framework can provide an incentive, driving investment into technologies that will have the greatest impacts. I will explain how we use a model of technology diffusion to first create a baseline forecast for a given technology, and then shift that baseline under a scenario entailing early investment. I will describe our forecasting methodology, as well as our impact analytics and cost decline modeling. To illustrate our methodology, I will use a case study in which I evaluate the impacts of early investment in solar PV (photovoltaic).
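The baseline-versus-early-investment comparison can be sketched with a generic logistic technology-diffusion curve. This is a stand-in for the idea only, not the actual ECT Framework model; the capacity, growth rate, and midpoint years below are all invented.

```python
import numpy as np

def logistic_diffusion(years, capacity, growth_rate, midpoint_year):
    """Cumulative deployment along an S-shaped adoption curve."""
    return capacity / (1.0 + np.exp(-growth_rate * (years - midpoint_year)))

years = np.arange(2020, 2051)

# Baseline forecast vs. a scenario where early investment pulls the
# adoption midpoint five years earlier (illustrative parameters).
baseline = logistic_diffusion(years, capacity=1000.0, growth_rate=0.3,
                              midpoint_year=2040)
accelerated = logistic_diffusion(years, capacity=1000.0, growth_rate=0.3,
                                 midpoint_year=2035)

# Impact of early investment: extra deployment over the baseline, which
# a framework like this could then translate into emissions avoided.
extra_deployment = accelerated - baseline
print(round(float(extra_deployment.sum()), 1))
```

Shifting the midpoint earlier raises the curve at every year, so the scenario's cumulative deployment always sits on or above the baseline.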
Nivii Kalavakonda
PhD Student - Electrical and Computer Engineering - University of Washington
Niveditha Kalavakonda is a Ph.D. student in the Electrical and Computer Engineering department at the University of Washington. She is a part of the BioRobotics Lab, working with Prof. Blake Hannaford in the field of surgical robotics. Her research interests are broadly in Human-Robot Interaction and Computer Vision. She is also a part of the Science, Technology and Society Studies (STSS) program at UW, working on Tech Policy research for robotics, advised by Prof. Ryan Calo.
Role of Computer Vision in Revolutionizing Computer-Assisted Surgery
Session - Nivii Kalavakonda
Computer-assisted surgery (CAS) has evolved with advances in technology to augment a surgeon’s ability to administer surgical procedures. It helps decrease the invasiveness of surgical procedures and improve their safety and accuracy. This talk will provide an overview of the work in computer vision and surgical robotics that the author and her colleagues have done in CAS at the BioRobotics Lab at the University of Washington.
Kaitlyn Petronglo
Advanced Analytics Consultant - Logic20/20
Kaitlyn Petronglo is a Senior Consultant at Logic20/20 where she helps clients maximize their investment in machine learning and advanced analytics. Kaitlyn has over eight years of experience as a project manager, scrum leader, and data analytics consultant. She is passionate about using data to solve critical problems and enjoys coaching high-velocity teams using agile techniques. Kaitlyn is a certified Project Management Professional (PMP) and Certified ScrumMaster (CSM). She also holds a bachelor’s in English Literature from The Catholic University of America and a certificate in machine learning methods from the University of California San Diego.
Data science that delivers: Applying agile methodology in exploratory teams
Session - Kaitlyn Petronglo
Exploration is vital to your data science team’s success, but constant circling without delivering can quickly turn a strategic investment into a budgetary drain. How can you reap the benefits of Agile while managing the complexities and ambiguities of data science work? Applying Agile methodologies helps data science teams refocus on business impact—the real reason a data science team is employed. This requires a shift in mindset towards delivering data science products to solve critical business challenges, enabling more relevant deliverables and increased productivity. Logic20/20 Manager and Scrum Master Kaitlyn Petronglo will discuss the framework she developed to guide data science teams from the exploratory phase to model development and deployment. Speaking from experience, Kaitlyn will outline the advantages of Agile methodologies for data science and discuss how you can build flexibility into scrum processes to best serve dynamic data science teams. Attendees will learn:
1. What an Agile approach to data science looks like
2. Common barriers to adoption
3. Benefits of adopting an Agile approach
4. Top Agile frameworks being used today
Lisa Cohen
Director of Data Science - Twitter
Lisa Cohen leads an organization of data scientists, which helps develop the Twitter product through data-driven insights. Her org is responsible for identifying causal indicators, informing product strategy, driving learnings through experimentation, and advancing Twitter’s machine learning models. Prior to Twitter, Lisa led data science for Microsoft cloud services, focused on Azure customer growth. She also helped ship multiple versions of Visual Studio. Lisa holds bachelor’s and master’s degrees from Harvard in Applied Mathematics. Outside of work, she's an avid softball player and enjoys spending time in nature with her family. You can follow Lisa on LinkedIn and Medium.
Tips for Effective Data Science in the Enterprise
Session - Lisa Cohen
Data science is an evolving field that requires a diverse skill set. From career advice to a data science project workflow, this talk is full of practical tips that you can apply immediately to your job. In particular, we’ll look at how data science is performed and organized within enterprises, provide guidance for navigating organizational structures, and suggest approaches for implementing data lifecycles. Throughout the talk, Lisa provides advice on how to grow a career in enterprise data and analytics, based on her 17-year career in the industry. Slides: https://www2.slideshare.net/lisafeig/tips-for-effective-data-science-in-the-enterprise
Panel of Women Leaders in Data Science
WORKSHOP - Lisa Cohen
Data Science is a broad discipline, involving multiple fields and roles. Looking at how data science is performed across the industry, there are many common challenges and solutions - as well as some unique ones too. Come hear from a panel of women leaders in data science at various companies, on frequently asked questions - and ask some questions of your own! They'll cover topics such as choosing a role in data science, how to organize data science organizations, effective stakeholder management, planning & OKRs, hosting ML models in production, recommended training resources & communities, and learnings from their own career journeys.
Shweta Singhal
Senior Software Engineer - Amazon
Shweta Singhal is a senior software engineer at Amazon who works on developing fraud detection models to make Amazon a secure place for its customers. As of June 2021, she is also working on understanding a specific domain and researching new ways to understand bad actors. She completed her Master’s in computer science with a thesis from the University of Utah in May 2017. Shweta started out in the implicit authentication space, where she learned the value of privacy and security. She has been a keen advocate of ethically and morally right practices within her organization at Amazon. She currently runs the mentoring circle with Data Circles, expanding the group's knowledge of fundamentals. In her spare time, Shweta enjoys the outdoors and travel. She can be found either hiking, biking, or cozying up at a nice waterfront brunch place (or cafe) on Sundays.
Word embeddings journey and today's Transformers
Session - Shweta Singhal
In this talk, I would like to discuss with the group the journey of word embeddings in NLU, leading up to the current state-of-the-art architecture: the Transformer. The talk considers the requirements and gaps that led to the development of the Transformer, and focuses mainly on its architectural and internal components. I will briefly name the top-performing NLU model architectures and let the group explore them on their own. This talk will be beneficial for anyone looking to enter NLU and/or wanting to understand Transformers for working with textual data. Word embedding has long been a crucial technique for incorporating textual information into learning. Making sense of words, sentences, or even paragraphs comes very naturally to human beings via language; teaching computers to do the same is a gigantic task. Hence, in this talk, we will draw a parallel between human learning and machine learning to showcase the growth of word embedding techniques in the ML community, with word embeddings as the starting point.
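The starting point of that journey can be made concrete with a toy sketch. The vectors below are invented for illustration: a word embedding maps each word to a dense vector, and geometric closeness (here, cosine similarity) stands in for semantic similarity, the idea that contextual Transformer representations later refined.

```python
import numpy as np

# Invented 3-dimensional embeddings; real models use hundreds of
# dimensions learned from large corpora.
embeddings = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.82, 0.15]),
    "apple": np.array([0.10, 0.05, 0.90]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
print(sim_royal > sim_fruit)  # True: related words sit closer together
```

Static embeddings like this assign one vector per word regardless of context; Transformers replace them with attention-weighted representations that change with the surrounding sentence.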
1:30 - 2:55 Breakout Sessions
Apurvaa Subramaniam
Senior Data Scientist - Instacart
Apurvaa is a senior data scientist on the Ads team at Instacart. Prior to this, she worked at Amazon for ~4 years on Product, Operations and Strategy teams on a variety of data science/analytics projects such as experiment design, predictive modeling and causal inference. Outside of work, Apurvaa loves traveling, exploring the great outdoors of the Pacific Northwest, cooking and reading.
Data Literacy in the age of misinformation
Session - Apurvaa Subramaniam
IBM estimates that in 2020 at least 1.7 MB of data were created every second for every person on earth—the same amount of data needed to store an 850-page book—and this is only expected to grow exponentially. Internet growth statistics from Statista show that 4.66 billion people were using the internet as of January 2021, close to 60% of the world's population. Today we have more access to data and information than ever before in history. Yet studies show that trust in science among the general population is declining. This has been especially pronounced since the start of the COVID-19 pandemic: misinformation is rampant. In this talk, I will discuss data literacy, i.e., the ability to consume, interpret, and understand data to better make sense of everything around us. I will give you guidelines and tips to spot misinformation, and red flags to watch out for. I will walk you through a simple framework which you can use to evaluate information presented in the media, understand its limitations, and identify potential misinterpretations of data. More specifically:
Using a sample dataset, show how the same data can be used to create conflicting narratives by using misleading data visualizations. I will give you some tips to spot deceptive data visualizations using real-world examples.
Discuss examples of data being taken out of context/ cherry-picking of data/spurious correlations.
Introduce unconscious (or sometimes conscious) biases in research studies: information bias (p-hacking) and selection bias.
Briefly explain Simpson’s Paradox and confounding.
Resources for further reading.
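The Simpson's Paradox item above can be made concrete with a small sketch. The counts are invented, patterned on the classic kidney-stone treatment example: treatment A has the higher recovery rate within each severity group, yet B looks better in the aggregate because the groups are unevenly mixed across treatments.

```python
import pandas as pd

# Illustrative counts only: recoveries out of total patients,
# split by treatment and case severity.
data = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "severity":  ["mild", "severe", "mild", "severe"],
    "recovered": [81, 192, 234, 55],
    "total":     [87, 263, 270, 80],
})
data["rate"] = data["recovered"] / data["total"]

# Aggregating across severity reverses the ranking: confounding by
# severity makes the pooled comparison misleading.
overall = data.groupby("treatment")[["recovered", "total"]].sum()
overall["rate"] = overall["recovered"] / overall["total"]

print(data[["treatment", "severity", "rate"]].round(2))
print(overall["rate"].round(2))
```

The lesson for data literacy: always ask whether a headline comparison has been pooled across a confounding variable before trusting the overall number.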
Chhaya Choudhary
Senior Data Scientist - Zscaler Inc.
Chhaya Choudhary is a Senior Data Scientist at Zscaler where she works on developing machine learning algorithms for Zscaler’s Zero-Trust Workload Segmentation (ZWS). Before joining Zscaler, Chhaya worked at Infoblox where her research focused on using machine learning and deep learning to solve challenging data problems involving malware detection, URL & phishing classification as well as device identification. She has a master’s degree in computer science with a specialization in Data Science from the University of Washington Tacoma. She has published several research papers and spoken at conferences such as IEEE BigData, USENIX Security, SSCC, and WiDS @ UW Tacoma about her work, and is passionate about promoting women in tech.
Laura T. da Rocha
Data Scientist - Infoblox
Laura Rocha is a Data Scientist at Infoblox, where she focuses on statistical learning for networking and security solutions in the cloud. She has a bachelor's degree in statistics from the University of Brasilia and is currently studying for a master's in computer science. Laura loves interacting with people new to the industry to help them succeed in their transition. In her free time, she enjoys crafting and crocheting gifts.
The Adventures of Reproducibility in Machine Learning: A Cyber Security Use Case
Session - Chhaya Choudhary & Laura T da Rocha
Creating generic classifiers with reproducible results that hold across different datasets is difficult. Because classifiers are highly dependent on the underlying statistics of the data they were trained on, they are easy to overfit and hence perform worse on new, unseen data. This challenge is rarely discussed in academic courses, but it is often faced by data scientists early in their careers. In this talk, we present a use case that applies machine learning algorithms to malware detection in Portable Executable (PE) files. We’ll explore the challenges of reproducing the results of classifiers in real-world applications, share our findings, and discuss improvements to the classifier-building process for better generalization.
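The generalization gap described above can be demonstrated with a toy sketch. The data here is synthetic (not the PE-file features from the talk): a threshold "classifier" tuned to one dataset's statistics looks strong in-sample and degrades on data drawn from a shifted distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training set: a single score feature where "malicious" samples (1)
# cluster high and "benign" samples (0) cluster low.
X_train = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(4.0, 1.0, 500)])
y_train = np.concatenate([np.zeros(500), np.ones(500)])

# "New" dataset: same concept, but the benign population has drifted up.
X_new = np.concatenate([rng.normal(1.5, 1.0, 500), rng.normal(4.0, 1.0, 500)])
y_new = y_train.copy()

# A decision threshold fit to the training data's statistics.
threshold = X_train.mean()

def accuracy(X, y, t):
    return float(((X > t).astype(float) == y).mean())

train_acc = accuracy(X_train, y_train, threshold)
new_acc = accuracy(X_new, y_new, threshold)
print(train_acc > new_acc)  # accuracy drops on the drifted dataset
```

Real classifiers overfit in subtler ways, but the mechanism is the same: the decision boundary encodes the training distribution, and a new dataset with different statistics exposes it.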
Sharon Xia
Principal PM Manager - Microsoft
Sharon Xia is a Principal PM Manager of the Microsoft Azure Security Data Science team. She leads a PM team that applies data science techniques to security to help customers combat cyberattacks. Sharon has more than 15 years of experience leading cybersecurity vision and strategy in mission-critical products. She has a great track record for delivering ML features in security products successfully. Sharon is passionate about fostering a diverse and inclusive culture in her team and in the community, especially in helping women succeed in STEM and data science fields.
Detecting Emerging Cyber Threats Using Machine Learning
Session - Sharon Xia
Companies are struggling to keep up with rapidly evolving attack techniques. Traditional rule-based detections and manual responses cannot handle the scale of data, the ever-changing attack techniques, and the AI-driven machine-speed attacks. Using machine learning for security is not only a market trend but also a necessity to fight against sophisticated cyberattacks. This presentation will show you how we use machine learning algorithms to detect advanced multistage cyberattacks early to prevent and mitigate damage to businesses and people’s lives.
Louisa van Zeeland, PhD
Research Engineer
Louisa has a Ph.D. in Electrical Engineering with a focus in signal processing from the University of California, San Diego. She works with marine scientists to explore the benefits of adding machine learning and machine perception to their workflows.
Demystifying Acoustic Data for Dolphin Identification
Session - Louisa van Zeeland, PhD
When we talk about artificial intelligence (AI) and machine learning (ML), we often visualize autonomous vehicles, computer vision tasks, or natural language processing. Yet few people use acoustic data as a medium for ML, and even fewer use marine acoustics. What is causing this slow adoption of marine acoustic data in the ML community? There are many possible explanations. Acoustic data often requires experts to label training data, and we cannot easily visualize acoustic data without additional preprocessing steps that can at times require more than applying the Fourier transform. Perhaps it is because humans are a visually oriented species, more inclined to view and understand our surroundings through visual representations. In contrast, sound travels thousands of kilometers underwater, so marine mammals rely heavily on sound to understand their surroundings, in which visibility is limited. For the past two years, I have been working with experts who have devoted decades to studying the social behavior and communication of cetaceans, such as bottlenose dolphins. Bottlenose dolphins develop distinctive signature whistles that are used to recognize familiar conspecifics, both to retain or re-establish group cohesion and to mediate group encounters at sea. The audio signals generated by dolphins are key to our understanding of how they live and communicate. Applying ML to process underwater acoustic data has achieved good results with TensorFlow’s built-in models. For example, applying MobileNetV2 to classify individual dolphins using their signature whistles achieved a validation accuracy of 97.3%. This performance is comparable to human annotators’ accuracy but requires far less time spent manually processing the data. Based on the initial success of this project, I have started exploring beyond marine mammal sounds and expanding our work to characterize the soundscape of the Salish Sea.
In this talk, I will walk you through how we process audio signals and share a data pipeline I developed when I first worked with auditory data collected from patients with Parkinson’s disease. This setup has since served as a starting point for my acoustic ML projects, and it will equip you to start experimenting with acoustic data. It is my hope that you will experience this fascinating world of underwater acoustics and be equipped to contribute to open-source projects, perhaps even automating portions of workflows for scientists who lack the time or funding to manually sort and annotate the massive amounts of acoustic data. I am interested in hearing how acoustic analysis fits your needs. I will answer any questions as best I can and explore algorithms with you afterward. It would be an honor to make this journey with you.
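The preprocessing step the talk alludes to, going from a waveform to something an image model like MobileNetV2 can consume, can be sketched minimally. This is not the speaker's exact pipeline: the signal below is a synthetic tone standing in for a recording, and the frame and hop sizes are arbitrary choices.

```python
import numpy as np

sample_rate = 8000
t = np.arange(0, 1.0, 1.0 / sample_rate)
# Synthetic "whistle": a pure 1 kHz tone in place of real audio.
signal = np.sin(2 * np.pi * 1000 * t)

def spectrogram(x, frame_size=256, hop=128):
    """Short-time Fourier transform magnitudes: frames x frequency bins."""
    window = np.hanning(frame_size)
    frames = [
        np.abs(np.fft.rfft(window * x[i:i + frame_size]))
        for i in range(0, len(x) - frame_size + 1, hop)
    ]
    return np.array(frames)

spec = spectrogram(signal)

# Frequency of a bin = bin_index * sample_rate / frame_size, so the
# loudest bin should sit near 1 kHz for this tone.
peak_bin = int(spec.mean(axis=0).argmax())
print(peak_bin * sample_rate / 256)  # ~1000 Hz
```

The resulting 2-D array of time frames by frequency bins is effectively an image, which is why off-the-shelf vision architectures transfer so readily to acoustic classification.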
Alexandra Ralevski, PhD
Data Scientist - Institute for Systems Biology
Alexandra Ralevski is a Data Scientist at the Institute for Systems Biology where she develops risk models to better predict disease transitions for a variety of conditions, including diabetes and autoimmune diseases. Before joining ISB, she worked on the NASA PeTaL project to develop an AI system to mine the scientific literature for nature-inspired design. She was previously the Science Director at the Biomimicry Institute, where she scouted biomimicry technologies and developed an AI tool that allowed users to organize, analyze, and interpret data from scientific papers. She is a biology and biomimicry expert and was previously a postdoctoral associate at Yale University engaged in translational research at the intersection of plant biology and neuroscience.
Nature-Inspired Design at NASA: Building an AI tool to uncover nature's secrets
Session - Alexandra Ralevski, PhD
Nature-inspired design, also known as biomimicry, is the practice of learning from and emulating nature to solve human design problems. It has helped drive human innovation for centuries, but much of this knowledge is buried deep in the scientific literature. The NASA Periodic Table of Life (PeTaL) project aims to build a search tool that uses machine learning to scan and catalog decades of scientific literature and highlight millions of design ideas from nature. This talk will discuss how we gathered our data and built our models, as well as the various challenges we encountered.
Gracie Ermi
Research Engineer
Gracie Ermi is a research engineer in Seattle, where she has 4+ years of experience building technology platforms to support wildlife research. She utilizes computer vision and acoustic deep learning techniques to save researchers’ time in interpreting and collecting data, so that they can work more efficiently. She’s worked on projects related to sharks, elephants, killer whales and bottlenose dolphins! In addition to tech and conservation, Gracie loves science communication, and she has shared her work and her STEM story with audiences around the world. Gracie holds both a B.S. and a M.S. in computer science from Western Washington University, and in her free time she enjoys knitting and hiking in the incredible Pacific Northwest.
Data Science for Wildlife
Session - Gracie Ermi
Wildlife conservation is a field that is ripe for technological innovations, and groups across the globe are applying data science to enable more efficient solutions to environmental challenges. Machine learning is increasingly being used to automatically count, identify, and observe animal populations on land and underwater using images, video, remote sensing, acoustic data, and more. Traditionally, gathering insights from this type of wildlife data has required countless hours of manual data processing and analysis by researchers, so machine learning techniques - if deployed thoughtfully - have the potential to provide massive time savings to the scientists who use their findings to influence policy and generally make our ecosystems healthier. Successfully employing ML for wildlife relies on close collaboration with the wildlife scientists themselves, whose expertise is critical to designing solutions that utilize data science to provide real benefits to environmental research. Join this session to learn more about 1) examples of projects using data science to aid wildlife conservation, 2) the biggest challenges in applying machine learning to these real-world, interdisciplinary, multi-faceted challenges, and 3) the wide variety of exciting opportunities for data scientists at any stage of their career to use their skills to protect our planet.
Victoria Dotta
AI Specialist - Tryolabs
Victoria Dotta is an Economist specializing in Data Science, with more than 5 years of experience transforming data into actionable insights that create business value. An avid learner, she also holds an MBA from IEEM Business School in Montevideo, Uruguay. As an AI Specialist at Tryolabs, she brings a rich professional background from working in both the retail and fintech industries. Victoria’s experience covers several projects related to pricing strategy, demand forecasting, business analytics, and digital transformation. Victoria also teaches Data Science at Universidad de Montevideo, which helps her stay current on the latest research in the field.
AI solutions to boost retail results: product matching & price optimization use-cases
Session - Victoria Dotta
This talk covers AI use cases in retail, including price optimization and product matching. It focuses on a model that finds the optimal price for each item while accounting for many factors: seasonality, macroeconomics, and competitor changes. It also discusses assortment gap analysis, which aims to discover opportunities for introducing new products or variants that can help a retailer boost its sales and margins. The project includes a product-matching capability that uses a product attribute extraction pipeline built on a combination of traditional methods, NLP, and computer vision.
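The core of a price-optimization model like the one described can be sketched in a few lines. This is a toy illustration under an assumed linear demand curve; a production system would fit demand from data and account for seasonality, macroeconomics, and competition:

```python
# Toy price optimization: pick the price that maximizes expected revenue
# under a hypothetical linear demand curve. Everything here (the demand
# parameters, the candidate grid) is invented for illustration.

def demand(price, base=100.0, slope=8.0):
    """Hypothetical linear demand: units sold at a given price."""
    return max(0.0, base - slope * price)

def best_price(candidates):
    """Grid-search the candidate prices for maximum expected revenue."""
    return max(candidates, key=lambda p: p * demand(p))

candidates = [0.25 * i for i in range(1, 49)]  # $0.25 .. $12.00
p = best_price(candidates)
print(p, p * demand(p))  # 6.25 312.5
```

In practice the demand function is itself a learned model, and the search must respect business constraints (margins, price-change limits, competitor prices), but the shape of the optimization step stays the same.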
Anisha Dubhashi
Data Scientist - Nordstrom
Anisha is a Data Scientist on Nordstrom's Merchandising Analytics team. She works on developing analytics products related to inventory forecasting, price optimization, and size curves. She holds a master's in analytics from Northwestern University and a bachelor's in mathematics and economics from UCLA.
Dana Lindquist, PhD
Sr Technical Program Manager - Nordstrom
Dana discovered data science about 4 years ago while working as a project manager at a company that had a great deal of data. Her career had trended away from the numerical methods in which she earned her PhD some years ago. To get more involved with data science she enrolled in the Metis Data Science Bootcamp, after which she worked as a data scientist for 3 years. She recently joined Nordstrom as a Sr. Technical Program Manager for a data science team, pulling together much of her past experience.
Jessica Marx
Data Scientist - Nordstrom
Jessica is a Data Scientist at Nordstrom where she builds analytical tools and software supporting Merchandising and Price; past work includes supporting Clickstream and Product. In addition to Nordycast, she's worked on Assortment Health/Optimization, Duplication Metrics, and Markdown Optimization. Through internal programs she teaches SQL and Python to stakeholders and junior team members. She has an undergraduate degree in Film from NYU (how she got from there to here is a long story). If it's sunny outside, she's probably roller skating.
Nordycast: A Flexible Forecasting Library
Session - Anisha Dubhashi, Dana Lindquist & Jessica Marx
There are many teams and countless individuals creating and using forecasting models at Nordstrom. This represents a huge duplication of effort, but until now there was no effective way to share these models and the code surrounding them. Enter Nordycast: a way for novices to use existing models and for experts to share the models they have created. Nordycast aggregates all the setup and visualization code needed when creating forecasting models and is focused specifically on forecasts using Nordstrom data. It is the company’s first venture into creating a shared library.

High-quality forecast modeling is hard to do and time-consuming. It starts with pulling together the relevant data, setting up target variables, adding lags and seasonality, filling in missing data, and creating a train-test split. Only when this data preparation work is done can the actual modeling start. Forecast modeling steps generally include hyperparameter tuning using time series cross-validation, creating a naïve baseline model for model evaluation, and aggregating multiple models to create an ensemble model. The final models and corresponding metrics need to be saved for future use, and the entire pipeline may need to run in parallel depending on how many models exist. Throughout this process, the features, model metrics, and output need to be visualized.

Nordycast simplifies this entire process by providing functionality to prepare the data, create multiple forecasting models, parallelize the pipeline, and visualize the results. Nordycast is also tailored to the type of forecasts and data used at Nordstrom. Since Nordycast is an open-source library (inside the company), data scientists can contribute new models and new functionality. Experienced data scientists will be able to add to the library of models, giving them a place to highlight and share their work.
Less experienced data analysts and data scientists will be able to quickly spin up a forecasting model. The beauty is that the repetitive work of creating a forecasting model is packaged in Nordycast, allowing someone creating a new model to focus on the model itself. To date, the response at Nordstrom has been very favorable. Users of Nordycast are particularly excited about its ease of use and the significant time savings, especially in model setup. Future contributors are excited about adding deep learning and additional time series forecasting methods to Nordycast’s model library. In our talk, we will review our process of building Nordycast, maintaining the open-source package, engaging users, and our plans for the future.
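The data-prep steps the abstract lists (lag features, a chronological train-test split, a naïve baseline) can be sketched with plain Python. Nordycast is internal to Nordstrom, so the function names below are hypothetical illustrations, not its actual API:

```python
# Hypothetical sketch of common forecasting prep: build lag features,
# split chronologically, and score a naive "predict the last value"
# baseline that real models must beat.

def add_lags(series, lags=(1, 7)):
    """Return (features, target) rows; each row holds lagged values."""
    rows = []
    start = max(lags)
    for t in range(start, len(series)):
        rows.append(([series[t - l] for l in lags], series[t]))
    return rows

def train_test_split(rows, test_size=3):
    """Chronological split: never shuffle time series data."""
    return rows[:-test_size], rows[-test_size:]

def naive_mae(test_rows):
    """Baseline: predict the lag-1 value; report mean absolute error."""
    errors = [abs(x[0] - y) for x, y in test_rows]  # x[0] is the lag-1 value
    return sum(errors) / len(errors)

sales = [10, 12, 11, 13, 15, 14, 16, 18, 17, 19, 21, 20]
rows = add_lags(sales)
train, test = train_test_split(rows)
print(len(train), len(test), naive_mae(test))
```

A shared library's value is precisely that this boilerplate (plus cross-validation, ensembling, and visualization) is written once and reused.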
Julie Albers
Senior Director, Data Science & Analytics - Nordstrom
After brief forays into music, art, and advertising, Julie finally settled on retail analytics and hasn’t looked back since. As a leader in Data Science at Nordstrom, she gets to focus on three passions – improving the customer experience, mentoring, and fashion. Prior to Nordstrom Julie supported analytics at URBN, Anthropologie, and QVC. She holds degrees in Advertising and Statistics.
Building a product that actually gets used
Session - Julie Albers
The value of our data products is determined as much by impact as it is by quality. One of the top drivers of employee satisfaction is knowing that our work makes a difference. In this talk, I will share some tips on how to ensure product adoption to improve business performance, increase the value of data science in your organization, and grow team morale.
Sarah Johnson
Community Engagement Manager - Coiled
Sarah is a Community Engagement Manager at Coiled, on the Dask engineering team. She transitioned into her current role after 7+ years in academia, first in neuroscience research at the Oregon Health and Science University, then in global health at the Institute for Health Metrics and Evaluation. After being on the user side of the equation, Sarah is happy to be in the open-source software development community, with a focus on continuing to grow the Dask community. In her free time, she can be found climbing rocks and backpacking in the PNW, while also teaching her Catahoula/Hound pup Copper to be a real dog and enjoy the outdoors. She holds a BA from Reed College and an MSc from the London School of Hygiene and Tropical Medicine.
Scaling Data Science in Python with Dask
WORKSHOP - Sarah Johnson
Dask is an open-source library for parallel and distributed computing in Python, which supports larger-than-memory computation, enabling data processing and modeling for datasets that don’t fit in RAM. Dask is used in a wide range of domains, from finance and retail to academia and life sciences. It is also leveraged internally by numerous special-purpose tools like XGBoost, RAPIDS, PyTorch, Prefect, Airflow, and more. It’s developed in ongoing collaboration with the PyData community, making it easy to learn, integrate, and operate. Because collections like Dask DataFrame mirror pandas, you can quickly get started on your hardware of choice, be it your laptop, a cloud service, or an HPC cluster.
In this hands-on workshop, we’ll show you how to scale data science workflows. If you are already familiar with pandas or scikit-learn, Dask syntax will be easy to follow. Attendees will come away with a basic understanding of parallel computing, how to use Dask for DataFrame manipulation and common machine learning algorithms, and how to interpret the Bokeh-powered diagnostic dashboard. Specifically, this workshop will cover: (1) the basic concepts of parallel computing, (2) using Dask DataFrame for efficient data loading and processing, (3) the application of common machine learning algorithms, and (4) how to interpret the live diagnostic dashboard.
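Dask's core model, partitioning data, computing partial results in parallel, and combining them, can be sketched with the standard library. This is a conceptual stand-in, not Dask's API; in Dask the same idea hides behind pandas-like calls such as `df.x.mean()`:

```python
from concurrent.futures import ThreadPoolExecutor

# Conceptual stand-in for Dask's partitioned execution: split the data
# into partitions, compute a partial result for each in parallel, then
# combine. Dask applies this model to DataFrames larger than RAM.

def partition(data, n_parts):
    size = -(-len(data) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_mean(chunk):
    return sum(chunk), len(chunk)  # (sum, count) so means combine exactly

data = list(range(1, 101))
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_mean, partition(data, 4)))

total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
print(total / count)  # 50.5
```

Note that each partial result carries a (sum, count) pair rather than a per-chunk mean: means of unequal chunks can't be averaged directly, and tree-combining algebraically safe partials is exactly how distributed aggregations stay correct.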
3:30 - 4:55 Breakout Sessions
Phoebe Liang
Senior Data Scientist - Trupanion
Phoebe Liang is a data scientist at Trupanion, a company that offers medical pet insurance for cats and dogs. She works primarily on building machine learning models to automate the adjudication of veterinary claims. Prior to Trupanion, she received her master’s degree in Information Science at the University of Michigan. She is passionate about applying data science to solve real business problems.
Automating Pet Insurance Claim Process with Machine Learning
Session - Phoebe Liang
Trupanion is a company offering medical pet insurance for cats and dogs. The data science team at Trupanion is responsible for automating the adjudication of veterinary claims.
Adjudicating a veterinary claim is a complicated process and involves several challenges. One of the biggest is getting structured data from invoices that arrive in various formats. To address this challenge, we have built our own application that allows veterinary hospitals to submit claim forms electronically, along with an integrated backend system that connects to our machine learning models.
The next biggest challenge is determining the correct payout amount for the pet owner. Analyzing and interpreting invoices, medical records, and the pet’s past claim history often requires a lot of subject matter knowledge and experience in the field. For example, we do not cover pre-existing conditions, and the diagnosis on the claim form can be as vague as “not feeling well”. As a result, we designed a two-stage framework that breaks down the human decision-making process into several steps and builds independent models that run in parallel. These models use natural language processing techniques and leverage features extracted not only from data about the pet on the invoice but also from prior claim data across all other pets, to discover connections among different visits. Finally, an aggregator model collects results from all the independent models and makes a holistic decision at the claim level.
In this talk, we will describe how we built independent models to predict the outcome of each step in the human thought process. We will then discuss how we aggregate results from these smaller models to make the final decision. Last but not least, we will talk about how we overcame the challenge of limited engineering resources by taking ownership of model deployment and selecting an MLOps platform that can create and manage model APIs easily.
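The two-stage framework described above, independent per-aspect models feeding an aggregator, might look roughly like this. All feature names, scores, and thresholds here are invented for illustration, not Trupanion's actual models:

```python
# Hypothetical two-stage adjudication sketch: stage one runs independent
# "expert" models that each score one aspect of a claim; stage two is an
# aggregator that combines the scores into a single decision.

def preexisting_condition_model(claim):
    """Score the risk that the condition predates coverage (invented rule)."""
    return 0.9 if claim.get("prior_same_diagnosis") else 0.1

def coverage_model(claim):
    """Score how likely the invoiced item is covered (invented rule)."""
    return 0.8 if claim.get("item_in_policy") else 0.2

def aggregator(scores, approve_threshold=0.5):
    """Stage two: combine independent scores into one claim-level decision."""
    approval = scores["coverage"] * (1 - scores["preexisting"])
    return "approve" if approval >= approve_threshold else "review"

claim = {"prior_same_diagnosis": False, "item_in_policy": True}
scores = {
    "preexisting": preexisting_condition_model(claim),
    "coverage": coverage_model(claim),
}
print(aggregator(scores))  # approve: 0.8 * 0.9 = 0.72
```

Because the stage-one models are independent, they can run in parallel and be retrained separately, while the aggregator owns the final, holistic decision, mirroring how a human adjudicator weighs separate considerations.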
Ty Yano
Sr Machine Learning Scientist / Technical Lead - Expedia Group
Ty is a Sr Machine Learning Scientist at Expedia Group.
Demythifying Word Embedding and Representation Learning
Session - Ty Yano
Word embeddings (such as those from BERT or GPT-3) are among the most versatile machine learning tools of recent years. Although using these tools at a surface level is quite easy, practitioners without an NLP background often struggle to go beyond a proof of concept. Part of the problem is the lack of a clear understanding of how these embeddings are learned. Embedding learning, often referred to as Representation Learning or Representation Induction, is not a typical (supervised) machine learning task. In the supervised learning paradigm, learning objectives are always explicit and clearly tied to some type of prediction (e.g., regression or classification). This is not the case with Representation Learning: a variety of prediction models, neural networks or otherwise, can be used for the purpose. Although this heterogeneity makes Representation Learning an exciting area of research, the sheer variety of neural architectures and their increasing complexity overwhelm many fledgling ML practitioners. In this talk, I will attempt to fill this gap by presenting a concise overview of the fundamental concepts behind Representation Learning. I will first cover some general ML ideas such as unsupervised learning, language modeling, and distributional semantics, as well as specific algorithms such as skip-grams and masked language models. The goal of this section is to give the audience an organic grasp of the basic approach, and mode of operation, common to many varieties of Representation Learning. In the second section, I will present some real-world use cases of word embeddings and Representation Learning, showing how some of these inspired applications are tied to (and motivated by) the core concepts.
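One of the specific algorithms mentioned, the skip-gram, can be illustrated by generating its training pairs: each word predicts the words inside a small context window around it. A minimal sketch:

```python
# Skip-gram training-pair generation: for every center word, emit
# (center, context) pairs for neighbors within the window. Real systems
# (e.g. word2vec) then learn embeddings by training a classifier on
# millions of such pairs; this only shows the pair-generation step.

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
for center, context in skipgram_pairs(tokens, window=1):
    print(center, "->", context)
```

The key idea the talk points at is visible even here: the "labels" are manufactured from raw text itself, which is why representation learning needs no human annotation.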
Ailin Fan, PhD
Senior Data Scientist - Shopify
Ailin is a senior data scientist at Shopify, where she builds data-informed products that enable merchants to start and grow their businesses on Shopify. She works on multiple areas of growth marketing and leverages data to pivot the marketing strategy and maximize the impact. Prior to joining Shopify, she worked at a start-up and a bank, where she delivered AI solutions for enterprise retailers and turned the firm’s massive data into useful insights. She holds a PhD in Statistics from North Carolina State University, specializing in developing novel statistical methodologies for personalized dynamic treatment regimes.
Data informed email nurture journeys
Session - Ailin Fan
At Shopify, our mission is to make commerce better for everyone. With over 2 million merchants using our product, we understand that starting a business is hard, and we dedicate a lot of our energy to mitigating those challenges and to empowering our merchants to succeed. In this session, we’ll talk about how we leverage data to deliver the most relevant messages to our customers via our email channels, and how we run experiments to improve that nurture journey. We’ll share effective lessons, ideas, and technical methodologies we learned along the way, and discuss how they are being utilized to further enhance and grow the Shopify platform.
Vaishali Khandelwal
Data and Applied Scientist - Microsoft
Vaishali has been a Data and Applied Scientist at Microsoft since 2017. She holds a Bachelor’s in Computer Science from the University of Illinois at Urbana-Champaign. Driven by her curiosity to dive deeper into data science, she is pursuing a part-time Master’s in Data Science from UC Berkeley, graduating in July 2022. She focuses on storytelling with data, leveraging insights to recognize and alleviate customer hassles, and running deep analytics to enhance the overall customer experience and achieve business goals. Her work spans a wide array of products, from the Start Menu and Notifications on Windows 10 to Account Management for Subscriptions, Devices, and Profile at Microsoft. She was the Vice President and Mentoring Chair of the Women in Computer Science organization back in undergrad and, more recently, has been involved in mentoring, enabling, and empowering women to build careers in data science. Outside work, she’s a true Pacific Northwesterner, spending weekends outdoors hiking, skiing, or skating. An avid traveler, she loves exploring new places, meeting people from around the world, and learning about new cultures.
How experimentation helped a small business grow during the pandemic!
Session - Vaishali Khandelwal
Using experimentation and causal inference, we helped a small restaurant grow its dine-in revenue and sales during the pandemic. Running a field experiment as part of a class project in the part-time Master’s program at Berkeley, we applied the principles and framework of experimentation to understand the effect of descriptive menu labels. Since the restaurant is a small business with primitive infrastructure, we set up the nuts and bolts of the experiment from scratch. From designing the changes that differentiate control from treatment, to manual randomization, to training the restaurant staff, we worked through each step carefully to run the field experiment correctly and establish causality. The steps included identifying the least popular dishes, brainstorming alternate names for dishes to block on different description types, identifying the outcome variables, calculating statistical power, and finally running the experiment. Once the data was ready after running the experiment over a two-week period, we measured its impact by comparing dish counts, sales, and money spent, controlling for day-of-week covariates, and by comparing overall sales before and during the experiment. Not only did this simple field experiment lift sales, it also helped the restaurant gather customer feedback and learn what works best for its audience! This is a simple example of the huge impact experimentation and insights can have on the growth of small businesses!
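The final comparison step can be sketched as a difference in means with a permutation test for significance. The numbers below are invented for illustration, not the restaurant's actual data:

```python
import random

# Hypothetical analysis sketch: compare mean spend between control
# (plain menu labels) and treatment (descriptive labels), then estimate
# a p-value by shuffling group labels (a permutation test).

def diff_in_means(treatment, control):
    return sum(treatment) / len(treatment) - sum(control) / len(control)

def permutation_p_value(treatment, control, n_iter=2000, seed=0):
    rng = random.Random(seed)
    observed = diff_in_means(treatment, control)
    pooled = treatment + control
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        t, c = pooled[:len(treatment)], pooled[len(treatment):]
        if abs(diff_in_means(t, c)) >= abs(observed):
            extreme += 1
    return extreme / n_iter

control = [18, 22, 19, 21, 20, 23, 17, 22]    # invented per-table spend
treatment = [24, 26, 23, 27, 25, 24, 26, 28]  # invented per-table spend
print(diff_in_means(treatment, control))
print(permutation_p_value(treatment, control))
```

A permutation test is attractive for a small field experiment like this because it makes no distributional assumptions; randomization itself justifies the inference.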
Shiyi Pickrell
SVP of Data Science and Analytics - Expedia Group
Shiyi leads the central data science and analytics function at Expedia Group. Throughout EG's journey to transform into a platform company, her team is building ML with scale, speed, and simplicity. She is an ex-Microsofter, ex-Amazonian. She holds a Ph.D. in neuroscience and an MBA.
Zoe Yang
Senior Director, Machine Learning Science - Expedia Group
Zoe Yang is a data scientist by training and has led cross-functional organizations delivering data science and machine learning capabilities in various industries. Zoe champions AI power, analytical rigor, customer-backed research, and Lean Startup principles to build disruptive, innovative AI products that have a meaningful business and social impact. Before working in the tech industry, Zoe worked in investment banking and management consulting. She leads with heart, humility, and authenticity. Zoe believes being a leader isn’t about knowing it all but about inspiring and empowering others to discover their greatest potential. She believes vulnerability is not a weakness but a strength for seeking understanding and connecting personally. Zoe believes an environment of trust, collaboration, and belonging is key to creativity and productivity. At Expedia Group, Zoe is truly thrilled to have the opportunity to build scalable, high-performing, and tremendously impactful AI products and services to power global travel for everyone, everywhere.
Personalization at Scale
Session - Shiyi Pickrell & Zoe Yang
We lead the central ML/AI organization (200 employees) at one of the top three largest travel companies. Our company has achieved personalization at scale through the following platform architecture, which avoids one-off "snowflake" customized solutions, and we would like to take this opportunity to share our practice and learnings with the WiDS community. The five key components are: 1) a general identity for each customer across our many subsidiaries and brands, covering who the customer is, general preferences, past purchases, and more; 2) customer journey context services that predict the customer's intent, interests, and needs at a particular moment or session throughout the inspire, shop, purchase, and post-purchase journey; 3) many state-of-the-art machine learning models that predict and recommend personalized messages, offers, products, and next actions based on the general profile and in-the-moment needs; 4) a plug-and-play intelligence service engine that allows engineers to plug any in-house or third-party ML model into the backend of an application; and 5) an in-house machine learning and data platform with feature stores that supports model training and inference in real time with a feedback loop. In the talk, we will walk the audience through each component, show how they function together to achieve personalization at scale, and give three real-world examples. In addition, we will zoom in on a couple of in-house models, e.g. the contextual engagement recommender model, customer-item vectors, the intent prediction model, and reinforcement learning.
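Component 4, the plug-and-play intelligence service, can be sketched as a simple model registry: applications request predictions by name without knowing which model serves them. All model names and rules below are hypothetical, invented purely to illustrate the pattern:

```python
# Hypothetical "plug and play" sketch: models register under a name, and
# applications call predict() by name, so any in-house or third-party
# model can be swapped in behind the same interface.

MODEL_REGISTRY = {}

def register(name):
    def wrap(fn):
        MODEL_REGISTRY[name] = fn
        return fn
    return wrap

@register("intent_prediction")
def intent_model(context):
    return "book" if context.get("searches_last_hour", 0) > 3 else "browse"

@register("next_best_offer")
def offer_model(context):
    return "loyalty_discount" if context.get("is_member") else "signup_promo"

def predict(model_name, context):
    """Applications call this; any registered model can serve the request."""
    return MODEL_REGISTRY[model_name](context)

print(predict("intent_prediction", {"searches_last_hour": 5}))  # book
```

The value of the pattern is decoupling: product engineers code against `predict()`, while data scientists upgrade or replace the models behind each name independently.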
Elizabeth Beebe
Data Scientist - Gusto
Elizabeth is a data scientist at Gusto, a cloud-based payroll startup, where she works on understanding how employers use Gusto to run their businesses. She has more than a decade of experience in data science, has spoken at Data Mishaps Night, loves mentoring, and spends too much time thinking about the how of data science.
Formal Metrics, or How I Learned to Stop Worrying and Love the YAML Spec
Session - Elizabeth Beebe
Formal Metrics are a way of specifying your data, and how you transform it, such that the result becomes a useful metric. While this seems simple enough, the idea is only now starting to take hold in the data science world, and the implications are incredibly broad: by defining a data source, a transformation on top of that source, and a few other bits of metadata, we can create a system that lets us glue together valid pieces of data to create metrics. Better yet, these specifications allow us to document our data in the same place we define it, share common definitions, and leverage them across a variety of use cases. All of this means it's easier for us to answer questions, easier for our non-technical partners to answer their own questions, easier for technical partners to work with our data, and easier to expand the ways we use our data. In this talk we'll go over what Formal Metrics are, build out a quick set of definitions using them, and see how much easier they can make the work we do as data scientists.
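A formal metric definition might look something like the following YAML. The field names and the example metric are hypothetical; implementations differ:

```yaml
# Hypothetical formal metric spec: a source, a transformation, and
# documentation metadata, all defined in one place.
metric:
  name: weekly_active_employers
  description: Distinct employers who ran payroll in the trailing 7 days
  source:
    table: warehouse.payroll_runs
    timestamp_column: run_at
  transformation:
    aggregation: count_distinct
    column: employer_id
    window: 7 days
  owners:
    - data-science@example.com
```

Because the definition is declarative, the same spec can drive SQL generation, dashboards, and documentation, which is what makes shared definitions possible.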
Cait Riggs
Data Science Consultant - Slalom
Cait is a data science consultant at Slalom in Seattle, where she works with clients to create solutions with their data, like building a model to capture trends in store feedback using NLP and statistics, and enabling data science teams to productionize those models with MLOps. She's had the opportunity to work with many different data science teams at different MLOps maturity levels and is excited to share learnings with the data community. Before pursuing data science in the Seattle area, she worked in geospatial analytics, building pipelines in ArcGIS to productionize environmental impact analyses and making maps for survey crews.
Deploying & Monitoring Models with MLOps
Session - Cait Riggs
Machine Learning Operations (MLOps) has quickly become an important part of data science teams as they mature from producing proof-of-concept models to deploying models in a production environment. How do we turn a proof-of-concept model, often captured in a manually run notebook or Python script, into a trustworthy, systematic pipeline capable of monitoring new data, running on a schedule, and more? In this talk, we'll go over MLOps as a lifecycle of processes and learn how to begin choosing an infrastructure stack for MLOps.
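One concrete monitoring step in such a pipeline is a data-drift check. A minimal sketch, with an invented threshold and toy data; production stacks wrap checks like this in a scheduled, alerting pipeline:

```python
import statistics

# Minimal data-drift check: flag when a feature's mean in new data moves
# too far from its training baseline, measured in standard deviations.
# The threshold and data here are invented for illustration.

def drift_alert(train_values, new_values, threshold=2.0):
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    shift = abs(statistics.mean(new_values) - mu) / sigma
    return shift > threshold

train = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]
print(drift_alert(train, [10, 11, 10, 9]))   # similar data: False
print(drift_alert(train, [18, 19, 17, 20]))  # shifted data: True
```

Real monitoring typically adds per-feature distribution tests, prediction-distribution checks, and latency/error tracking, but the pattern is the same: compare production data against a stored training baseline and alert on divergence.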
Shailvi Wakhlu
Sr. Director of Data - Strava
Shailvi is the Sr. Director of Data at Strava. As a data professional, former software engineer, and entrepreneur, she has built great products in companies such as Salesforce, Fitbit, and now Strava. Shailvi is passionate about creating equity in the tech world by providing self-advocacy tools for underrepresented groups.
Workplace Self-Advocacy for Women in DS
Session - Shailvi Wakhlu
Many women shy away from accepting praise or highlighting their professional achievements, to the detriment of their own career success. Women in DS face additional challenges, as the field has fewer mentors available. In this talk, I will highlight specific ways women can reframe negative perceptions of advocating for themselves and become comfortable sharing their wins effectively and authentically. Everyone deserves to be rewarded equitably for their amazing achievements!
Jacqueline Nolis, PhD
Chief Product Officer - Saturn Cloud
Dr. Jacqueline Nolis is a data science leader with 15 years of experience in running data science teams and projects at companies ranging from Airbnb to Boeing. She is the Chief Product Officer at Saturn Cloud where she helps design products for data scientists. Jacqueline has a PhD in Industrial Engineering and her academic research focused on optimization under uncertainty. For fun, Jacqueline likes to use data science for humor—like using deep learning to generate offensive license plates.
Getting started with neural networks in R
WORKSHOP - Jacqueline Nolis, PhD
Deep learning sounds complicated and difficult, but it’s really not! Thanks to packages like Keras and Torch, you can get started with only a few lines of R code (even on a GPU!). Once you understand the basic concepts, you will be able to use deep learning to make AI-generated humorous content! In this hands-on workshop, I'll give an introduction to deep learning by walking through how to make a recurrent neural network model that generates weird pet names like: Shurper, Tunkin Pike, and Jack Odins. By the end of this workshop, you'll have your very own neural network to take home. If you understand how to make a linear regression in R, you can understand how to create fun deep learning projects.