Almost anywhere we turn, evidence of a data revolution abounds. That realization suffused the inaugural Women in Data Science Conference at Stanford. The daylong event brought leading academic researchers together with 400 attendees from 80 companies and 30 academic institutions to survey the promise and problems of this burgeoning field. "Solutions to the challenges of our future increasingly link back to data and data science," said Stanford Engineering Dean Persis Drell in her opening remarks. Finding those solutions, she said, will also require diversity of thought, approaches, and styles, and, ultimately, of teams. “We can't afford to live in a society that in subtle and not-so-subtle ways actually discourages half the population from careers in this exciting field," Drell said.
Sharing the opening stage with Drell was conference organizer Margot Gerritsen, associate professor of energy resources engineering at Stanford and director of the Institute for Computational and Mathematical Engineering. Gerritsen amplified Drell's remarks, saying: “Data science is a very rapidly growing field of increasing importance. So much research and business decisions are based on data. If we want to ask all of the right questions and analyze all aspects of a problem, we need diversity and multidisciplinary thinking.”
In talks on topics ranging from precision medicine to personalized entertainment, and from data-driven marketplace design to the challenges of cybersecurity, the speakers, all prominent women in the field, demonstrated how, as Intel's Diane Bryant put it, "data analytics is going to transform all industries."
Here are several insights that emerged from this daylong exploration of our unprecedented ability to harness the power of data.
In one of its most promising applications, data science is transforming the way we understand and treat disease, several of the speakers agreed.
"Today, cancer patients are treated with a one-size-fits-all standard care approach. In reality, we all have very unique genetic makeup. Diseases have their own unique genetic profiles," said Diane Bryant, senior vice president and general manager of Intel's Data Center Group. "Two people with the exact same advanced cancer will receive the exact same treatment, and one will live and one will die."
Network theory could hold the key to understanding why that is, suggested Jennifer Chayes, Distinguished Scientist and managing director at Microsoft Research. By using data-driven techniques to model gene regulatory networks, computational biologists can home in on proteins that weren't previously known to contribute to dysfunction and identify viable targets for drug therapies.
"We know that with cancer, we really want to personalize it. We give tamoxifen to two women — it works for one woman with breast cancer; it doesn't work for another woman with breast cancer. It's because these breast cancers are different diseases. They're just showing up in the same tissue."
Indeed, when Chayes' research team applied network modeling and machine-learning techniques to patient data from the Cancer Genome Atlas, a multi-institutional catalog of gene mutations, they uncovered instances of breast cancer that could be treatable with known drugs used for certain kinds of gastrointestinal tumors. "We're really excited about this kind of personalized treatment in cancer," Chayes said, "and we're starting to look for similar things in a big autism data set."
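The hub-finding intuition behind this network approach can be illustrated with a toy sketch. The gene names, edges, and numbers below are entirely invented, not drawn from the Cancer Genome Atlas work described above; the idea is simply that highly connected "hub" nodes in an interaction graph are natural candidates for closer study as drug targets.

```python
# Hypothetical gene-interaction network: each edge links two genes whose
# protein products co-regulate in the (made-up) data.
edges = [
    ("TP53", "MDM2"), ("TP53", "BRCA1"), ("TP53", "ATM"),
    ("BRCA1", "ATM"), ("BRCA1", "RAD51"), ("KRAS", "BRAF"),
]

# Build an adjacency map, then rank genes by degree centrality --
# the fraction of other genes each one directly interacts with.
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

n = len(neighbors)
centrality = {g: len(nbrs) / (n - 1) for g, nbrs in neighbors.items()}
hubs = sorted(centrality, key=centrality.get, reverse=True)
print(hubs[:2])  # the most connected genes in this toy graph
```

Real gene regulatory networks involve thousands of nodes and far richer statistics than degree counts, but the same ranking-by-connectivity idea scales up with tools built for the purpose.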
Data analytics is also providing new insights into Parkinson's disease. In collaboration with the Michael J. Fox Foundation, Intel deployed wearable devices to nearly a thousand people with Parkinson's. These sensors track 300 events per second — steps, hand tremors, sleep patterns. This data is being shared on Intel's open-source analytics platform with research institutes across the world, Bryant said. "We unleash then all the brilliant researchers that have had theories on Parkinson's disease — on the treatments and the disease progression — but they lacked the data and the insights to prove out those theories."
In ways familiar to many, Netflix has helped to raise the bar on personalization. Through deep, granular mining of viewers' tastes and viewing habits, the company serves up hyper-specific recommendation sets such as "Raunchy Cult Late-Night Comedies" and "Binge-Worthy TV Shows Featuring a Strong Female Lead." Netflix VP of Science and Algorithms Caitlin Smallwood estimates that members see over 50,000 distinct content groupings in a given month. And it's all thanks to the company's embrace of data-science techniques and commitment to experimentation.
"You might think that there's one main Netflix product, and then a little bit of testing on the side. That's totally not how it is," Smallwood said, adding that Netflix is not alone among today's Internet products where "you actually have thousands of versions of your product that are out there at once."
Developing, testing, and refining recommendation algorithms has been a continuous process since day one, Smallwood said. Each year the company tests around 500 different algorithms, each with multiple variations, weaving data-based experimentation into the core of its business practices.
Beyond ushering viewers to the next thing in their queues, Netflix also uses a data-driven mindset to inform decisions on which new shows to invest in. When thinking about original content, Smallwood explained, there is little data to consider. But by leveraging the data collected around members' viewing habits, data scientists can build taste-cluster and demand-prediction models to inform the human decision-making process of whether to greenlight a new show. In other words, analyzing House of Cards as if it were West Wing meets Breaking Bad provides a framework to estimate demand for a new show based on the preferences of subscribers who viewed older shows with related characteristics.
"It's a fascinating area," Smallwood said. "How do you merge data science with human decision-making? It's one of the harder areas in data science."
The companies that make Silicon Valley hum don't just have proximity in common, said Susan Athey, the Economics of Technology Professor at Stanford Graduate School of Business. At heart, Uber, TaskRabbit, Airbnb, and even Google are all marketplaces that bring together buyers and sellers. And the primary method embraced by many of those companies to improve their platforms and optimize profits, Athey said, is a siren song.
"The engines of innovation in all these big tech companies is A/B testing and experimentation," she said. To improve fast, Athey explained, these companies run thousands of experiments a year testing one version or algorithm against another, and go with whichever one shows better results. At that speed, there is little time for the type of rigorous analytics that can extract deeper insights and lead to greater long-term growth.
"We get so seduced by that certainty, by that clarity," she said, "that suddenly, when we want to start talking about long-term effects and feedback effects and so on, it sounds wishy-washy. It sounds unscientific relative to the beauty of the A/B testing platform. And that makes it very difficult to make any kinds of long-term decisions.”
Consider eBay's decision to eliminate seller-listing fees — pocket-change amounts that, on such a huge platform, quickly added up to hundreds of millions in revenue. "How do you, as a data scientist, decide I'm going to throw away 200 million dollars' worth of revenue on the theory that I'm going to get it back?" Athey asked.
In this case, the change ended up being highly profitable — by eliminating the fees, the company attracted more sellers, who in turn bought more when they came to check on their listings. For businesses that run such marketplaces, the key to using data to inform long-term improvements, Athey said, lies in targeting the right balance between experimentation and theory.
"If you don't have a good theoretical framework to build on and just ask your data to speak to you, you'll never possibly find what you're looking for," she said. "Your off-the-shelf experimentation techniques, your off-the-shelf prediction techniques do not do the job. You have to really modify the techniques for the economic scenarios that you're in. And that's exciting. That means that if you can figure it out, you have unique value you can bring and you can really make a big impact."
Cybersecurity is a particularly thorny problem because there are no fundamental constraints on the kinds of attacks that people can perpetrate, said Celeste Matarazzo, data scientist and cybersecurity researcher at Lawrence Livermore National Laboratory. In short, the opportunities for malfeasance are as limitless as a hacker's imagination. What's more, she pointed out, "It's dynamic. It changes in time. It's not something that once you find an answer, you found the answer for it. You found the answer for that nanosecond and it moves on."
The problem is compounded by the sheer volume of data that moves through online networks every day and must be managed and safeguarded in installations so vast that they're called data farms. "That's a lot of stuff to look at," Matarazzo said.
This scope and mutability make cybersecurity a ripe area for data scientists, who bring new rigor to a field previously considered more tradecraft than science. In addition to focusing on statistical and probabilistic methods, considering diverse approaches and perspectives — from different genders, socioeconomic backgrounds, and disciplines — will be key to defending against future cyber threats, Matarazzo said.
"Cybersecurity is a naturally diverse topic because there is no usual suspect," Matarazzo said. "If we think of a cyber threat as a cancer, our goal is to detect it and mitigate it as soon as possible. So using automatic data science and data-driven techniques I think is important, along with a collaborative, innovative team approach."
A recent study by the American Association of University Women reported that in 2013, only 26 percent of computing professionals and 12 percent of working engineers were women. Greater gender diversity in fields such as data science can bring new perspectives and broader innovation to complex problems. A Harvey Mudd College study showed the important role that women-focused conferences play in increasing the number of women in male-dominated fields by providing networking opportunities. The Women in Data Science Conference aims to do just that — providing a venue for learning and connections with leading data science innovators.
An archive of the event's keynote addresses, talks, panels, and interviews is available on YouTube. The next Women in Data Science Conference will be held Feb. 3, 2017. For more information see www.widsconference.org.