- D J Patil says data scientists should focus on the problems they want to solve, and develop a passion for playing with data. If they do that, the skills will naturally show up.
- He says data isn’t there to replace humans; it's there to augment us — to help make us more efficient, to make us faster, to double check our work
- He feels big data is going to have a huge role to play in healthcare, pandemics, climate change, education, and security, particularly cybersecurity
We arrive a few minutes late, thanks to a traffic jam that is now routine on the Outer Ring Road in Bengaluru. In the lobby of a hotel in Cessna Business Park, which houses Cisco, Flipkart and InMobi, D J Patil settles into a chat with my colleague Sriram Sharma.
Patil comes across as an easygoing guy for a scientist. In the world of data science, he’s a rockstar. He was the first chief data scientist at the White House, handpicked by then US president Barack Obama.
Patil and Jeff Hammerbacher coined the term “data scientist” in 2008.
What is relatively unknown to many is that he has also helped companies like Ola, Flipkart, Saavn and Hike with problems around data. “I’ve been fortunate to give them some advice. But let’s be clear, all the success is theirs, we’ve just provided advice and they do the rest,” says Patil, who lives in Silicon Valley, where his father Suhas Patil is a bit of a legend.
Patil senior was one of the immigrants from India who landed in the United States in the 1970s with $8 in his pocket (in those days, that was the maximum forex you could carry). He founded Cirrus Logic, a successful semiconductor company. He also cofounded TiE, which is now an influential, global not-for-profit organisation.
Over the years, data science has become a major buzzword, and this was an opportunity for us to hear from him about the growing field. In the first part of our interview with Patil, 43, we talked about topical issues such as how data can influence elections and the ethical questions a data scientist must keep in mind. You can read it here. This is the second part of the interview.
How did you become interested in data science?
I wasn’t a very good student. I tended to do quite poorly in all of my classes, but I love science. I had a physics teacher who let me borrow equipment from the lab and I started doing science on my own. I was playing with all these things, so I was learning a lot, but I just didn’t have the math skills.
As a result, I didn’t get into college right away, but in the United States, we have this great thing called community colleges or junior colleges. I was also able to simultaneously petition my way into the University of California, San Diego. So, I went there, after a semester, for a quarter. And I took every math class I could. I just fell in love with it.
I just loved to play with data. I was always using data in some form. And then, when I went and started my PhD to work on chaos theory, the guy who trained me is Jim Yorke, and Jim is the guy who coined the term “chaos theory”. One of the things I was doing was trying to get all the weather data that I possibly could, to look for weather patterns. I was very interested in patterns in the weather. We characterised those patterns, and found it a very interesting type of discovery, which led to a new way of thinking about numerical weather forecasting. So that was, you could argue, in many ways, a type of data science.
People in science have been doing data science forever, going back to Kepler and Copernicus, because they were trying to understand orbits. The goal for me was: how do you take data and turn it into something of use? In this case, how do we take satellite observations, and outputs of computer models, and turn it into a good weather forecast?
How much influence did your father have on your career?
My father really helped me understand the importance of what it means to really deeply understand a subject, as well as the focus that is required to get to that level of understanding. It doesn’t come easy, and you have to dedicate yourself to the subject. My father also taught me the importance of focusing on creativity in the process of building new things. For example, most of our furniture growing up was hand-built by him. And we used to do all our house and car repairs ourselves. We didn’t grow up wealthy, and my father made sure that we understood that the real value is in creating things of value and making sure you are able to add more value than you take. (Here’s a commencement speech that gives some insights about him.)
How long does it take to become a proficient data scientist? It’s an intersection of a lot of disciplines, to start off with. How long does it take, if you’re a student starting from scratch, and would like to be hireable in the tech industry?
I think the greatest thing to focus on, as a data scientist, is not necessarily all the skills you need, but to focus on the problems you want to solve, and develop a deep sense of curiosity and a passion for playing with the data. If you do that and you work on hard problems, the skills naturally show up. You learn the things that are required to find solutions.
Big data and how it’s joined at the hip with AI
A lot of the things that we’re seeing in AI were real ideas that were started in the 70s. Why are they working now? It’s because we can collect data in such large volumes now. I was very fortunate enough (sic) to be exposed to many of the ideas of AI that were discovered in the 70s. What we were doing with data science is the beginning of that reinvention, of what was being done back then with AI. Machine learning, AI, whatever you want to call these things, these are new ways to ask a computer to understand, or reason, or make assessments. The most important thing I tell people about this is: data isn’t there to replace a human. It works well when we use it to augment people — to help make us more efficient, to make us faster, to double check our work. A spell checker is a form of machine learning. So these are things that have existed, but now we’re seeing a different level of sophistication.
Online courseware or MOOCs that you would recommend?
The thing I would tell a data scientist is — whether it’s Coursera, Udacity, Upgrad, any of these different things. There are so many programmes out there that you can learn. More important is that you got to play with a set of data. You can do that on these competition sites like Kaggle. There’s (sic) also phenomenal data sets if you wanna work on data from the United States; go to data.gov, and you’ll find all the data sets that the federal government has released.
How will data science evolve in the next 5-10 years?
I still think we have a long way to go with healthcare. Tailored treatments, cancer research, population health. We’re going to see new diseases emerge, like Ebola, and as we’ve seen happen. Pandemics — big data is going to have a huge role to play there. Climate change, and understanding how do we address climate change. That’s going to be another area. I think we’re going to see new ways that we’re going to be able to tailor education, and find new ways to be able to help people learn fast, efficiently, cost effectively. One of the big ones that we’re going to continue to see is data playing an increasing role in the realm of security. In particular, what people refer to as cybersecurity.
Big data and its role in medicine
The big programme that president Obama launched is called the Precision Medicine Initiative. And this is the idea to allow people of all races and ethnicities that are in the United States to contribute their data. And the idea is to build the largest dataset ever of healthy people. A number of those people will eventually get sick like all do, and we’ll be able to study it and find patterns of health, but also the pathways of how we get sick.
Some people call this population health. You know, try to understand at a much richer level, why does one person go one way and another person go another way. In this will also be very detailed genomic data, the data about how your DNA is ordered, and what that really actually implies.
Precision medicine does exist in a number of areas right now. If you have cancer and you’re next to a top-tier university in the United States, you can get that tailored treatment. We can treat your tumour with that specialised type of care. How do we make sure that’s available for everyone?
Toughest challenges you’ve tried to solve with data
We spend a lot of time trying to figure out the hardest problems in local communities or the local town. One of them is the criminal justice system in the US. We, unfortunately, have a system that has had a lot of people going to jail, and then (when) they come out of jail, somehow they get put back in jail. It is roughly 11+ million people, going through 3,100 jails. They stay on average for 23 days, 95% never go to our long-term prisons. Some estimates are, one-third of these people have mental health issues. These dollars prevent a teacher being hired, a park being built.
So, one of the things we started was what’s called the Data Driven Justice Initiative under president Obama. The idea is really simple. It says: take the healthcare data, take the criminal justice data, and look at it together, and see who are these people that are constantly going in (jails). And instead of taking them to jail, take them in, and put them in the right type of treatment. Doesn’t sound like rocket science, because it’s not. It’s just simply comparing spreadsheets.
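For readers curious what "comparing spreadsheets" might look like in practice, here is a minimal sketch in Python. It is not the Data Driven Justice Initiative's actual method: the records, the field names (`person_id`, `booking_date`, `mental_health_flag`) and the function `frequent_with_health_needs` are all invented for illustration. It simply counts how often each person appears in a jail booking sheet and checks those names against a health sheet.

```python
from collections import Counter

# Hypothetical, made-up rows standing in for two separate spreadsheets.
jail_bookings = [
    {"person_id": "A1", "booking_date": "2016-01-04"},
    {"person_id": "A1", "booking_date": "2016-03-19"},
    {"person_id": "A1", "booking_date": "2016-06-02"},
    {"person_id": "B2", "booking_date": "2016-02-11"},
    {"person_id": "C3", "booking_date": "2016-04-23"},
    {"person_id": "C3", "booking_date": "2016-05-30"},
]
health_records = [
    {"person_id": "A1", "mental_health_flag": True},
    {"person_id": "B2", "mental_health_flag": True},
    {"person_id": "C3", "mental_health_flag": False},
]


def frequent_with_health_needs(bookings, health, min_bookings=2):
    """Return IDs of people booked at least `min_bookings` times
    who are also flagged in the health sheet."""
    # Count how many times each person shows up in the booking sheet.
    counts = Counter(row["person_id"] for row in bookings)
    # Collect the people the health sheet flags.
    flagged = {row["person_id"] for row in health if row["mental_health_flag"]}
    # Keep only those who appear on both lists.
    return sorted(
        pid for pid, n in counts.items() if n >= min_bookings and pid in flagged
    )


candidates = frequent_with_health_needs(jail_bookings, health_records)
print(candidates)  # prints ['A1']
```

In this toy data only A1 is both booked repeatedly and flagged in the health sheet. Real records would need careful identity matching and privacy safeguards, which this sketch leaves out.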