Jonas Christensen 2:23
Serg Masís, welcome to Leaders of Analytics. It is so good to have you on the show today.
Serg Masís 2:31
Thank you for inviting me.
Jonas Christensen 2:32
It is my pleasure altogether. And we have some very interesting topics to talk about today, because you are a data scientist in an industry that I think touches everyone every day. There's a lot to explore in terms of what you do there and how you use data science to basically feed all of us. I'm not going to go into the full detail of that, because we want to hear it from you. You are also the author of the book ''Interpretable Machine Learning with Python'', which is a really interesting book that we're going to hear a little bit about as well, but we should hear it all from you. So perhaps, Serg, could you start by telling us a little bit about yourself, your career background and what you do?
Serg Masís 3:19
Sure. My career has been kind of a very long and winding road to data science. There's quite a lot of variety. You know, among my colleagues there are quite a few like that, so I'm not gonna say it's unconventional, but it's certainly different. I started in web development, because back when I started two decades ago, that was the shiny ball, the same way AI is right now. Everybody wanted to work in web. And so that's where I worked. Plus, it was interesting for me not because it was popular, but because the knowledge of the world kind of converged there. And that knowledge became organised over the years. It became plentiful, and slowly I got into data science, because I realised that there was so much I could do with the data. Even as I was working in web development, I realised that there was a time in which there were limitations on the data that was stored in databases, on how much bandwidth there was, on how big the websites could be, and so on. But as those limitations started to be lifted, there was just so much interesting data coming in and being stored long term: visitor data, engagement, how long they were on a web page, and so on and so on. So my foray into data science was first and foremost through analysing those metrics, at least professionally. I started specialising in that, and I became less focused on the building part. In fact, the building part started to be more informed by that data coming in - the usage of the websites - which is the way it should be, I think. You're tailoring the product to the user with the data they provide, after all. So I thought that was beautiful, the way it would just kind of serve that purpose. Then, as I said, I became less and less interested in the building, and I created a startup, which was a search engine for things to do - so, events and places. The idea behind the search engine was to create spontaneity. How do you really create spontaneity? And I thought, ''Okay, well, how about you shake your phone, and it tells you something quasi-random to do?''. Something nearby, something practical. And you can swipe right and left as you do with many apps these days. And yeah, that idea ended up taking me to Boston. It was incubated by the Harvard Innovation Labs.
Jonas Christensen 5:53
Oh, wow. Just to give a bit of context, Serg, what year are we in here? Your 20s? When you talk about web, it's literally almost the history of the internet.
Serg Masís 6:06
Yeah, for sure. Let's put it this way: I worked as a web designer and web developer for about eight years, ending around 2006. And then I became somewhat of a manager of web development projects. And then I started to get more into web marketing, analytics, and so on, for the following eight or so years. Eventually I was the webmaster of a large poker site. And that was - I felt like I had reached the pinnacle of what I could do in my field. You know, I was directing the web development projects, and at the same time I was looking at the analytics and figuring out how to best enhance them. And there were just so many different things: the marketing and, you know, CRMs, and so on. So, yeah, that takes me to about 2015. And that's about the time I founded my startup.
Jonas Christensen 7:05
Yeah. So at this time, trying to put your startup into the context of the technological evolution: people are doing stuff on their phones, and we have the iPhone 6, or something like that, at that time. Yeah.
Serg Masís 7:20
Yeah. And I kind of skipped that, but in the period between 2006 and 2014, I also got into mobile development. So I learned a lot about mobile development. I worked in a mobile startup. I also freelanced making mobile apps. So I was well versed in that part as well. So I took all those skills - the web skills, the mobile skills, the marketing skills, and so on - and I channelled all those efforts into this search engine app. I guess the one skill I didn't have sharpened well enough when I started was the data science stuff. Of course, I had to analyse and store large amounts of data, but this was the first time I ever did any machine learning. I had done things like logistic regression, but not really challenging algorithms. I hadn't done any k-means or neural networks, or anything like that. So I started working on that problem, and I realised I needed machine learning. I needed to store data in a structured manner, so I started using Cassandra and MongoDB, and all these things. I started working with geospatial data, because it's all places and events. So I needed to learn more about how to store it and, more importantly, how to query it - to see if a point was within a certain radius of a certain location, and so on. So that was a huge learning experience, technically.
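To make that kind of geospatial query concrete, here is a minimal sketch in Python of a point-in-radius check using the haversine formula. The coordinates and radius are invented for illustration; in practice a datastore like MongoDB would handle this with its own geospatial index rather than application code.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def within_radius(point, center, radius_km):
    """True if `point` falls within `radius_km` of `center`."""
    return haversine_km(*point, *center) <= radius_km

# Hypothetical example: is an event within 2 km of the user?
user = (42.3601, -71.0589)   # Boston
event = (42.3736, -71.1097)  # Cambridge
print(within_radius(event, user, radius_km=2.0))  # False: roughly 4.4 km apart
```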
Jonas Christensen 8:52
I must say, you didn't start with the simplest data science problem you could have started with.
Serg Masís 8:57
No, no, not at all. But that's what I wanted to do. I felt like all these mobile apps were trying to solve a problem, which was ''Let's get people to the events that they're interested in, or the places they're interested in'', and so on. But they were missing out on spontaneity. That was the premise of all of this. Because the thing is, if you opened that kind of app, and then a person next to you opened the app, you would get the same list, either sorted by price or sorted by rating or whatnot. There was no difference there, and you two are two different people. So how do we create that in a digital era? You know, that idea that you're just walking down the road and you happen to find something and you know nothing about it. The way it used to be: you would stumble upon a restaurant, you didn't know what rating it had, and it happened to be the best experience. I think that idea of expectations kind of trumps the enjoyment. Because when you spend a lot of time optimising a certain thing, a certain activity, your expectations become higher, and therefore any little thing that happens is a letdown. So that was also part of the philosophy. My co-founders are neuroscientists, so that was an important part of the offering. We were trying to make decision making easier, and kind of counter all these things that actually lessen the enjoyment of our leisure time.
Jonas Christensen 10:34
So then you have this app, and you're not doing that anymore, because you're a full-time data scientist somewhere else. So how did you end up moving away from that and becoming one?
Serg Masís 10:45
Well, it's simple. What happened was the idea - I still think it was a great idea, but the timing was off. Maybe it was ahead of its time in some ways, but it was certainly behind the times in terms of funding. Because what investors would tell us after a pitch was that they loved the pitch and they liked the idea, but they would have invested in it two years before, when they were investing in that kind of idea. Now they were investing in other things. So that was very difficult. At the end of the day, the Hail Mary would have been to win a competition. We were competing at - Harvard has this competition for startups, and the prizes are incredible. Like, $150,000, which is huge for a startup. It just needs a little bit of funding.
Jonas Christensen 11:39
It's proper seed funding, really, if you can win that.
Serg Masís 11:43
Yeah, exactly, exactly. So we had some seed funding, but this would have been enough to take us to another level, where we could have got, you know, proper rounds of investment. But that didn't happen. We got really close - we were among the top five, and the top three would have won the award. Because of that, that was the end of the road for us. We were out of runway. We had no more funding. And so I had to figure out, ''What do I do next?''. And what I had realised making that product was, as much as I liked the product, what was most interesting to me was how people were using it - the statistics of people using my product. In fact, it was annoying, because since we had such limited funding, we couldn't afford to hire other developers to fix bugs, and so on. So I had a huge list of bug fixes, and it annoyed me, because it was like, ''Okay, I'll take care of the most important stuff, and this other stuff, these freak bugs, I'm going to get to later''. Because to me, it was far more important to look at the statistics and see how people were engaging with the app, and other things - what their opinion of it was, analysing their comments. There were so many different things I was more interested in. And I realised, ''Well, that should be what I do next. I should commit to this field'', instead of defining myself as a builder, because I had defined myself all those years as a builder. But at the end of the day, my heart wasn't in it as much. The products I wanted to build had more to do with the data than with, you know, an end user per se. So I felt like, ''Yeah, I should do this''. So I started to look for schools to study at. As I usually do with my decision making, at the very least I pull out a spreadsheet, and I start to rank things. So I looked into cities, places, potential salaries - all sorts of things, all data - put it there, ranked it, sorted it, put a weighted index on each one, and then realised, ''Okay, I should apply to these top 10 schools''. So that's what I did. I got accepted by a few. I went to one in Chicago. That's how I really committed to data science.
Jonas Christensen 14:16
Great story, and you're highlighting something - or maybe you're not highlighting it, but this is what I'm picking out of it, at least. I've had a similar journey, and startups actually have this hidden benefit: they really teach us a lot about ourselves, whether we succeed or not - success measured in the monetary sense of ''Did we become multimillionaires or not?''. And you often see people finding what they really, really want to do out of that process, because you're forced out of your comfort zone a lot. A few years ago, I had a startup in kind of a similar situation. My brother-in-law and I had founded it, and we had worked on it for a few years, and we were looking at getting funding. We got some investments, but over time we ran out of steam. We ended up selling that business to someone else; it's still there. But the point was: I found out that I really didn't want to do that for the long term. I actually wanted to go into data science, which I'd worked in in the past. And I found myself through that. My brother-in-law found that the content we were creating - this was a marketplace for cycling products - the content we were creating around cycling products and the cycling scene, he loved doing that. So he's become a cycling coach and a professional YouTuber out of that. He's found a completely different direction. So the business was not that successful in the sense that it didn't become a multimillion-dollar success, and we didn't get a big house on the beachfront and all that stuff. But we did find ourselves. And this is kind of what I'm hearing from you as well: that you've really taken this journey, and you've actually found a really happy place. Is that fair to say?
Serg Masís 16:06
Yeah, definitely. It was a huge learning experience. I think I wouldn't be here if it wasn't for that, literally. Not only in data science - I probably wouldn't be speaking to you. Because I was, like, super shy. Even though I had all this career behind me - I was no stranger to speaking in boardrooms and to CEOs - I was still really shy. And I was afraid of speaking to crowds and, you know, putting myself out there, for lack of a better word. And the startup forced me to do that. I think there was probably no better way of doing it, because if you're not selling your product, nobody will sell it for you. It's really bare bones in that sense. So I found myself in the street, pretty much, with a table, giving out flyers and talking to people about it. Not only investors - you have to have people trying your product. I had never done something that bold, I think, in my life. Even though I thought I was bold, I was just the introvert's version of what bold is. Like, ''I'm gonna be bold from my computer'', or something of that nature.
Jonas Christensen 17:18
Yeah, absolutely. We do seek comfort naturally with these things, so we do tend to drift towards it, and we have to actually pull ourselves out of that, especially when it's a startup. Because typically you might like being the data scientist, but you're also the marketing manager. You're the CEO. You're the CFO. You're the people-and-culture person and the webmaster, and all the roles are combined. They're all there - it's just one person rather than 1,000 doing it. It forces us to step out of our comfort zone. Okay, Serg. So now, if we fast forward to today, you are a data scientist at a company called Syngenta. Could you tell us about that company and what you do there?
Serg Masís 18:02
Okay. Syngenta is a large agribusiness company. They sell the products many companies of that nature do: chemicals to spray on the ground - herbicides, pesticides, insecticides. They also sell seeds. That's the core of the business, and they've been doing it for a long time. They're based out of Europe, so they follow very strict European standards for all those things. But their focus, as of several years now, has been on sustainability, on the environment. And it comes at a good time, not only because of climate change, but also because the practice of agriculture has become something quite unsustainable, in the sense that there's a very complex set of conditions that have to exist for your yield to be maximised. And that's what farmers are seeking, not once but season after season. So it's really tough to keep up with that, as people also increase their needs. There are more and more people, and among those people, more and more are seeking the same kinds of foods and the same level of quality that exist throughout the West, right? So a lot of what we consider staples - soy, wheat, corn, sunflower oil - a lot of these commodities are, as proven by the conflict in Russia and Ukraine, not only imported but existentially important for countries. They're resources that the livelihoods of billions of people rely on. So, my focus at Syngenta is not on the chemical products, not on the seeds, but on something else, a new part of the company called digital agriculture. And the focus of digital agriculture is not ''Okay, we're still relying on the same tools, we're still selling pesticides and all these things''. What matters is the way we're applying them. It's no longer ''We're going to indiscriminately spray as much as possible''. Before, a farmer had no way of knowing what the best practice was, and they didn't have the tools to monitor it. But now, with the cellphone, with satellite images, even with drones - there are so many tools: connected tractors and so many different things a farmer has at their disposal now - the farmer can track and monitor and be precise in the application of these chemicals. And by precise, I mean it's not only timewise, as in ''You're going to apply these chemicals at this time, and in these parts of the farm you're also going to make sure to plant at this time''. There's a recipe that can be formulated to optimise yield and minimise exposure to disease, while also being conscious of the environmental impact of what they're doing. That's also an important part of the recipe. So, what I do for the company - I can't go into a huge amount of detail because of NDAs, but I can tell you I work a lot with plant disease and plant growth. The data I work with and the models I build have a lot to do with these problems. Diseases in particular - they can be fungi - and not only diseases but also pests. Insects are a huge problem for agriculture, but they can be managed really responsibly. So we're focused on that problem: making sure that the data we have can inform the farmer on these best practices through the models we provide.
Jonas Christensen 22:08
Yeah, it's very interesting, because I'm sitting here imagining what you might be doing. And of course, you highlighted that you can't tell us all the detail, because it's commercial intellectual property. But I'm imagining there's a global dataset of, let's say, theories you can generalise from: this is how agriculture works, you add so much of this and you keep the pests away, you water this much, you optimise the growth, you need this much sunlight, and so on. And then there are the very, very local conditions that you have to measure for that particular piece of land in that particular location. How are you collecting these datasets, globally and very locally, for that farmer?
Serg Masís 22:54
Well, the company has a global presence. So throughout the world, we have agronomists that go to fields and collect data, because we have to test what's going to work in this season, in the next season and so on. If we come up with a new product, we have to test it before we roll it out. We also have scouting missions, and we have farmers that collect data for us - a whole feedback loop there. So we collect data that way, and it's getting more and more precise. It used to be that someone would go out and say, ''Okay, the condition of the ground is excellent or good or great''. Now you can actually put in equipment and measure the pH of the ground, the salinity, and so many different things. We can figure out exactly what kind of composition the ground has, which is super important, because what the soil has is the livelihood of that farm. If you have good soil, you'll have fewer problems producing the crop you have to produce. And if the soil is very porous, for example, it might capture water better. Capturing water is really important, especially if you have very dry conditions, or very wet conditions. So how the soil interacts with the amount of rainfall it gets or doesn't get is very important, not only for the growth of the plant, but also for the conditions that make a pest or a threat thrive or not. Very moist conditions can lead to fungus, or can lead to a certain kind of larva of a little insect that will replicate season after season and eat away at your crops. At the same time, very dry conditions can also lead to certain problems. So it's really hard to thread the needle on what the best conditions are for what crop at what time. As I said, it's a very complex recipe that has to be given to the farmer. And it's no longer the case that the old wisdom still applies, because it used to be that a farmer would rely on what his father or his grandfather would tell him, because there was a constant rhythm to the seasons, right? But that's all broken now. Anything can happen. You could have a once-in-a-lifetime flood event and a once-in-a-lifetime dry spell within the same decade. So farmers have to deal with that, and we're trying to help them anticipate those things better.
Jonas Christensen 25:45
So, you're really offering here data as a service or analytics as a service to these farmers on top of the core product?
Serg Masís 25:53
Yeah, we have several products that are all being rolled out together. We have a product - and this is no secret - called Cropwise. And the models I build end up in this product eventually. As farming is seasonal, something I've worked on this season might not show up in the product till the next one, because it still has to be tested.
Jonas Christensen 26:15
Yeah, I can imagine it sometimes takes years to roll these things out, because you've got to pay attention to those seasons and have at least one season of tests before the application the year after. So you can quickly add a couple of years there. Serg, this is a very fascinating topic, and it actually touches all of us, because we all have to eat every day to stay alive. It's pretty critical in that sense. What are the biggest challenges facing our global food system? And how do you see data science helping to solve some of them?
Serg Masís 26:50
Well, there are several. There's climate change, of course. There's poor land management as well: some of the most fertile soils in the world are used for urban development. Or other lands, which are useful for protecting the lands that are fertile, are being developed on or mismanaged. So you have cases of deltas that were drained, and now other lands become flooded because of that. There are a lot of cases like that. You also have poor farming practices - a lot of tilling in places that shouldn't be tilled. You have cases of the nutrients of the soil, the topsoil, being blown away. And then you also have instances of contamination - for instance, sewage ending up in the rivers, and there's this whole cycle: it may disrupt the fish, or it may disrupt another part of the ecosystem, which then in some way impacts farming. I think data science can work at any of those levels. And that's just on production - the part I work in specifically, which only has to do with what happens on the farm and nowhere else. It's all part of a chain. Then you have supply chain issues: how does the stuff on the farm eventually end up on people's grocery lists and then their tables, and how does that part get optimised? Because there's a lot of waste throughout the system. There's waste in transportation. There's waste in the supermarkets, when food is discarded because it went bad or because it didn't meet certain standards. There's waste in restaurants and in homes themselves. So there are also a lot of data science companies, or companies leveraging data science, dealing with that part of the problem. But I certainly think we could do a whole lot better as a species in our management of food on all the levels: production, transportation, and then actually optimising the delivery and consumption of food.
Jonas Christensen 29:23
And I think most people don't realise, or maybe don't appreciate, how complex the global food system is. I'm by no means an expert here. But you talk about waste - it's quite astonishing how much is wasted. I'm going to quote here without really knowing, but from what I've read before, easily 20-30% of what gets produced doesn't actually end up in anyone's mouth at the end of the day. It just gets lost through waste in that process. At the same time, you highlighted the Russia-Ukraine conflict, which has led to countries in Africa not having enough food to feed their people. So we think about that relatively distant plot of land in Ukraine - not massive in the scheme of the size of the planet - leading to countries far away actually not being able to feed their people. It doesn't take much for the whole thing to start being quite shaky at the corners. Yeah, it's such a fascinating thing. With the climate change processes, you've talked about deltas being ruined, you know, flooding, or the lack thereof, displacing people. We don't even know what that means yet, when that happens in a place with 100 million people. So yeah, very fascinating. I hope you can solve it for us.
Serg Masís 30:37
I'm just tackling a small part of that problem. But I think we all have to. I mean - I hate to be a pessimist - but I think sooner or later it's all going to become really obvious to everybody, when people start seeing that one supply chain issue, or one massive catastrophe on the other side of the world, is causing them not to be able to have the food they love. These disruptions, I think, are gonna become more and more common. In a way, it's a good thing, because I think people are going to be more conscious of where this food comes from and what we need to do to actually mitigate these problems. Because it can't all be in the hands of people like me or the farmers. It has to be in the hands of consumers too, and the choices they make.
Jonas Christensen 31:27
Yeah. So, Serg, here's your opportunity to step up on the soapbox, because I'm interested. You look at this data every day. You look at longitudinal data. You can see trends in weather patterns, crop yields, all this stuff. And you're seeing, from around the world, where the issues are. For the layman, what are the things we should all be paying attention to as consumers, and in how we behave?
Serg Masís 31:52
Well, I think we should focus on where our food comes from and what kind of ingredients are included in it. I don't want to preach to everybody, ''Oh, this is what you should eat, and this is why'', but I think it is important that we have a better connection - especially for those of us who live in cities, as I have most of my life. It's really hard to understand the journey our food goes through. And once you understand it, you can realise what kind of changes you could make. Perhaps it's a question of going to the farmers market every Saturday or Friday, whenever it is in your neighbourhood. Maybe it's not a question of getting everything there, but just getting the things they offer that you would eat. Because it's taking a shorter journey. Because it's helping someone local. Because maybe it's organic, and you like organic food and trust it more. There's a whole different set of reasons that apply to everybody. If you live in a coastal area, maybe you think, ''Why am I buying the seafood in the supermarket? Why am I not going directly to the fishermen?''. There are things we can do to actually mitigate these things. Also, there might be reasons that are highly local. Like, you realise, ''Oh, this kind of food that I'm eating is affecting the coral reefs, and we want to protect the coral reefs'', or ''This kind of food that I'm having is creating algae blooms in the ocean and killing the fish'', or ''It's wreaking havoc in this ecosystem, because they're destroying all the forest to produce this''. So I can't give anybody a recipe, because you have to find out for yourself. Because, as you've said before, food is very local. And so it's important to find out, in your region, what the dynamics of that food are, what kind of journey it takes, where it's coming from, and what kind of impact it's having on the local ecosystem.
Jonas Christensen 33:57
Great, thank you for that, Serg. I will leave it at that, and I actually want us to change topics a little bit now. It's adjacent, because it's about data science, but I'm really interested in your book and hearing more about that from you. So, you're the author of the book called ''Interpretable Machine Learning with Python'', which is published by Packt. Could you tell us about this book and why we should read it?
Serg Masís 34:27
Well, because I think it's an important topic. Interpretable machine learning - let's first divide that into parts, okay? Machine learning: people think of machine learning as, you know, a drop-in replacement for software, or some kind of enhancement to software. In a way, it's a method used to make a product. Machine learning just takes a certain set of inputs and outputs something. So it can be used for pretty much any case where you need that kind of processing - whether it's for forecasting, for prediction, for classification, it can handle it. The problem there is, as you said, you're building a product, and you want to trust said product. And in traditional software development, whenever you want to trust a product, you have to understand all its components, right? I had this frustration with my startup, which had several machine learning models, and I was asked to find out why a certain output was occurring in the software. I would dig deep into it and realise, ''Oh, that comes from the model. Such-and-such model is outputting this'', and that was the end of it. That was the end of the road. There was no ''Okay, what's going on inside the model?''. Because it's a very complicated question to answer: what's going on in the model? So that's where the interpretable comes in. Fundamentally, the reason it's important to find out what's going on in the model is that you want to trust the model. Therefore, you want to understand it. Therefore, you want to be able to explain what's going on, or at the very least have a theory of what's going on. That's why we use ''interpretable''. It's not something that necessarily comes with 100% certainty, because, after all, depending on the kind of model it is, it might be too complicated to actually dissect to the point where you know how one number comes in and another one comes out - taking that journey would go through hundreds of thousands of parameters. But you can have a sense of certain properties of the model, like: What features are most important to the model? What values of each feature are important? In other words, you're able to say, ''Okay, for this credit rating model, the most important feature is your income''. Then you want to know, ''Okay, but how is it important?'', and you'll be able to say things like, ''As the income becomes higher, the likelihood of you getting a good credit rating, or becoming less likely to default on a loan, goes like this''. You'll be able to show whether it goes up or down, and how. That's another property you can explain. You can also explain how it interacts with other features. For instance, in this hypothetical credit rating model, you can say, ''Well, income is important, but also collateral'', and the two interact in a certain way - one might enhance the other, or there might be cases in which one feature actually counters the other. In other words, one may increase the likelihood of default, and the other might decrease it, but they act in tandem. You might want to know that for a number of reasons. You might want to know it so you can re-engineer a feature, because it's doing something it's not supposed to. Or you might want to remove outliers from the training set, because you realise, ''Oh, this doesn't make sense. It's counterintuitive''. So you might want to do something to improve it.
So it also serves that purpose. It's not just about understanding the model. Once you understand the model well enough - just as the word ''debug'' suggests for software development - you can do the same with models. You can realise, ''Oh, maybe I can improve it''. And there's a whole set of tools you can use to improve models based on those understandings, whether it's bias mitigation, monotonic constraints, or adversarial robustness. There are so many different tools you can use to improve a model beyond what is traditionally seen as improving a model, which is ''Okay, let's just make the model more accurate''. Because accuracy is not everything. After all, you can have a model that predicts for the wrong reasons. So you have to be careful about why it's predicting something, because it might throw you completely off. So, I don't know, did that answer your question?
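As a concrete illustration of the properties Serg describes - which features matter most, and how a feature like income pushes the prediction - here is a minimal sketch using scikit-learn's permutation importance and partial dependence on a synthetic credit-default dataset. The data and feature names are invented for the example and are not taken from any real credit model or from the book.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence, permutation_importance

rng = np.random.default_rng(0)

# Synthetic "credit" data: default risk falls as income and collateral rise.
n = 2000
income = rng.normal(60, 20, n)       # thousands per year
collateral = rng.normal(100, 40, n)  # thousands
noise = rng.normal(0, 10, n)         # an irrelevant feature
y = (income + 0.5 * collateral + rng.normal(0, 25, n) < 90).astype(int)  # 1 = default
X = np.column_stack([income, collateral, noise])
names = ["income", "collateral", "noise"]

model = GradientBoostingClassifier().fit(X, y)

# Which features matter? Shuffle each one and measure the score drop.
imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, mean in sorted(zip(names, imp.importances_mean), key=lambda t: -t[1]):
    print(f"{name:>10}: {mean:.3f}")  # income and collateral should dominate

# How does income matter? Average predicted default risk across its range.
pdp = partial_dependence(model, X, features=[0], kind="average")
print(pdp["average"][0][:5])  # risk should trend downward as income rises
```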
Jonas Christensen 39:19
I think it did. And there really is an underline under the word ''interpretable'' in the title ''Interpretable Machine Learning''. I think one of the best ways to explain to people why they should care about machine learning models being interpretable is to talk about how much machine learning models actually run our day-to-day, without us even, maybe, appreciating it. If we assume there are some different types of listeners here listening to our conversation: there will be people who are actively involved in building models, there will be people who are actively involved in using their outputs in a business sense, and then there are others who may have an adjacent understanding of how it all works - you put data in and a recommendation comes out, or a classification of some sort, typically. But really appreciating how that interacts with what we actually do day-to-day, and how an uninterpretable model can create damage for individuals or groups of people, is for me a really, really important thing for everyone to understand these days. Serg, could you give us a few examples of how we are impacted by these sorts of models every day without really thinking about it, and the problems that can arise when we don't pay attention to interpretability?
Serg Masís 40:45
Certainly. There are ways that are not very obvious. For instance, perhaps you get recommendations from social media, and the recommendations are biassed. They're taking pieces of information about you that may be wrong, or they're making certain assumptions, and the model is just predicting on them because it was trained on them. It might also favour certain groups. For instance - we were talking earlier about my startup and how it was in the places and events space, if you will - take a product like Yelp: it might recommend you the same set of restaurants because they happen to be on a well-trafficked street. They become better known because of that. That, in a sense, is a bias by itself. There's nothing inherently better about that street than the one right behind it, except that it's in a well-trafficked area. A lot of biases come from the data. The model isn't inherently bad, or inherently biassed - it just has a way of amplifying bias. So the first step in looking into biases is tracing them back to the data and the data generation process. How was that data collected? Was it manually entered? Is there some way of actually tampering with it? In the case of Yelp: yes, there is. People can tamper with ratings. They do it all the time. So there's a whole set of things you have to think about in that case. Another example of ways bias affects us - and I think I alluded to it earlier - is credit ratings. A lot of things we do day-to-day, at least here in the US, are impacted by credit ratings: how much limit you have on your credit card, what kind of housing you have access to, whether an employer accepts you or not - they might look into your credit rating. These are also algorithms, a lot of them informed or misinformed by data they have about us. And often we have no direct access to this data. So those are two examples I can think of of how machine learning is used and can impact us. Depending on where you are, it could impact you in other ways. There's even machine learning being used in supermarkets these days - the way they have cameras looking at you, making sure you don't steal anything. It could certainly happen that there's a false positive, and then you find yourself in an uncomfortable situation where a supermarket guard is chasing after you. So we have to be really careful, whether it's for fairness or safety, about how the models are deployed. They can have an impact. Oh, another more recent example has to do with real estate. Zillow recently had to fire a large portion of their workforce, because they took a model that was designed for one purpose and repurposed it to then play Monopoly with their own money. And it didn't go well, because it turns out their model wasn't designed for that. So that's another important thing: when you build a model, you build it with a purpose. It wasn't trained on the right data for anything else, so you don't have the level of confidence that it can do anything but what it was designed to do. You have to make sure that everybody understands that's the purpose of that model, and that it should not be used outside of that scope, no matter what. It should be clearly understood, even by people at the executive level, that it has limitations.
But when you're not really forceful about these limitations models have, it can lead to this callousness that can actually impact a large number of people, who will lose their jobs as a result of it.
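Since Serg's first step is tracing bias back to the data generation process, one simple starting point is comparing outcome rates across groups in the training data itself. This is a minimal sketch with entirely made-up data; the demographic parity gap it computes is one standard fairness metric.

```python
import pandas as pd

# Hypothetical training labels for an approval model.
df = pd.DataFrame({
    "group":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "approved": [ 1,   1,   1,   0,   1,   0,   0,   0 ],
})

# Positive-outcome rate per group, straight from the data.
rates = df.groupby("group")["approved"].mean()
print(rates)  # A: 0.75, B: 0.25

# Demographic parity difference: a gap this large in the training labels
# will be learned, and possibly amplified, by any model trained on them.
print("parity gap:", rates.max() - rates.min())  # 0.5
```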
Jonas Christensen 44:57
And there are literally hundreds of examples of how machine learning is impacting us in big or small ways, every day. Most listeners of this podcast will have used the internet to find it, and that's probably been served up by an algorithm. They might have found it on LinkedIn, or they might have googled something and this came up. And the result of that is not necessarily that this is the best podcast in the world, or the worst. Typically, it's just because the algorithm decided that for their search term this was the match - or, in the case of a social media feed, that because of some affiliation with you or me or someone else, this was the right thing to serve up. And that leaves chance out of the picture, in some sense, right? Like you talked about with your startup that was trying to create serendipitous moments, or quasi-happenstance moments - we actually have less and less of those, because we're being fed more and more of what an algorithm thinks we should have. The example I often think of is: you have some physical problem - aches and pains, or a cough, or something - and you google those things, and what comes up is almost a full diagnosis of what someone might have. But that doesn't mean it's the right diagnosis. It means whichever website had the best search engine optimisation for that particular set of ailments came up at the top, right? Typically, you go to the doctor and it's something else. Don't worry, you're not going to die, even though the internet tells you that. But I think that's an example of how we're relying on all these algorithms every day. We're googling things all the time, and that's an algorithm - a small example. We talked about supermarkets: the things that are placed in the supermarket may be algorithmically decided - what's on the top shelf and the bottom shelf - and it impacts what you buy. It's fascinating and a little bit daunting and scary for all of us. But I think it also puts responsibility on the people who build these systems. They actually have a design responsibility here, which I think is why this book is so important and this topic is so important. We're responsible as data scientists for actually putting this stuff into the world. We're the mechanics that make or break people's lives and livelihoods with this stuff. Maybe that's not always appreciated.
Serg Masís 47:26
I totally agree with what you said. I think something that not only machine learning practitioners, but all people working with algorithms, fail to consider is the human aspect - how humans are biassed. You just said something: people searching for their ailments, and the first thing they come up with is perhaps, you know, the worst-case scenario, like you have cancer or something like that. And that feeds into our confirmation bias, right? If that's what we were fearing the most, we'll be convinced that's what we're gonna get. We don't consider those things. There's also the use case of the criminal recidivism algorithm that was used in the US called COMPAS. I have it in my book; I discuss it. And what the algorithm fails to consider is what judges are more prone to do: if you give them three options - this person has a high likelihood of recidivism, this person is medium, and this one is low - they're more likely to take the medium and consider it to be high. So it's almost useless to have the medium. Or, if you're going to use scales, have different scales, because if you have a scale from one to 10, depending on the judge, they might value a 5 close to a 10, or something of that nature. It's really hard to come up with something that doesn't feed into our biases, because you have to consider under what conditions we're making those decisions. Whether it's a judge or yourself, everybody's prone to these things - just falling down the rabbit hole when we're interpreting what a model gives us. Whether it's a search engine or an algorithm we use at work, we might not make the decision that is optimal according to the model, because of the way we interpret the data that's given to us.
Jonas Christensen 49:27
It is such an important and, to some extent, mind-boggling topic. You hear the stories of models that are used to give people prison sentences, or what have you. We've talked about another one on the show here, from the UK, where during the pandemic students couldn't sit exams at school. So some bright spark at the Department for Education had the idea to algorithmically distribute grades to students across the country. And guess what? People from lower socio-economic areas got lower grades, because that's how it had been. And some people who had these particular attributes got a lower grade than someone else who had some other attribute, because that's statistically how it typically plays out. But for the individual, that's, of course, highly, highly unfair: because you are an average human, you're gonna get an average grade, even though that doesn't tell you anything about the actual ability of that person. It got pulled back, rolled back, and people lost their jobs over it. You can sit here and cringe, but this shows you how dangerous this stuff is. It's sort of bubbling under the surface every day, and we have a real responsibility to take it very seriously as people who deal with this stuff.
Hi, there, dear listener. I just want to quickly let you know that I have recently published a book with six other authors, called ''Demystifying AI for the Enterprise. A Playbook for Digital Transformation''. If you'd like to learn more about the book, then head over to www.leadersofanalytics.com/ai. Now back to the show.
So how does one make models more interpretable? What are the dimensions we should consider on this topic?
Serg Masís 51:26
Well, the obvious one is model selection. You can use models that are more interpretable - intrinsically interpretable, I say. And by that I mean: the simpler the model, the easier it is for you to explain how it works and how it's arriving at a decision. These days, there are plenty of models that perform almost as well, or just as well, as the most complex models, but are easier to interpret. The other dimension is how many features you use. When you have a very model-centric view, you just throw whatever you have into the model, and it's going to make sense of it. It doesn't really matter if it has 100 useless features alongside the 10 it's actually going to use. But the reality is that those 100 useless features create a lot of noise. They actually confuse the model. And even though it uses them less, it still uses them - it still thinks there's something of value there. So something you can realise very quickly with a lot of these models is that the more you simplify the inputs, the more reliable or generalisable the outcome will be. In this case, less is more, so it serves well to be minimalistic. You can arrive at the same outcome through regularisation or through feature selection; I recommend feature selection. There's really no point in having much more than the model needs. And this applies just as much to tabular data as it does to other domains. The equivalent for NLP is: if there's no value in having stop words, remove the stop words. If there's no value in having the ends of words, then do stemming. There are so many ways we can simplify our problem if we only give the model what it needs. With images, you have the same. You could simplify by removing the colours - if the colours aren't useful, remove them. If the background isn't useful, remove it. There's so much you can do. As for other things that can be done to make a model interpretable, there's a lot more in the model itself. Say you realise, ''Okay, I'm going to use an ensemble of decision trees like XGBoost, because that's what gets me the best results, but I'm still getting counterintuitive stuff. So what if I constrained the model?''. These days, you can make sure that certain relationships remain monotonic. And by monotonic, I mean: say you have a model for deciding who's going to get a scholarship - or maybe it's not ''Who's gonna get a scholarship?'' but something like ''Who has the highest likelihood of passing the final exam, given all this historical data?''. You might think, well, wouldn't it be completely unfair and counterintuitive if the people with the highest entrance exam scores didn't align with a higher likelihood of passing the final exam? So you could adjust that. You could put in a constraint, and what the constraint says is: as the entrance exam score increases, the likelihood of passing the final exam increases as well. If you look into the data, what you might find is that the reason the model isn't learning that is that in some places the data is sparse. In other words, you'll have very few people with low scores, very few people with extremely high scores, and maybe a few spots in the middle where you have fewer examples.
So what you end up with is the model learning something that is not exactly an increasing line, right? With a constraint, you can make sure of that. There are just so many different ways you can actually make models more interpretable, and that's just one of them. Because at the end of the day, if you can say, ''Okay, as this increases, this other thing increases'', that's as interpretable as you can get, at least with classification models.
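To make the constraint Serg describes concrete, here is a minimal sketch using XGBoost's monotone_constraints parameter on an invented entrance-exam dataset; a value of 1 forces the prediction to be non-decreasing in that feature, and 0 leaves a feature unconstrained.

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(42)

# Invented data: passing correlates with the entrance score, but the middle
# of the score range is sparse, as Serg describes.
entrance = np.concatenate([rng.uniform(0, 40, 450),
                           rng.uniform(40, 60, 100),   # sparse middle
                           rng.uniform(60, 100, 450)])
study_hours = rng.uniform(0, 20, 1000)
passed = (entrance + 2 * study_hours + rng.normal(0, 15, 1000) > 70).astype(int)
X = np.column_stack([entrance, study_hours])

# Constrain predictions to be non-decreasing in entrance score (feature 0);
# leave study_hours (feature 1) unconstrained.
model = XGBClassifier(monotone_constraints="(1,0)", n_estimators=200)
model.fit(X, passed)

# Sweep entrance score with study_hours held fixed: the constraint guarantees
# a higher entrance score never lowers the predicted pass probability.
grid = np.column_stack([np.linspace(0, 100, 11), np.full(11, 10.0)])
probs = model.predict_proba(grid)[:, 1]
assert np.all(np.diff(probs) >= -1e-9)
print(np.round(probs, 3))
```

Without the constraint, the sparse middle of the score range is exactly where a tree ensemble tends to learn the small counterintuitive dips he mentions.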
Jonas Christensen 56:02
Great. So, in summary, there are lots of dimensions to pay attention to. Listeners, I really encourage you to check out Serg's book, because it is such an important topic, as we've highlighted. And it's only going to become more important as machine learning starts impacting more of our lives across many industries, and as the regulation that will follow arrives - you're seeing it coming here and there. These things are going to become more critical as a skill set for someone to master. Serg, it's been a really, really interesting conversation so far, but we are towards the end, and I have just two questions left for you. The first one is: I'm going to ask you to pay it forward and tell us who you would like to see as the next guest on Leaders of Analytics, and why.
Serg Masís 56:54
Okay, I nominate Mikiko Bazeley. She's a rock star - there's pretty much no other word for it - in the MLOps and ML engineering space. I think it's a very important space. It's not my cup of tea, because it is really complex, but it's really important. There are just so many things to consider in that space: deploying models, maintaining them, observability. So many different details. Mikiko is a wealth of knowledge in that space. And she has a super interesting background - the way she went from social sciences into data science. It's a very interesting story, and I think your listeners would enjoy it as well.
Jonas Christensen 57:43
Great suggestion, thank you. And Mikiko will definitely be hearing from me. So last question, Serg. Where can people find out more about you and get a hold of your content?
Serg Masís 57:55
I'm on LinkedIn, and I'm also on Twitter. I think I'm the only person with my name on both, so it shouldn't be hard to find me there. Also, if you go to www.serg.ai - that's my personal website - you'll find a blog there and more details about me as well. And my book is on Amazon. You can search for me there too; I don't think there's another one of me there. If you search my name, you'll find my book. That's how you can find me.
Jonas Christensen 58:32
That's great, Serg, and we'll put all of that in the show notes for you, listeners, so you can check it out there as well. Serg Masís, thank you so much for being on Leaders of Analytics today. It has been really, really interesting to learn about quite a few topics. We covered your entrepreneurial journey; you told us a bit about how someone might find their entry into data science in a roundabout way; we learned about agriculture and the science behind it, which is becoming increasingly complex, and how important it is; and then also about your book. It's been a big tour of different topics. And I really thank you for your contribution to the data science community in general, because you are someone who produces a lot of content - in books, in written form, and also in the videos I've seen online. So thank you for that, and all the best for your future journey ahead.
Serg Masís 59:26
Yeah