Jonas Christensen 2:48
Francesco Puppini, welcome to Leaders of Analytics. It is so good to have you on the show.
Francesco Puppini 2:56
Hello, Jonas, thank you for having me here.
Jonas Christensen 3:00
I am so excited about this episode. We are going to be talking about the Unified Star Schema, which I have just learned about from my good mate, Bill Inmon, and you and he are even better mates: you have written a book together, and we're going to hear all about it today. But the Unified Star Schema is your invention, and we're going to hear a lot about that in the next hour or so. But Francesco, before we get to that, we want to hear a little bit about you, your career background and what brought you to where you are today. So could you please tell us a little bit about yourself?
Francesco Puppini 3:36
Sure. Okay, about studies, I am actually an engineer in telecommunications. I studied at the University of Bologna, my hometown. And in my first job, I went to work for Nokia. Nokia mobile phones, but also Nokia networks. In my very first week, they said, "Okay, this week you will have a course on Business Objects". And I said, "Sure, what is that?" And they said, "Don't worry". And then I went to this course on the software that is now owned by SAP, SAP Business Objects. So I got into data in the very first week of my career, and I loved it. And I'm blessed that I had this beginning, because I love working with data much more than I was appreciating telecoms, and it's also a much wider domain. So I'm very happy that I ended up there. And since that first week, nothing has changed. I've always been in data, and I simply love it.
Jonas Christensen 4:27
Fantastic. And since then you have done lots of different things in the data space. Could you tell us a bit about how your career has evolved in data since that first week of learning Business Objects?
Francesco Puppini 4:40
Sure, not in too much detail, I will not do it week by week. I mean, I started with that. And already in my first project, I noticed that there were problems, people who were not happy. I was just writing reports in Business Objects, and people were not happy because some data was missing, some data was incorrect. And the data was in this weird thing called a data warehouse. And I thought, warehouses usually hold objects, not data. So that was my first encounter with the term data warehouse. And so something was not okay in the data source, or something was not okay in my coding, or something else. But there were always problems, to be honest. So over the years I continued doing this kind of work for one client after another, always working either for consultancy firms or as a freelance consultant myself. And jumping from client to client, I noticed the same problems repeating over and over. So one day I said, you know what, instead of working just on the front end, developing data solutions for business people, let me have a look at what happens in the data warehouse. So I went to work for this company called Teradata, which you probably know, a company that specialises in data warehouses. I went to work for them as a permanent employee to understand how the thing works. And I saw a lot of things. I saw the things that explained why a few things were not okay, and I also saw a lot of good things in Teradata. But yeah, I understood that the problem I needed to solve was in the back end of the data. So I spent some years with Teradata working with data warehouses. My nature is to always work in the front end, but now that I've had this experience in the back end of data, I understand much more. And somehow, when I'm thinking of a solution, I think about it with the mindset of the people who are preparing the data in the data warehouse, and that has helped a lot. And this brings us to today: I'm a consultant, and I actually got hired recently by a company called Astrato. Astrato is a BI tool that will be launched in the market; it's actually already there. We are building the core SQL engine, and I'm in the R&D team, and I'm enjoying it a lot. And here I am.
Jonas Christensen 6:56
Wonderful. And you have definitely learned a lot since those early days of learning Business Objects, because you are now the inventor of the Unified Star Schema. Now, Francesco, I admit that I'm not a data warehousing expert, but I have looked at what you've invented, and it looks really fascinating. I can see the speed and ease with which you can do things. I'm really interested in learning more about the technical aspects of this, and I'm sure the listeners of this podcast are equally interested. Can you tell us what the Unified Star Schema is and why we need it?
Francesco Puppini 7:38
Sure. Yeah, before that, I just want to tell you that this solution is not limited to data warehouses. I wrote a book with Bill Inmon; I was lucky enough to get in contact with him and get his attention. That's why the book is strongly focused on data warehouses, because that's the topic that made him famous. But in general, this is all about data preparation. And as you know, data preparation does not always happen in a data warehouse. You can have a data lake and some data scientists working in Python and R doing transformations of data. So the Unified Star Schema is definitely a technique that applies to transforming data in general. So what is it about? Well, very simply: data is there to be consumed. Sometimes I make an analogy with reading and writing. I usually tell a joke about the Italian carabinieri, the policemen: they always go around in pairs, because one is able to read and one is able to write. That's a famous joke that we have in Italy. So I always make this distinction: there is a moment when you write the data, and there is a moment when you need to read it back. And when you're reading data for analysis purposes, you're putting together data that comes from many tables. This task of stitching the tables together is sometimes painful. For example, when I'm speaking with people who work in data science and machine learning, they say that they hate spending so much time on it; they say 80% of the time is spent in data preparation. So in general, putting things together in one place so that you can focus on a particular topic that takes data from different tables is a problem that everyone has: in reporting, in analytics, in data warehousing, in machine learning. So I said, let's make things simpler. Let's help people put the tables together in a way that is less painful and makes life easier for everyone. That's what it is about at a high level.
Jonas Christensen 9:34
And how do you do that? What is it specifically that you do in the Unified Star Schema that's different to what we traditionally think of as a star schema?
Francesco Puppini 9:43
Well, I tried to think in a more general way. So what happens today, when you're working as a data practitioner, either in the business intelligence world as I always did, or as a data scientist? You are always given a task and you're focusing on that task. But the person who is sitting next to you is working on another task, and probably they are using the same tables, the same joins and everything. So I said, oh my god, there are so many people who are writing the same query. So how about having the mindset of a solution that acts as a foundation for helping multiple people at the same time? Instead of thinking of joining and putting together data for a specific task, how about designing a solution that puts together data for a generic purpose? So that when you have a specific task, you first go there and you check if it helps you. In, let's say, 80% of the cases it will, and in 20% of the cases it will not, or it will only do it partially, so you will still have to write some adjustments. But I wanted to put down a foundation that helps everyone who is dealing with data, and build something that is really general purpose.
Jonas Christensen 11:00
So Francesco, to really visualise it for listeners, because one of the hard things about a podcast is that you can't actually see any objects, you can only imagine them: can you paint the picture for us in words of how the Unified Star Schema works, and maybe give some examples of how it solves real business problems?
Francesco Puppini 11:22
Sure, yeah. I actually love it. To be honest with you, this is my very, very first podcast. So it's like people who do TV all the time and suddenly get invited to the radio. It's different, because there is no visual, but it's fun. So first of all I want to give you an image, which I usually show visually, and now I'll just describe it to you. It should be very easy to picture, because it has to do with beer. Imagine that you buy two cans of beer, and they represent the two tables of your dataset. If you want to put them together, you can either put them one next to each other, or you can put one above the other. And this is a bit the core principle of the Unified Star Schema: I had the intuition that data should not be merged by join, but instead it should be merged by union. The join is the operation that basically puts the first table next to the second table, meaning that the columns are placed next to each other. While when you do the union, you're not putting columns next to each other, but you're putting rows underneath each other. And this is a very different operation, where things do not line up immediately. But then later, with the operation of aggregation, the one that in SQL you do with a GROUP BY clause, things do line up. And the advantage of this operation is that it is more general purpose. As a visualisation, you can imagine that you open your tool for exploring data, and instead of having a lot of places where you go to pick up your things, you have just one place where there is one magic table that is called the bridge, or some people call it the Puppini Bridge, which is basically what I invented. And whatever you want to connect: if you have two tables, you don't connect them to each other directly, but you connect them to this giant hub that is already designed for connecting any table with any table. So if you have a domain of 200 tables, you call me, I go to your company, and I will build this Puppini Bridge, and then your 200 tables will magically all be compatible with each other, with a very natural drag and drop, without you needing to know which column is connected to which column, primary key, foreign key. You don't need to do that anymore. The Puppini Bridge is doing the job, and everything is much easier then.
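To make the two-cans picture concrete, here is a minimal sketch in Python with pandas; the table names and numbers are invented for illustration. It shows the two ways of merging Francesco describes: the join places columns side by side, while the union stacks rows and a later aggregation (the SQL GROUP BY) lines the figures back up.

import pandas as pd

# Two toy tables standing in for two source tables of a dataset.
sales = pd.DataFrame({"client_id": [1, 1, 2], "sales_amount": [100, 50, 80]})
returns = pd.DataFrame({"client_id": [1, 2], "return_amount": [30, 10]})

# Merge by join: the columns of one table are placed next to the other's.
joined = sales.merge(returns, on="client_id", how="outer")

# Merge by union: rows are stacked underneath each other, like one beer can
# on top of the other; non-matching columns are simply left empty (NaN).
stacked = pd.concat([sales, returns], ignore_index=True)

# Aggregation is what lines things up afterwards (GROUP BY in SQL terms).
print(stacked.groupby("client_id")[["sales_amount", "return_amount"]].sum())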
Jonas Christensen 13:56
Uh huh. So I'm interested in breaking this down a little more. As a user of data, I have always understood the nice structure of data in a data warehouse with a star schema, where you have different facts and dimensions in different tables. But I've also found it very frustrating to have to link tables up, going through one table to reach the next, because there's no direct link, only an indirect link through another table. That's something that's a bit frustrating when you're joining data, which is kind of what you're describing. So I'm hoping you've got a solution here for all of us that's much faster than that, and that's what it sounds like. Can I come up with an example here, and then you can tell us how it works? Let's say we have two tables somewhere in our data warehouse. One is the customer table, so it has everything about the customer, and the other one is the product table, and as you said, you can join the customer and the product by a key. And maybe we add a third table: sales of products, with a count of the number of units sold. So you have the sales, you have which product was sold, linked to the product table, and you have the customer that bought the product, linked to the customer table, and you can join these up. When you're doing the union, you're saying you're stacking the tables on top of each other, which intuitively gives you a mess, because the columns aren't the same. But you have a solution for this in the Unified Star Schema. Is that correct?
Francesco Puppini 15:34
Yeah, totally. Okay, so let me get back to you on this example. First of all, if you have these two tables, customers and products, you will notice, and I challenge you, that they will not have the key of each other. The table of customers will not have the key of products, and the table of products will not have the key of customers. Why is that? Because these two entities exist independently. Imagine you go to a website and you sign up. When you submit your registration, this will populate one new row in the table that internally they will call clients, or customers, or users. So you are stored in a row, but at that moment they don't know yet what you will buy. And at the same time, when an employee of the company is adding a new product, they add a new row, but they don't put a customer ID, because they don't know who will buy that product. So what is the relation between customer and product? Well, there isn't only one, and that's why I'm here with the solution. Because if the only relation was that a customer buys a product, then life would be super simple and we would not even need the Unified Star Schema, because you basically have the sales table that connects the customer and the product. But in the real world, sometimes you're not happy with the product that you buy, so you want to return it. So there is a second table. The first table is called sales, and it is called a fact table, because basically it's telling us a story: the customer bought the product. But then there is a second fact table of returns, where a customer is sending the product back to the company. So how is that handled? Well, today what happens is that you have two separate star schemas: one will be called the sales star schema, and the other will be called the returns star schema. And they are easy to use, you shouldn't be frustrated: you just take your sales table and then you join it easily to customers. Yeah, sometimes the key can be called customer ID, sometimes client ID, but you find your way. And then you connect products, and you have your collection of data and you can play around. Then later, when you want to analyse the dataset of returns, you do the same with the table of returns: you connect the customer and you connect the product. But the problem is: what did Kimball tell us? Kimball, you know, the father of the star schema approach, what he called dimensional modelling. He didn't really tell us how to merge two star schemas. So what if you want to make an analysis of sales and returns together? Basically, you have two fact tables. If you go through the books of Kimball, he basically tells you: don't worry, use Cognos, use Business Objects, and the tool will do the magic. It doesn't work like that. The reality is that when you're using these tools, which I know very well, things can get merged together only if you as a user know what to do and what not to do. So that's where my help comes in, because I want to teach, I want to help people to make queries that involve multiple star schemas at the same time. And how do I do it? Well, with the Unified Star Schema. So instead of having many star schemas, one centred on sales, one centred on returns, one centred on shipments, you have just one Unified Star Schema, where in the middle you have this Puppini Bridge table.
And all around, you have a table called sales, a table called returns, a table called clients, a table called products, and also a table called warehouse, a table called finance, a table called cash flow. So basically you can put around this Puppini Bridge tables which do not necessarily talk to each other, because your cash flow has nothing to do with the warehouse. And when that is the case, nothing breaks; it simply gives no matching results, but it gives you the two datasets independently. Things become interesting when two things actually are related to each other. So say you want to see sales and returns and the relation they have with each other: you connect them both to the Puppini Bridge, and the Puppini Bridge does the job of putting them virtually one under the other. And somehow, later, the magic happens with the aggregations. So basically, you don't need to worry: you just drag the sales amount, the purchase amount, the shipment amount, the return amount, all these figures which normally belong to different fact tables, and with the Puppini Bridge they all become compatible with the dimensions that they share. Of course, not all dimensions are shared with all fact tables; if you have a table of purchases from your suppliers, it has no connection with your clients, but again, that is all handled. The Puppini Bridge is basically a matrix of keys, and when I prepare it, I create all the combinations that can fit together. Somehow, you don't need to worry anymore, and whenever you want to look at multiple fact tables at the same time, with the Puppini Bridge it is much, much easier.
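As a rough illustration of that central hub, here is a small Python/pandas sketch, with invented table names and values, of a bridge built as a union of key rows, one "stage" per source table. This is a simplified rendering of the idea, not the exact design from the book: the measures stay in sales and returns, and both reach the shared clients dimension only through the bridge.

import pandas as pd

# Invented toy data: two fact tables and one shared dimension.
sales = pd.DataFrame({"sale_id": ["s10", "s11", "s12"], "client_id": [1, 1, 2],
                      "product_id": ["A", "B", "A"], "sales_amount": [100, 50, 80]})
returns = pd.DataFrame({"return_id": ["r90"], "sale_id": ["s11"], "client_id": [1],
                        "product_id": ["B"], "return_amount": [50]})
clients = pd.DataFrame({"client_id": [1, 2], "client_name": ["Acme", "Zenith"]})

# The bridge holds only keys: each table contributes one "stage" made of its
# own key plus the foreign keys it already carries.
bridge = pd.concat([
    sales[["sale_id", "client_id", "product_id"]].assign(stage="Sales"),
    returns[["return_id", "sale_id", "client_id", "product_id"]].assign(stage="Returns"),
], ignore_index=True)

# Measures stay in their original tables and meet the dimension via the bridge.
sales_by_client = (bridge[bridge["stage"] == "Sales"]
                   .merge(sales[["sale_id", "sales_amount"]], on="sale_id")
                   .merge(clients, on="client_id")
                   .groupby("client_name")["sales_amount"].sum())
returns_by_client = (bridge[bridge["stage"] == "Returns"]
                     .merge(returns[["return_id", "return_amount"]], on="return_id")
                     .merge(clients, on="client_id")
                     .groupby("client_name")["return_amount"].sum())
print(pd.concat([sales_by_client, returns_by_client], axis=1))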
Jonas Christensen 20:37
Aha, I think I can imagine this in my head. And you said something yourself here which is exactly what I'm imagining: that you basically have to create rows for all the combinations that can exist across all these tables. So am I right in saying that you potentially end up with quite a large bridging table that can be very wide, but also very long, depending on how many tables you're bridging? And how does that affect the performance of queries and so on?
Francesco Puppini 21:09
Okay, yes, it's a question that I receive a lot. You're correct in saying that the Puppini Bridge is big. But when you're thinking about combinations, I think you're taking it a bit too far, because in reality, if you have five tables, let's say, and each table has 1,000 rows, the size of the Puppini Bridge will be 1,000 plus 1,000 plus 1,000 plus 1,000 plus 1,000. So it will be 5,000 rows. It's just the sum of the number of rows of each table, which is still large, but nothing to do with combinations. Because the combinations are already given in those 1,000 rows that you have in the table of returns, for example: in the table of returns, you already have the key of your product, you have the key of your client, and you possibly also have the key of the sale that originated the process that is now being rolled back with a return. So the combinations are already given. What I'm doing is just putting them all together in a table. Plus, on top of that, I'm telling you that the Puppini Bridge is not necessarily one physical table. The nice thing about this invention is that you can see it like this: every table comes with what I call a Puppini Bridge stage. So if you have your table of sales, and it has 1,000 rows, what I do is build for you the table sales_puppini_bridge_stage, and that is basically a table that has the same number of rows as your table of sales, and it is there ready to be matched with something else. So the moment you want to compare sales and shipments, for example, you take the table of sales, you take the table of shipments, and then you take the sales Puppini Bridge stage and the shipments Puppini Bridge stage. The two tables that contain the keys, the Puppini Bridge stages, are unioned together, like two cans of beer, one above the other. And then you have your specific Puppini Bridge that is not holding the weight of all the other tables, which can be hundreds or thousands. You just have the keys of the two tables of your interest. And in this case, you are building a query that has the same size as the query that you would build today. And that's the magic of the Puppini Bridge.
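Here is a rough sketch of that "stage per table" idea in Python with pandas, again with invented table names and a hypothetical bridge_stage helper: each stage has exactly as many rows as its source table, and a given query unions only the stages it needs, so the bridge's size is the sum of the row counts rather than a combinatorial product.

import pandas as pd

def bridge_stage(df, own_key, foreign_keys, stage_name):
    """Key-only copy of df: same number of rows, own key + foreign keys + a stage label."""
    return df[[own_key, *foreign_keys]].assign(stage=stage_name)

# Invented toy tables.
sales = pd.DataFrame({"sale_id": [1, 2], "client_id": [10, 20], "sales_amount": [100, 80]})
shipments = pd.DataFrame({"shipment_id": [7, 8, 9], "sale_id": [1, 1, 2], "qty": [3, 2, 5]})

sales_stage = bridge_stage(sales, "sale_id", ["client_id"], "Sales")
shipments_stage = bridge_stage(shipments, "shipment_id", ["sale_id"], "Shipments")

# The bridge used by this particular query is just the union of the two
# relevant stages: len(sales) + len(shipments) rows, nothing combinatorial.
query_bridge = pd.concat([sales_stage, shipments_stage], ignore_index=True)
print(len(query_bridge), "==", len(sales) + len(shipments))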
Jonas Christensen 23:25
Aha, interesting. Okay, so you do have ways to solve potential performance issues and size issues. Francesco, when you describe this, it sounds kind of obvious. Two questions: How did you come up with this idea? And why has no one else come up with that idea until now?
Francesco Puppini 23:45
I love this question, I love it. And I have received this question a few times, but not many times, to be honest. And actually, that is the question: why did we have to wait until now? I mean, am I using a particular technology that did not exist 20 years ago? No. This Puppini Bridge could have been invented in the age of Ancient Greece, to be honest. It has nothing to do with the technology available. If I have to tell you why I invented it, and not someone 20 or 30 years ago, there are two things that come to my mind. The first is that I am a passionate, crazy Italian, and I'm always questioning everything. That's a bit of a trait of my character. When they tell me this is the way to do it, no matter if it's about tables or about, I don't know, fixing a bicycle, they tell me this is the way to do it, and I'm always saying: wait a minute, says who? Who is saying that this is the best way to do it? Let's check if there is another way, let's ask, let's investigate. So that's a bit of my thing. But I'm probably not the only one who has this attitude of always questioning and trying to improve things. The other thing that comes to my mind is that the Puppini Bridge is based on the union. And the union is an operation that in the SQL language is quite painful, because when you make the union of two tables, you need to manually add all those dummy columns that exist in one table and not in the other. So it's sort of painful. But when you are working with the Qlik language, which I was doing for many years, Qlik is another business intelligence tool which has its own language, the union is called concatenation. And that's super easy to do, because it handles for you all the creation of the dummy columns that are not matching. And the same thing happens in R and in Python. For example, in R you have a function called bind_rows, which comes from the dplyr library, and it does exactly the same job. So if you have two tables, one table has the columns A, B, C, and the other table has the columns A, B, D, what happens is that this bind_rows command basically piles up A with A, B with B, C with an empty half column, and D with an empty half column. So it's already creating these dummy columns automatically, and your syntax, when you're typing the command, is super easy. That happens in R, in Python, in Qlik, and in many other languages, but not in SQL. So probably I was lucky enough to be able to bring my experience from Qlik, where a solution already existed, called concatenated fact tables, which is exactly the union of two fact tables. I started from that, and I said, why not generalise this? And that's why now, in the 2020s, this solution has seen the light.
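For readers who know pandas rather than R or Qlik, the same column-alignment behaviour can be sketched like this (toy tables, invented values): one concat call does what a plain SQL UNION ALL would force you to spell out with dummy NULL columns.

import pandas as pd

# One table with columns A, B, C and another with columns A, B, D.
t1 = pd.DataFrame({"A": [1, 2], "B": ["x", "y"], "C": [10.0, 20.0]})
t2 = pd.DataFrame({"A": [3], "B": ["z"], "D": [True]})

# pandas (like R's bind_rows or Qlik's concatenation) piles A on A, B on B,
# and creates the half-empty C and D columns automatically. In plain SQL you
# would have to write the dummy columns yourself, e.g.
#   SELECT A, B, C, NULL AS D FROM t1
#   UNION ALL
#   SELECT A, B, NULL AS C, D FROM t2
stacked = pd.concat([t1, t2], ignore_index=True)
print(stacked)  # columns A, B, C, D with NaN/None where a table had no value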
Jonas Christensen 26:47
Very interesting. I'm actually fascinated by that, because when I was imagining what this table looks like, I was imagining exactly what you see when you export a table from Qlik into a data warehouse. I've seen exactly the same thing: you have lots of empty cells in a table that is not really joined together but is a big union of lots of rows. You have lots of blanks, but also lots of combinations, and it becomes very large and typically not very easy to query, slow for a data warehouse. But that's probably just how Qlik exports and stores the data.
Francesco Puppini 27:23
Exactly. That is actually the union of two tables. It's sort of weird, because then you have a big, giant spaghetti table, and the end user is confused because you have the column sales type and the column shipment type all together, and that's confusing. So the thing that I changed is that I am only doing the union of the keys, and the business users don't see that. So when the business users are looking for a column that has to do with sales, like sales payment type, they will find it in the table called sales. And when they look for a column that has to do with returns, or whatever, they will find it in the table of returns. So basically, I'm not changing the nature of the tables; all the tables that you had originally still remain there. What I union is the Puppini Bridge stages, and this is something that the business users don't see. So the whole dataset is super, super nice and clean, and everything is in the place where you expect it to be.
Jonas Christensen 28:25
I think we've got to try this at home. That sounds very interesting; I'll let you know how we go. Francesco, you would have had some frustrations; that's typically how things get invented, right? You have a frustration and problems that you're trying to solve. You're sitting there with a business problem or a challenge. Do you remember what it was that got you thinking: okay, this is how I should do it, this will solve a lot of my problems in my day to day?
Francesco Puppini 28:49
Yeah, but I mean, if you think about it, I wouldn't even call it frustration. When I was younger and I found a job, I felt very happy that I was solving problems for people, and I was actually much happier than at university. So I was not frustrated. I was seeing that things were broken and people were unhappy, and then I was going in as a superhero and fixing the problem, and that was fine. But then, if you look at the bigger picture, today the world has a lot of challenges. So my vision is that all the data experts, all the data practitioners, should focus on bigger problems. At the moment, in companies, there are millions of people, while I'm talking to you right now, who are just joining client ID with client ID. I mean, let's stop doing that. We have bigger challenges: we have pandemics, we have climate change. So it's not about my frustration at the beginning of my career. I was happy to join client ID with client ID, I was being paid for that, and people were saying thank you with a smile, so my life was okay. And I guess the data practitioners today who are doing this are also sort of okay. The point is, we need to go faster to solve problems. That's why I made this public. I shared my ideas with anyone who can improve them and expand them, because this is a project that should be seen as a global collaboration, a community thing. So I want people to take my ideas and say: Francesco, why did you not think about this? And I will say: great, I was waiting for you, let's add it. I want this solution, the Unified Star Schema, to be the foundation for spending time on bigger problems, and really tackling bigger problems, because this is what we need now. No one will lose their job with the Puppini Bridge; they will just do more interesting jobs.
Jonas Christensen 30:46
Yeah, that's fantastic. And I think, for listeners, that is really what Francesco is doing here: there's no patent or product sale attached to the Unified Star Schema. Anyone can do it, and the book that he and Bill Inmon have produced tells you exactly how to do it, and why. So all you've got to do is pick that up, have a read, and you can experiment for yourself, which I think is wonderful.
Francesco Puppini 31:11
Actually, there is a patent. I did that, because... no, no, yeah, there is. But listen, I did it because I wanted to avoid a large corporation making this solution their own. It's a sort of oxymoron: to make it available to everyone, I had to make it my own. I really want this to remain a phenomenon that happens as a community collaboration, instead of being taken over by a large corporation that makes a profit from it. This is a thing that we should all work on and improve.
Jonas Christensen 31:43
Yeah, good on you, Francesco. Okay, so I kind of got it right: there is a patent, but it is a patent for everyone to use. Beautiful. So, Francesco, where is this going to take us? If we look 5-10 years into the future, there's so much new demand for data science and AI across the world, in different industries, and the way we use data, the way we stretch the potential of different data types, is also going to evolve a lot. Where do you see the future for the use of something like the Unified Star Schema?
Francesco Puppini 32:19
Yeah, well, this is really the question. Okay, first of all, if we want to collaborate and we are in different places, I think that standards will play a key role. Let me give you an example. We all have in our local database a table called country, and we have a standard for it: the ISO alpha-2 country code, which is a two-letter code that identifies the country. And we have the same for many other things, such as time and dates; every database has a calendar. So I imagine databases being on the cloud, with some portions of the database, of course, being made public. For example, on Snowflake you can share a database with everyone; there is a data marketplace. So I imagine different organisations sharing their databases in a public environment, where things are identified by ISO codes. This way, the magic of merging databases that come from different companies, in different locations, in different businesses, can happen. So the more standards exist for identifying the common elements, the so-called conformed dimensions, the better. Country is easy, but do we have a standard for the regions of the world? Well, not really; everyone names the regions in a different way. You may have APAC, but across companies there are no standards. That's fine, but imagine there were. There must be a standard for chemical elements, for example; that of course exists, and many other standards too. So my vision is that companies who want to share knowledge with other companies, for a better world, will put on a cloud database all their fact tables and dimensions, and also other tables that are neither facts nor dimensions, because the domain is actually much wider, sharing the codes using the standard ISO codes. And this way, I imagine that I have my company, I have a fact table, and I want to match it with the countries from that other company. Every company simply makes available their table together with its connector, this Puppini Bridge stage. And so tables are magically able to collaborate, although they are in different databases from different companies. This way we can really think of a universal database of information that is shared, again on purpose, on a public database, and this allows the construction of a virtual, giant Puppini Bridge across all the databases of the world. Yeah, this is my dream.
Jonas Christensen 34:58
What a beautiful dream. A Puppini Bridge of the world, I like it.
Hi there, dear listener. I just want to quickly let you know that I have recently published a book with six other authors, called 'Demystifying AI for the Enterprise: A Playbook for Business Value & Digital Transformation'. If you'd like to learn more about the book, then head over to www.leadersofanalytics.com/ai. Now back to the show.
Francesco, we're sort of close to the end here. Is there anything that you would like to get across about the Unified Star Schema or anything else in this space that you haven't had a chance to say yet?
Francesco Puppini 35:43
Well, I mean, of course there are many things, examples, but I think that you were really asking questions that go straight to the point, including the usual perplexities, like: isn't it too big, and so on? You really covered the frequently asked questions that I normally get. So this is pretty much the basic picture. And then, of course, I'm very happy to be contacted, so that with each individual and organisation we can share ideas and move forward.
Jonas Christensen 36:13
Yeah, that's wonderful. And I know you do go to businesses and build star schemas and Puppini Bridges now and then. Actually, the last question I will ask is: what kind of results have you seen for organisations using this technique? What are the outcomes that you see?
Francesco Puppini 36:27
Well, there are two, I would say. The main result is the so-called democratisation of information. Today, a business user who wants to access data can, in theory, use self-service analytics tools. And they are really brilliant tools, I work for one of them, and then there are the classics like Tableau and others; they all exist and they are good. But that's not sufficient. I challenge you: if today you go to an organisation and you tell a COO, "hey, go to the system and show me how you do your self-service analytics", well, they will write an email to a data analyst asking: please do this report for me. They never go to a blank Tableau sheet, or Power BI sheet, or Qlik sheet, or Astrato sheet, or anything, and say: okay, now I build my own query. That simply doesn't happen. So the first goal is this: let business users have real access to a self-service environment, where they just drag and drop columns and these columns magically match together. I have seen this fail more times than you can imagine. The idea of Business Objects was great, but then, in the companies, because there are different ways of joining tables, they were creating one universe for a particular use and another universe for another use. And then the business users were saying: okay, there are two places that are different, but I don't know what the difference is, I don't know which one to use, so I'll write an email to my data analyst. So this is the first achievement: making sure that everyone can go to a place where everything is connected together in a way that is really the most universal possible, and do self-service analytics. And the second is being able to do data science in a more consistent way. Also because, we didn't mention it in this call, but the numbers sometimes are wrong. What happens in organisations is that they have reports which do not match what is written in the data source. And you may say, okay, the data source may be wrong. Yeah, of course, but sometimes the data source is right, and the numbers get transformed in a way that breaks them. Numbers do break. So my second achievement is that I imagine a world where reporting is done, and machine learning models are created, where numbers are more reliable. This topic of the reliability of the numbers, numbers which do not get broken by creating joins, is the core outcome of the Unified Star Schema. Numbers always match the source, always, I guarantee. And if you have a problem with that, come to me; I will go to the home of every single person who tells me this, and I will show that the numbers always match the data source. And this, to me, is a really, really good achievement.
Jonas Christensen 39:24
And that's a guarantee for everyone there, if you were ever looking for one. And I think you've described what I can see as one of the main benefits of the Unified Star Schema: that it actually is a little bit foolproof. The situation you're describing, with a business user trying to get information out of the warehouse and joining different tables without knowing how a data warehouse is structured, the facts and dimensions and how you have to join them through different tables and match keys: the fact that they can, without even thinking about it, always join things together through one single bridging table just makes it so much easier. You just know that it will join. That really helps bridge the gap, if I may use the word bridge, between the novice user and the very experienced user who understands a lot about the data warehouse; it takes time to learn all that. Now, one last question I had. I must admit I forgot this question; I have been thinking about it for a while, and I want to ask it before we go. Francesco, you mentioned that one of the problems the USS solves is that there is this loss of data, but there can also be a duplication of data through table joining. Can you explain how those two things happen with normal joins, and why they don't happen when using the Unified Star Schema?
Francesco Puppini 40:56
Sure, well, yeah, I will explain it to you again with two cans of beer. But this time you go to the supermarket and you buy two cans of beer, one of 66 centilitres and one of 33. So the height of the two cans is different. Now imagine you have the bigger can and the smaller can. If you put them one above the other, you'll never have a problem. But if you put them one next to each other, well, the result is a sort of strange shape. Because, you know, when you join two tables, one table has 100 rows and the other has 80 rows, and when you join them together, the result must typically have either 80 rows or 100 rows. Tables almost never have the same number of rows, so when you join them together, they need to adapt to each other. What does that mean? Well, it means that either one of the two tables becomes smaller, or the other table becomes bigger. So imagine that you have a table of sales and a table of shipments. In general, for each sale you can have multiple shipments. So the table of sales has, let's say, 80 rows, and the table of shipments has 100 rows. If you want to see all the detail about the shipments, and you want to add information about the sales, then you do a join, and sales will have to become 100 rows. But if that table of sales contains numbers, those numbers are repeated. And that's what I call in the book the fan trap. Yeah, this is all written in my book, The Unified Star Schema. So that is called a fan trap, and it means that numbers are duplicated, and that's why, when you compute the total, the numbers no longer match the data source. Or you can do the other thing: you put your foot on a can and you make it smaller. This way, the can that was 100 will be only 80, and it will adapt to the other. But then, in that case, that table has become smaller, so it has lost detail. So either you're duplicating data, or you're losing data. And that's bad, because data should have neither duplications nor losses. This is solved with the union, because you're keeping the granularity of each table, the natural level of granularity that you would expect when you want to see the full detail. A full detail, of course, compatible with your system: if you have a table of a billion rows, of course you have to make it smaller for reporting, but that's a different topic. So you keep all the tables of your data domain at their native granularity, and this is the big achievement. Because when you have this, then you are sure that there are no losses of information and no duplications of information, hence no wrong numbers. This is the magic of doing the union instead of doing the join. And the result is that when you do the reports, the numbers are correct, everything is easier, and, yeah, it's a better world.
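A tiny numeric sketch of the fan trap in Python with pandas, with invented numbers: joining the sales table onto its multiple shipments repeats the sales amount, so the grand total no longer matches the source, while the union keeps each table at its native granularity and the total stays right.

import pandas as pd

sales = pd.DataFrame({"sale_id": [1, 2], "sales_amount": [100, 80]})        # true total: 180
shipments = pd.DataFrame({"shipment_id": [7, 8, 9], "sale_id": [1, 1, 2]})  # sale 1 ships twice

# Join: the amount of sale 1 is repeated on each of its two shipments.
joined = shipments.merge(sales, on="sale_id", how="left")
print(joined["sales_amount"].sum())   # 280 -> inflated by the fan trap

# Union: both tables keep their own rows, so nothing is duplicated or lost.
stacked = pd.concat([sales, shipments], ignore_index=True)
print(stacked["sales_amount"].sum())  # 180.0 -> matches the data source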
Jonas Christensen 43:48
Fantastic explanation. Thank you so much for that, very clear and super helpful for both me and the business, no doubt. Now Francesco, we're towards the end and I have two questions left for you. The first one is that I'm going to ask you to pay it forward: who would you like to see as the next guest on Leaders of Analytics, and why?
Francesco Puppini 44:09
Oh, yeah, sure. I see that you ask this question to every guest, from listening to the previous podcasts. And actually, I have a person in mind. His name is Serg Masis; he's the guy who wrote a book that has been very successful. He brought it out not more than a year ago, and the title of the book is 'Interpretable Machine Learning'. He's a guy I met at a data science conference in San Diego, and now he's a good friend. He was helping me prepare a conference for the data science community. And yeah, he's really passionate about his job; he's working in the agricultural sector, building machine learning models for improving and optimising agriculture, the crops and everything. So I definitely think that his knowledge should be shared through your channel.
Jonas Christensen 44:58
Brilliant. I will definitely be reaching out to Serg. And thank you for that. That is a brilliant recommendation. And we will get him on the show, no doubt. Lastly, where can people find out more about you and get a hold of your content?
Francesco Puppini 45:10
Oh, that's easy. I think the best way is to contact me on LinkedIn: Francesco Puppini. And yeah, I am the one wearing a tie. I have a funny picture of myself in San Francisco dressed in a suit; I never dress in a suit with a tie, I mean, it hasn't happened for a long time, but I sort of like having that as my profile picture on LinkedIn. So that's me, and LinkedIn is the starting point. I don't have a website; I like having common places where I can share things with others. On LinkedIn, for example, I have a group about the Unified Star Schema, so anyone can join it. And yeah, that's the starting point. Then I typically join a lot of webinars, so Google my name, and especially Google Puppini Bridge, because if you Google just Puppini, you will find my sister, she's a singer, but that's a different story. If you Google Puppini Bridge, you will find all sorts of videos. So people who also want a visual of the Unified Star Schema will definitely find on YouTube several webinars where I share practical demos in Tableau, where I show how to use it and what the benefits are. So yeah, YouTube and LinkedIn.
Jonas Christensen 46:16
Brilliant. And listeners, I do also recommend that you have a look at Francesco and Bill's book; I am reading through it at the moment to learn more about the Unified Star Schema. It is really interesting: equal parts history and general insight into how data warehousing works, as well as how the Unified Star Schema works in that world and makes it even better. So please do go and check that out. Francesco, thank you so much for being on Leaders of Analytics today. I have really enjoyed learning from you and understanding more about your innovation here, and I can't wait to try it in my own day to day. Thank you for contributing to the world with all you do, and all the best for the future.
Francesco Puppini 47:00
Thank you. It was fun, it was really fun. Thank you for making it happen, my first podcast. It was a pleasure talking to you. So let's keep in touch, and to everyone who listened to me: don't be shy, add me on LinkedIn. I like people, I like community, I like sharing. So let's all be united in tackling problems that can be solved with data.