Oliver Schabenberger, Chief Innovation Officer at SingleStore, sat down with David Yakobovitch, publisher of the HumAIn Podcast and Head of Education and Training at SingleStore, to discuss the role data science plays in making breakthrough innovation possible.
Listen to the full episode, or read the transcript below.
Welcome to our newest season of HumAIn podcast in 2021. HumAIn is your first look at the startups and industry titans that are leading and disrupting ML and AI, data science, developer tools, and technical education. I am your host David Yakobovitch, and this is HumAIn. If you like this episode, remember to subscribe and leave a review. Now on to our show.
Welcome everyone back to the HumAIn podcast. Today, we're going to be talking all about data, analytics, decisions, and intelligence with the Chief Innovation Officer of SingleStore. Presenting to you today, Oliver Schabenberger. Oliver today is the Chief Innovation Officer at SingleStore, where we are building the database of choice for data-intensive applications, fast applications, and fast analytics. Oliver, thanks so much for joining us on the show.
David, thank you for having me. I'm delighted to be here. I'm passionate about data. Passionate about machine learning, artificial intelligence, analytics, data intensive applications. We have a lot to cover. I'm looking forward to it.
Absolutely. It's a very fast moving industry, and just to start off our listeners, can you share with everyone a little bit about yourself and why you're excited today to build in the industry around machine learning and data?
Well, my love for data started many, many years ago, decades ago, actually. I was a forester and I specialized in the area of forestry that focused on getting information and insight from data collected about trees, about forests, and making predictions, and that turned me on to the discipline of statistics. I became a professor or associate professor in statistics at Virginia Tech, and after a few years there, I decided to join SAS Institute as a software developer.
I wanted to contribute to the building of technology, analytic software. I was doing that for 19 years, actually, and grew through the technical ranks at SAS to an executive role. I built some high-performance analytics environments, helped that company build products around advanced analytics, AI and machine learning, and joined SingleStore because this is such an exciting area and exciting era and time to be working with data.
Everything is digitally transformed. It's a huge buzzword, of course, but it's also very, very real, and we experience it in our own lives. I'm a voracious reader of books, but everything I consume now is in digital form. I love making music. I love playing music, but I started making music through digital means. My pandemic hobby was recording music, and you can experience the power of this digital transformation in our world turning into bits and bytes, turning into data.
So now this is the raw material. How do we get value from it? How do we turn this into decisions? How do we drive businesses with it? Those opportunities are so tremendous, and being at the heart of this data evolution, this revolution, is just very exciting, as is building on great technology where I can see the future on the next hill and beyond the next hill. That really excites me, and that's why I'm here today.
I love it. And as you mentioned, during the pandemic we have spent so much time with our devices and the data on these devices. In fact, I was spending 16 hours a day just glued to my device, and only recently started getting to spend time seeing the physical world. But whether we're living in the digital or physical world, data is everywhere. And as you've mentioned, you've seen a lot of the trends and the changes and the connection between AI, machine learning, analytics and data. What are some of the connections and insights that you've seen throughout your career, Oliver?
As I mentioned, David, I came from a statistical background, and there's a certain way of thinking about data. For a statistician, data has been generated by some process, some process that involves randomness, that has a stochastic nature, and the data I work with is a realization of that.
So, if I want to understand the data, it is not just to build a mental model, but a probabilistic model of how the data came about, and once I accept that model as a good abstraction, then I use the model to ask questions about the world. Can I test the hypothesis of whether treatment A or treatment B in a clinical trial is superior to the other? Can I predict something?
And so the model is at the heart of it. An abstraction of the world. And George Box is quoted as saying, “All models are wrong, but some are useful,” which has then been misinterpreted sometimes as all models, statistical or scientific models, are questionable. But really what he meant was we abstract for reasoning, because we think the abstraction is helpful to take away the noise and see the essence of the problem we're studying. And once we have that essence, we can draw more informed conclusions.
We saw during the pandemic, of course, that all the models that we thought were correct, did change. Digital transformation changed everything and, as you have shared before, every organization that had five-year plans suddenly had to accelerate that in 2020.
Yes. We found that many of the assumptions that are going into our established models and established thinking about industries, about supply chains, had to be questioned. Because those models were not built on incorporating situations like, “Okay, I'm in the transportation industry. What happens if nobody travels for the next 12 months?”
That was just not part of the decision fabric. And so that called into question a lot of the thinking and our models. Financial instruments that you would assume would always be liquid all of a sudden were not. So it's a really important learning experience also for us to understand scenario modeling, and not just making a prediction. In order to support decisions, maybe I can have a decision envelope that tells me the worst-case, the best-case, and the most likely scenario, and that guides my decisions moving forward.
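As an illustration of the decision envelope idea, here is a minimal sketch in Python. It is not something discussed on the show: the function names, the passenger numbers, and the 10% chance of a severe disruption are all made-up assumptions, and the median is used as a simple stand-in for the "most likely" scenario.

```python
import random

def decision_envelope(simulate, n_runs=10_000, seed=0):
    """Run a scenario simulation many times and summarize the spread.

    `simulate` is any function that takes a random generator and
    returns one possible outcome (hypothetical interface).
    """
    rng = random.Random(seed)
    outcomes = sorted(simulate(rng) for _ in range(n_runs))

    def percentile(p):
        # Simple empirical percentile on the sorted outcomes.
        return outcomes[min(n_runs - 1, int(p * n_runs))]

    return {
        "worst_case": percentile(0.05),   # 5th percentile
        "most_likely": percentile(0.50),  # median as a stand-in for "most likely"
        "best_case": percentile(0.95),    # 95th percentile
    }

def annual_passengers(rng):
    """Hypothetical transportation scenario with invented numbers."""
    baseline = 1_000_000
    if rng.random() < 0.10:
        # Assumed rare shock: almost nobody travels this year.
        demand_factor = rng.uniform(0.0, 0.3)
    else:
        # Assumed normal year: demand varies around the baseline.
        demand_factor = rng.gauss(1.0, 0.1)
    return baseline * max(demand_factor, 0.0)

envelope = decision_envelope(annual_passengers)
for name, value in envelope.items():
    print(f"{name}: {value:,.0f} passengers")
```

A single point forecast would report something close to the baseline. The envelope also shows the low end that comes from the disruption scenario, which is the kind of case the established models had left out.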
I remember I was preparing for an interview, a media interview, in March last year, March 2020. We had just started to get more and more forecast predictions for COVID impact, COVID deaths in the United States, and a new number had come out that day. That day, the number was 80,000 deaths, and it was shocking to us. Wow.
Looking back, we know the number of COVID deaths is much higher, unfortunately, but at the time that was still a very important and useful prediction. We can say the model was wrong and predicted lower than what happened, but it was useful in that it helped us inform policy.
So models are not always bad if they're not a hundred percent accurate. They need to be able to guide the decisions and they need to provide the right abstractions. But we also have a different approach and different thinking now, and that's where machine learning comes in. I have a problem to solve and that problem depends on data. Okay, I need to make a prediction or classification, or I want to visualize something, or I need to find a pattern. Well, I have a plethora of tools, and I can go about asking which tool is the best to use in this situation for this set of data. So it doesn't put a probabilistic model as its foundation.
It is, if you will, a more practical approach: what is the best tool to use to solve that problem? And it's where those disciplines really would benefit a little bit from coming together. On one side, we have an approach steeped in mathematical statistics and probability theory. On the other side, it's a more computationally driven approach, and it shows how computer science, as a discipline, changed its focus from a focus on compute to a focus on data, and that's a very good thing.
Today in applications, we see a similar change in focus. We call it data-intensive applications. What are those? Those are applications where not the CPU, not the compute, is the constraining factor, but data is becoming the constraining factor. Whether it’s the size of the data, the volume, the velocity with which it’s moving, the type of data.
You know, during the rise of big data, it was not just that there was now more data than we had to deal with before. It was that organizations were seeing data in a different form - click stream data, behavioral data - that is not the typical data that you would store in a system of record, and how do you adjust to it?
And there's new information. There are new patterns in the data that I can detect and, what happens if I missed that opportunity and others take advantage of it? So it was also about being able to consume different forms of data in different ways and making new decisions on it, making predictions in a different way and recommending in a different way.
There's so many parts of the data science workflow, as you've mentioned, rightfully so, Oliver, that developer tools are at the heart of the matter. Whether you're thinking of analytics, artificial intelligence, or machine learning, they apply in many different ways to enterprises. In fact, last year I had on two founders of leading explainable AI and experimental design startups. One is Gideon Mendels, from Comet ML, looking at different experimentation, tracking the weights and biases on models to make them explainable.
I also had featured on the show Chris Van Pelt. Chris Van Pelt is a founder of Weights & Biases, looking also at model tracking. There's so many ends of the data science workflow that enterprises are thinking about solving for their data-intensive applications. And the challenge that we're thinking about today is how do you build these solutions? How do you democratize them when you're in an enterprise that wants to apply applied artificial intelligence or applied machine learning?
Very good question. We see, as you mentioned, a plethora of tools, and with that comes the difficulty of stitching things together and making multiple tools, multiple languages, multiple data architectures work together. It almost feels like we grew up in different worlds a little bit in the data world. We have transactional systems that record the ongoing business. Those shall not do any analytics, because analytics are done in different systems that are purpose-built for that, to enable fast scans, and they are not good at doing transactions.
Now a third world comes along, machine learning and data science. We need a data lake for that, and to me these separations feel a little bit artificial. They are somewhat based on the technology we've built in the past that was purpose-built for a certain use case, and what we're really seeing is the use cases coming together. For example, a transactional system now needs to not just record the state of the business; it needs to respond. The system of record becomes a system of response. So if I interact with a customer, recording their transaction, an order entry or a fulfillment record, yes, that absolutely needs to be recorded.
But the important thing might also be that I trigger the next best action decision: should I provide that customer a discount, or do I need to use that information now for real-time pricing? So it's a response to the transaction that now becomes part of it, and those responses invariably are analytical. At the same time, analytical systems want to go faster. They want to respond in more real time, as the transaction happens. Machine learning cannot just be the development of clever models in absence of how these models get operationalized and are used in the organization.
So these worlds need to come together, and it starts with a fabric. It starts with a data foundation, because they all depend on data, where the workloads can all converge. So what I'm excited about at SingleStore is that we can serve all these use cases rather than building these separately as silos and empires that need to be connected. Breaking down the silos is much more efficient.
You mentioned explainable AI. There are a lot of questions about these models when I build them. Whether we call them AI models or other business logic, these are just algorithmic decisions that are being made. The input is data. Some pipeline, some transformation happens. Math is applied. An outcome is produced.
I might predict, or I might assign a customer to a segment, or I might calculate an optimal price, or I might update a credit score or calculate the probability that a transaction is fraudulent. What are the risks associated with those decisions? So if we don't understand well how the algorithm comes to its conclusion, it's very difficult to incorporate these tools into decision-making.
The question of explainability, fairness, and transparency of algorithms is not new. We've used algorithms for decades. What is new today is we're using algorithmic decisioning at a much greater scale with much greater speed, and we're using models that are much more complex. When I worked in statistics, if I built a model that had 50 or 100 parameters, it was big.
Still, I could find ways to relatively easily manipulate the model and understand: if I change this variable by that many units, that goes up by one; or, in a different zip code, what is my prediction? Although it's not necessarily good practice to change these variables independently, as if they are not related to each other. But if you think about large systems, we build neural networks today with millions and millions and billions of parameters.
We can't just look at those and say: Okay, now we understand how the model works. We need different tools that inform us about the impact, and we try to glean how the systems react and make decisions. I believe in my heart of hearts that we're just in one of those periods where technology evolves, and that we will get to a different class of models that probably will explain itself more easily. One that has more contextual understanding, rather than, for example, a computer vision model just looking at pixels and, based on pixels, declaring something.
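For the small statistical models mentioned above, the "change one variable and watch the prediction" probing can be sketched in a few lines of Python. This is only an illustration with assumptions throughout: the linear pricing model, its coefficients, the feature names, and the zip codes are all invented for the example.

```python
def predict_price(features):
    """Hypothetical small linear model with hand-picked coefficients."""
    coefficients = {"square_feet": 150.0, "bedrooms": 10_000.0}
    zip_offsets = {"24060": 0.0, "27513": 25_000.0}  # arbitrary example values
    linear_part = sum(coefficients[name] * features[name] for name in coefficients)
    return 50_000.0 + linear_part + zip_offsets[features["zip_code"]]

def one_at_a_time(model, base, name, new_value):
    """Change a single input and report how much the prediction moves.

    Caveat from the conversation: varying inputs independently ignores
    that real variables are often related to each other.
    """
    changed = {**base, name: new_value}
    return model(changed) - model(base)

base_case = {"square_feet": 1_500, "bedrooms": 3, "zip_code": "24060"}

# Effect of 100 more square feet, everything else held fixed.
print(one_at_a_time(predict_price, base_case, "square_feet", 1_600))

# Effect of the same house in a different zip code.
print(one_at_a_time(predict_price, base_case, "zip_code", "27513"))
```

With a handful of parameters, each effect can be read off directly. With millions or billions of parameters there is no such direct reading, which is why different explanation tools are needed.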
When I think about data you're so right. I got my career started in actuarial science. So I think much on a similar level working in spreadsheets and some of the foundational programming languages with parameters for very structured data. And today, a lot of organizations are expanding from structured data to unstructured data like geospatial data, social media data, video data, sound data, which is providing a lot of new opportunities and complexities on how to tell the story or how to do that processing. What have you seen around this unstructured data boom that we're seeing today in the industry?
We've experienced an explosion of neural network technology over the last 10, 15 years. It has been amazing to watch: an amazing increase in capabilities, in performance, in reliability and accuracy. It was 2011 when, for the first time, a human interpreter was bested in the ImageNet classification: hold up an image, and then within a couple of seconds you have to identify the objects in the image. That was a watershed moment. Now, what was so important, though, was the convergence of three things.
None of them were really new, but they came together to give us those capabilities. It was neural network technology, but we were building much deeper networks than we had done in the past. It was the availability of big compute and cloud computing that made that possible. GPUs made that possible for the neural networks. So we had big compute to solve much deeper problems, and we had large amounts of data to train those models as well. And it's really those three that brought the sea change in capabilities. And it was the difference between a natural language interaction system, the chatbot, being kind of, “That's not working well,” and, “Yes, I can see how you can productize this.”
And since then we've seen this technology be game changing in many industries: autonomous driving, natural language understanding, computer vision. So what's different about those applications compared to AI applications in the past? The modern data-driven AI and machine learning applications are very good at recognizing patterns, and we train them to recognize patterns. So in computer vision algorithms we might feed thousands, hundreds of thousands, millions of images. Images on which we declare the ground truth: this is a tomato, this is a cat, this is a dog. And through repetition, the neural network, once it is trained, learns to match patterns against the labels we provide.
So it's a pattern matching exercise, and these algorithms got very good at it. Now the neural network will not understand, and will not tell you, that an image is an animal or an image is a cat because it recognizes fur, eyes, whiskers and ears. It just recognizes pixels. So when we think ahead to what the next generation of those models might look like, they might be more contextual or built up from individual component models. Component models for a four-legged animal, for ears, for eyes, for whiskers. And then when it puts it together, it says, “Oh, I think this is a really ugly dog.” “No, it's a cat. What made you think it's an ugly dog?” “It's because this is what I put together.” And then you can interact with the system and understand how it derives its conclusion, and then you can correct it. Right now, we're not really very good at correcting these neural networks trained on large amounts of data to detect patterns.
Other than, “Okay, I'm trying to figure out what pixels are lighting up for you, what areas of the image seem important,” and maybe providing images that offer a counterpoint to this or emphasize this so you can learn better. But the capabilities it has provided us in the meantime are just staggeringly amazing.
And that's where understanding the limitations of the technology, what it can and cannot do, is also very calming. We talked about machines coming for us, it was around 2017, Terminator style: the world's coming to an end and machines are subjugating us. I guess that's not going to happen, but they're going to come for our jobs. We'll let it happen. But oh no, they're going to augment our jobs.
When we look at careers, as you mentioned, looking at actuaries, statisticians, econometricians, and data analysts, this has evolved into data scientists, data engineers, machine learning engineers, AI specialists, site reliability engineers, and the list goes on.
Oliver, how do you see the roles of some of these jobs, like data scientists and others changing in the future?
On one end there's data, and on the other end are the decisions and intelligence we derive from data, and the intermediary, the connecting tissue, is analytics, all forms of analytics, taking a very broad definition. So machine learning, advanced analytics, statistics, econometric-type analysis, and visualization are all in there. That's the toolbox, and artificial intelligence is simply what we do with it. Artificial intelligence means you build a computerized system that does a job or a task, or makes a decision, that a human could make.
So, what I see is we have more tools in our toolbox. If you do your taxes with tax software, you use an AI system. That is a computerized system that encapsulates and captures the work a human, a CPA, can do, but you choose to use it in machine code, through software. It's an AI system, but it's a different AI system from the ones that we're building today with neural networks. It's one that captures expert logic, and the systems today are good at capturing the world. They're good at sensing the world; that's why we're seeing them in computer vision, in writing, and in natural language understanding. The role of the data scientist today has a lot to do with building models and building the decision logic and building the algorithmic logic based on which we can make decisions.
A lot of the time is still spent in working with the data, massaging the data, cleaning data, preparing the data. So you have the right data in order to do the modeling exercise. I truly believe, and that goes back to the democratization of analytics, as you mentioned earlier, David, we need to empower all of us to work with data and to contribute to driving the world with data and driving the world with models more.
We cannot rely on the very top of the skill pyramid, on the data scientists to do that. I mean, we need more training, we need to be more data literate. But it also means we need better tooling that allows low-code and no-code contributions using the right building blocks. I should be able to build an application that uses AI without having to be the one to train and understand how to train an AI model.
There should be components and elements I can use to build more easily. So our applications mean that we need to have analytics and AI at the heart of it, and that is why data-intensive applications are intelligent applications. We need to make sure that analytics is part of how an application works. So some of that needs to be moved more into the internals, closer to the data, rather than higher up at the application layer.
The data scientist of the future is really the person who orchestrates that data-driven decisioning for an organization rather than just building the individual models and working with the data. A lot of that will be part of automation, but orchestrating this, operationalizing it, making sure that the organization actually makes the right decisions based on models, has the right metrics, and incorporates models in the right way,
that is important. And maybe the chief data and the chief analytics officers are the true chief information officers of the future, because they're working with the information that is driving the business. How do you go from data-driven to model-driven? Very few organizations have made that step. Some digital natives are born there. In a model-driven organization, the model actually is the competitive advantage: that you have a better recommendation engine, that you have a better pricing engine. So building those and operationalizing them is going to be important. So the future of data science is decision science.
Let's dive deeper into those models. As you mentioned, 2011 is when we saw those initial breakthroughs with ImageNet and computer vision. We've seen in the last few years significant research with OpenAI and GPT-3 around natural language processing, natural language understanding, and transfer learning in text-based design. Though these technologies have often been closed source and kept under the hood until recently. Today, with OpenAI, anyone can practice with GPT-3 and create self-correcting texts, to be your own text generator. What's your take, Oliver, on the democratization of these technologies to basically make data science a service for anyone, even if they don't have the code capabilities?
Absolutely. That's one way of democratizing the use of technology and making it more widely available. There could be proprietary elements to it at the core, but there could be open interfaces to use it. It is what I can build with it that matters, not whether it is open all the way through to the engine, but how I can consume it. If you think about Memorial Day coming up, people will be traveling, taking photos with their cameras. Think of the quality of the images that we capture today. We're not professional photographers, most of us. But it rivals what a professional photographer could do a decade or 20 years ago with much more expensive equipment.
We have technology at our disposal now that makes us not just consumers of technology, but producers. Look at the content on YouTube, the quality of what's out there that we all can do. We are becoming, and that's actually a term, prosumers: we consume and produce at the same time. And data should be the same way. We should be able to produce what we need based on data, not just consume. Even on the consumption side, we're not doing enough of it and we're not making it easy enough to consume data.
Here's a good example: a dashboard. Now, many dashboards are a bit slow and load slowly. It's three, five minutes for the dashboard to come up. When it's finally up, then I look at the information: Okay. Now I need to ask questions first. Why? Okay. Why are these data points the way they are? Why do I see a trend? I drill in, I have to ask the question. I have to do the drill down and I have to dig deeper and dig deeper in the dashboard. Wouldn't it be great if the relationship would be reversed a little bit and the data would tell me what I need to know, when I need to know it?
So the data becomes much more active rather than sitting in a place to store the data. And when I drill in, it knows about the business and it tells me what I need to pay attention to.
That's a different relationship with data, but it requires, also, different architectures and different tooling. But moving towards that more people literate technology that helps us this way, it's a very exciting development.
As we're continuing to dive deeper into analytics, decisions and intelligence, the entire space here brings up the need for the role of the chief innovation officer. Today, Oliver, you serve as the Chief Innovation Officer for SingleStore, the database of choice for data-intensive applications. In your experience, why is a chief innovation officer so essential for growth, especially product-led growth of the company?
Technology is moving very, very fast. So how do you evolve your technology so that you remain relevant, remain successful, and can grow in a very competitive space, a very interesting space? I also believe we're just at the beginning of the beginning of much. It's incredible what we can do today compared to 10, 15 years ago with our AI technologies. But it's the beginning of the beginning. So, where are we taking this? You have features on roadmaps, you have the six months, 12 months, 18 months view of what you're building. What is the 36-month view or the 48-month view? I've been around software long enough to know how long it takes to make big changes.
To change, to flip the rudder and go in a different direction, it takes time. There has to be room for discovery, room for exploration, room for research, room for vetting of technology. While at the same time, staying connected to the core of what you do and what you're best at. So the idea of innovation is key to success in technology. Innovation is about turning creativity and curiosity into value, and value has to be tied to the core of what we do, the core of the business, the core of what our customer needs. Innovation to me is not about chasing shiny objects, it's about deriving value for our customers. But it is also about helping the customers, leading them and being led by them, to what's beyond the next hill.
So when I think about databases, I'm thinking about: OK, what's beyond storing data? How does the system of record evolve? How does a system of record become a system of response? What are the changes that technology needs to empower? If this is truly the direction, what are the opportunities at the edge, and how does this work, for example, in a managed service environment?
So those are things where you explore or you discover, you put your thinking hat on, but you don't necessarily want to create a whole new class of products. You want to connect it to what customers need today and where they're going. If you'd asked anyone 20 years ago whether we would have the conversations about AI that we're having today, and the impact the AI technologies are having, few would have predicted where we are.
So how do you build technology that is agile, that is flexible, that is able to adjust to these trends? It's key then to understand the leanings and urgencies of technology in general. Following those, two important ones I see are connectivity and automation. I believe a lot of the things that we're building and we'll be building in the data space or other spaces are in support of increasing connectivity and increasing automation.
Some could think of AI as actually automation of automation. And then you start getting guideposts and you start to get direction on where you see innovation is going. The second element of innovation is about culture. Innovation is not just about the product, it's about culture. A company that values and nurtures creativity and curiosity and thinking and discovery and exploration.
That's what excites me about this role at SingleStore and working with these incredibly talented individuals that we have, whether it is in technology, in marketing, in sales, all over the company. And how can we bring this thinking of continuous improvement, and be creative about getting better? It's not about beating a competitor. It's about us getting better at what we do, us serving our customers better and always asking the questions. What should I do today? Am I better at it than I was yesterday? Am I better than two months ago or better than six months ago? Because if the answer is yes, we're heading in the right direction.
So, looking at those guideposts for your vision and hopes for the future of SingleStore at 36, 48 months out, could you tease for our audience anything that you're excited about for the future of SingleStore?
Very excited. We are at an incredibly important part of the data space. As I mentioned, analytical applications need to go faster and need to become more real-time, and transactional systems need to become more analytical. It is all coming to us, to where we have engineered and built technology. This thinking of low latency, high concurrency, it is so in our DNA. But it's at the heart of a lot of the friction today in data and analytics that organizations felt like they had to build separate systems for use cases. A lot of the constraints were about underlying technology. They were about the compute technology, the storage technology, the availability and speed of the network, the number of cores on a chip.
These constraints are increasingly going away. Now initially, we thought memory was going to be ubiquitous and will be cheaper and cheaper. It's actually storage and networking that evolved this way. So if those compute constraints go away, what are the new constraints? The new constraints are around the data itself.
The data you have and the data you collect, the velocity and the volume, and how quickly you can ingest. What you can do with the information when you ingest, how you can take that and turn the data into decisions and drive the business: that is our specialty. So having that incredible database, I suppose, to build those data-intensive applications, to serve those data-intensive applications that are constrained by data, that is an absolutely spectacular place to be in.
That's where I just want to hint at the things that we're interested in: extending the capabilities of a database to drive intelligent decisions in organizations, large and small. When scale doesn't matter, when size doesn't matter, when any data can be processed, whether it's simple business logic, an expert system that is coded expert logic, or a machine learning system that has been trained on data itself, that is the future that I see, where this all comes together.
And so it's about just incorporating and making that connection between data and intelligence and measuring ourselves on how we are making better decisions. And better can mean many things. It could mean faster decisions. It could be more accurate, and be more reliable. It could be decisions that are more economic. Could be decisions that lead to less natural resource consumption that are better for the world.
I would love to be in the position to say that the work we've done improves lives through better decisions.
Today on HumAIn has been a conversation about connections, data, analytics, decisions, and intelligence with SingleStore Chief Innovation Officer, Oliver Schabenberger. Oliver, thanks so much for joining us on the show.
Thank you David, for having me. Always delightful.
Thank you for listening to this episode of the HumAIn podcast. Did the episode measure up to your thoughts on ML and AI, data science, developer tools, and technical education? Share your thoughts with me at Humainpodcast.com/contact. Remember to share this episode with a friend, subscribe and leave a review, and listen for more episodes of HumAIn.