Maximizing value in small-scale data science projects

In our latest blog post, my colleague Arash Toyser wrote about data and its relation to value using an intuitive and informative analogy of gold and goldmines, making the point that raw data is the goldmine and the extracted insights are the actual gold. Continuing the analogy, one can argue that gold in itself has no inherent value. A pile of gold nuggets only carries the value that the market is prepared to offer at any given point in time, and this value is only realized when the nuggets are actually traded for something else.

In addition, the ease with which the gold can be mined and shipped to the market has a direct impact on the profit that can be expected from the whole enterprise. Well-chosen, efficient tooling is therefore of utmost importance.

If gold is insights, we can define value as actions. Not just any actions, though, but those taken based on the extracted insights. Creating real, tangible value from data involves both extracting actionable insights and making decisions based on them! Knowing from data that something can be improved, fixed, or optimized, but not acting on that insight, does not create any value in itself.

Looking at today’s industrial data landscape, many companies are at the beginning of a journey whose goal is to turn their collected data into an advantage in the race against their competitors. For big companies, this can involve several teams with a multitude of competences: data scientists, data engineers, machine learning engineers, software developers, software architects, DevOps engineers, and so on. All of them are involved in this journey because each knows how to solve some piece of the puzzle of realizing value from data. In such a setting, the data scientist usually takes on the role of the gold prospector, creating a Proof-of-Concept (PoC) that demonstrates the gold-producing potential of the current mine, and leaving most of the challenges of constructing and operating the mine, and the logistical chain supporting it, to others in the team.

But what does it look like for other companies? Companies that perhaps are not big enough, do not have the necessary competence, or for which software development and data analytics are not a core part of the business? In these companies, it is not uncommon for one or a few data scientists to have to figure out every part of the journey, as well as the related tooling. They need to be the gold prospector, the mining engineer, the logistics expert, and the operator, all in one.

From a data scientist’s perspective this is a daunting scope to take on, for two main reasons: the breadth of competence required and a mismatch in expectations.

A typical data scientist has solid knowledge of statistics, machine learning algorithms, and tools for exploring, visualizing, and analyzing data sets. In addition, most data scientists have, to a varying degree, programming experience. These are all critical skills for producing PoCs. But when the time comes for the next step, turning the PoC into a production-ready application, the competence requirements widen immensely. Knowledge of areas as diverse as hardware infrastructure, model versioning, access management, scalability, logging, REST APIs, runtime environments, testing frameworks, and CI/CD becomes important. In practice, neither these skill sets nor the tooling used to exercise them overlap much with those needed for the PoC, which creates a large mental overhead.

Besides, a PoC carries very different expectations depending on who the consumer is. The data scientist who produced the PoC knows that it required plenty of manual data cleaning, that it consists of 90% spaghetti code, and that it runs only in the development environment on their laptop. It serves its purpose of demonstrating the potential well. The customer, on the other hand, might believe that the PoC is in fact an almost finished application that can be integrated directly into their existing IT system.

Scenarios like these are more common than one might think, especially when the customer is a company with little or no experience of software development and data analysis. It is not far-fetched to think that, at best, quality takes a hit, or that, at worst, the project simply gets cancelled for being too costly.

It becomes easy to see why the current failure rate of data science projects is as high as almost 90% according to some recent studies. Obviously, the full 90% can’t be attributed to the challenges described above, but they certainly don’t help.

Finding your ideal Minimum Viable Platform

As I said, it is not until we improve or optimize something that we realize the value of the gained knowledge. Accomplishing this requires correct and efficient tooling that enables users both to extract the insights and to act on them to maximize value. In a data science setting, the minimum viable platform can be thought of as the platform encompassing the minimum functionality required to take your PoC to a production-ready application.

In many cases the minimum viable platform is actually much smaller than first thought. Many of today’s data science projects would run comfortably on a typical virtual node, underpinned by a small set of consciously chosen supporting tools. Far from every project requires the full cloud experience, or a Kubernetes cluster running on multiple nodes and prepped with all the latest tooling put forward by Google, Netflix, Facebook, and the like. It’s possible to start with the basics and scale only if and when needed.

With companies running small-scale projects in mind, Viking Analytics developed Daeploy. It aims to significantly lower the threshold, in terms of required time and knowledge, for testing, deploying, and monitoring business logic algorithms such as machine learning models, numerical optimization routines, or sets of predefined rules. If this sounds like the right platform for your business, contact us today for a demo.

About The Author

Fredrik Olsson is a Senior Python Developer and Software Architect at Viking Analytics.