Reuse, Collaboration & Deployment | Version Control, Data Lake, Repeatability
Our first article, Data Matters, focused on “The 4 V’s of Big Data” and how everyone can benefit from being more data-driven. In subsequent articles, we listed ways people are extracting value from big data, both through Data Visualization & Advanced Analytics and through Modeling & Machine Learning. Now we would like to tie all of these concepts together by looking at Reuse, Collaboration & Deployment.
Let’s start at the end: deployment. If you’ve been following our series of articles, you know there are some great ways you can extract value from big data. Smart data visualization locates the most informative data and analyses, then presents you with a wide array of relevant visualizations. Machine learning allows you to step beyond this and ask questions like “what if?” and “what next?” to gain further insights. Ultimately, turning insights into tangible value requires some type of action. In this sense, deployment is the most important type of collaboration – people and systems working together to extract business value based on the data.
Let’s look at some common deployment configurations:
The most common deployment of data visualization, analytics and modeling projects is the good old static report, which captures a snapshot of insights from a fixed dataset. But more and more people are reading documents on connected devices, which presents some new opportunities. The same visuals, analyses and models can now be included in living documents and storyboards connected to dynamic datasets, so the reader can periodically revisit them to see how their insights change with new data.
Sometimes the data-driven insights change continuously with every new sample, or gaining insight requires a greater level of interaction with the end user. Just like the dashboard of a car, real-time charts, gauges and other visuals can be used in visual dashboards to highlight key performance indicators in dynamic datasets. In addition, interactivity allows end users to drill in and filter to gain greater insight. With the increasing availability of web page widgets, the lines are being blurred between dashboards and full-scale web applications.
Machine learning is really good at creating detectors, classifiers and predictors which can be used to automate data-driven decisions in real time. These decisions can be deployed as agents, constantly monitoring new data as it is collected. If action is required, these agents can alert users through dashboards, emails or text messages, or they can be configured to fire callbacks on webhooks to automate decisions directly.
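As a rough illustration of the agent idea, here is a minimal Python sketch. It assumes the simplest possible detector (a fixed threshold) in place of a trained model, and the names ThresholdAgent, limit and on_alert are hypothetical. The callback stands in for whatever the agent would really do, such as sending an email or calling a webhook.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class ThresholdAgent:
    """Hypothetical agent: flags any sample that exceeds a fixed limit."""
    limit: float
    on_alert: Callable[[int, float], None]  # stand-in for an email or webhook call

    def monitor(self, samples: Iterable[float]) -> int:
        """Check each new sample and fire the callback when action is required."""
        alerts = 0
        for index, value in enumerate(samples):
            if value > self.limit:
                self.on_alert(index, value)
                alerts += 1
        return alerts


# Collect alerts in a list instead of notifying anyone (illustrative data only).
alerts_seen: List[Tuple[int, float]] = []
agent = ThresholdAgent(limit=80.0, on_alert=lambda i, v: alerts_seen.append((i, v)))
alert_count = agent.monitor([72.5, 79.9, 84.1, 60.0, 91.3])
```

In a real deployment the threshold test would be replaced by a call to the trained detector or classifier, and the samples would arrive from a stream rather than a list.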
Optimizers are the ultimate extension of a model. Instead of just forecasting the output of a system, optimizers actively search the inputs that can be manipulated for settings that keep the system running optimally and within its operating constraints. Optimizers can be connected to webhooks for actuation, but they are typically tightly integrated with third-party applications using a web Application Programming Interface (API).
Documents and dashboards are largely visual deployments and are often shared as standalone web applications, while agents and optimizers tend to be embedded as background services in larger applications which incorporate a variety of technologies in well-defined user experience workflows.
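To make the optimizer idea concrete, the sketch below searches a small grid of input settings and keeps the feasible one with the best predicted output. Everything in it is an assumption made for illustration: predicted_yield is a made-up quadratic standing in for a trained model, and the temperature and speed limits are invented operating constraints. Production optimizers usually rely on more capable search methods than an exhaustive grid.

```python
from itertools import product
from typing import Iterable, Tuple


def predicted_yield(temperature: float, speed: float) -> float:
    """Stand-in for a trained model: a made-up response peaking at (180, 40)."""
    return 100.0 - 0.01 * (temperature - 180.0) ** 2 - 0.05 * (speed - 40.0) ** 2


def within_constraints(temperature: float, speed: float) -> bool:
    """Hypothetical operating limits for the process."""
    return 150.0 <= temperature <= 175.0 and 20.0 <= speed <= 60.0


def grid_search(temperatures: Iterable[float], speeds: Iterable[float]) -> Tuple[float, float]:
    """Return the feasible (temperature, speed) pair with the highest prediction."""
    feasible = [
        (t, s) for t, s in product(temperatures, speeds) if within_constraints(t, s)
    ]
    return max(feasible, key=lambda settings: predicted_yield(*settings))


temperatures = [float(t) for t in range(140, 201, 5)]
speeds = [float(s) for s in range(10, 71, 5)]
best_settings = grid_search(temperatures, speeds)
```

Note that the unconstrained optimum (a temperature of 180) lies outside the allowed range, so the search settles on the closest feasible setting rather than the model's theoretical peak.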
It All Starts With the Data
Having covered what we are deploying, let’s go back to the beginning and discuss reuse and collaboration. Previously, we listed the 4 V’s of Big Data: Volume, Velocity, Variety and Veracity. In other words, with big data – there is a lot of it, there is more of it coming every day, it takes many different forms and you need to be able to trust it. Ideally, we would like one repository to be maintained as the single “point of truth” – a place where everyone who needs to access the data can securely extract value from it without making personal copies or verifying it over and over again.
This type of repository is sometimes referred to as a “data lake” – a location where all of the raw data, or at least metadata about it, is pooled together, cleaned, validated and indexed, ready to be securely found and consumed by everyone who needs it, no matter how big it gets. Because of the “Variety” nature of big data, a data lake has to handle data no matter what form it takes – whether it’s numbers or text or images or sounds or anything else. And because of the “Veracity” requirements of big data, a data lake needs to establish and maintain trust in the data and securely control access to it.
Integrating a data lake with a big data analytics and machine learning platform brings additional challenges. Because of the “Volume” nature of big data, the cluster processing needs to be tightly integrated, allowing the math to move to the data, since the data is too big to be moved to the math. Because of the “Velocity” nature of big data, the data lake needs to support streaming applications with edge analytics and control over what data comes to rest, where and when.
Perhaps the most interesting thing about a big data analytics and machine learning platform is that it keeps making more data. Every calculation that is written becomes a new data point. When you create a model, agent or optimizer, or even a library, dashboard or document, these all become more types of data that can be combined with your data lake for you and other people to use and reuse. In the end, you want a platform where everything is organized for collaboration and reuse, where the entire lineage is maintained all the way from the raw data to everything that is derived from it.
The Cycle of Discovery
The interesting thing about data discovery is – it never really stops. With big data, there is always more data to analyze and there are almost always new ideas that pop up based on every new insight. In this sense, deployment is not so much an endpoint as a milestone. It’s the physical realization of the value you have found so far, while you continue to look for more.
For example, let’s say you make Gizmos. You decided to dig into your factory production data and smart visualizations highlighted some inefficiencies in your processing. This led you to use machine learning to model that part of the process and identify some potential causes of these inefficiencies. From there, you adjusted your process and created some dashboards for your team to better monitor your processing. You may have even created some agents to prevent the bigger problems from happening again.
At this point, two things are happening – one, you’re pretty happy with the money you’re saving by improving your processes. And two, your data is continuing to evolve, especially after these changes. For both of these reasons, you continue to analyze your latest data and develop new data-based ideas to further improve your process.
And then you realize the power of another V – Versioning.
When you are continuously creating value, you need a way to draw a line in the sand each time you gain a new insight. The most obvious example of this is modeling. Let’s say you create a model of your processes based on your most recent data and you use that model in some deployed applications. Later on, you update your model based on new data. Not only do you need to identify which model was built on which data, but you also need some type of gatekeeper to ensure that only the best model is released to the production system.
Data Provenance is the key to any big data production system. It is the ability to identify which data and processes led to which insights and analyses so that you can accurately reproduce any current or previous results. Combined with versioning, it allows you to incrementally create new solutions to issues based on your ever-changing data. This concept doesn’t end with your data and models, however. Every asset you build as part of a production system requires versioning and provenance support.
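The gatekeeper idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: ModelRegistry and ModelVersion are hypothetical names, the snapshot labels and scores are invented, and "best" is taken to mean the highest validation score measured on the same held-out data for every candidate.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ModelVersion:
    version: int
    data_snapshot: str       # which data the model was built on
    validation_score: float  # assumed comparable across versions


class ModelRegistry:
    """Hypothetical registry that versions models and gates production releases."""

    def __init__(self) -> None:
        self.versions: Dict[int, ModelVersion] = {}
        self.production: Optional[ModelVersion] = None

    def register(self, data_snapshot: str, validation_score: float) -> ModelVersion:
        """Record a new model version; promote it only if it beats production."""
        candidate = ModelVersion(len(self.versions) + 1, data_snapshot, validation_score)
        self.versions[candidate.version] = candidate
        if (self.production is None
                or candidate.validation_score > self.production.validation_score):
            self.production = candidate
        return candidate


registry = ModelRegistry()
registry.register("snapshot-A", 0.81)
registry.register("snapshot-B", 0.78)   # worse than production, so not promoted
production_after_second = registry.production.version
registry.register("snapshot-C", 0.86)   # better, so it becomes production
production_after_third = registry.production.version
```

Every version stays in the registry even when it is not promoted, so the link between each model and the data it was built on is never lost.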
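One way to picture a provenance record is as a fingerprint of the input data stored alongside the ordered list of processing steps and the result they produced. The Python sketch below is a simplified illustration, not a description of any particular platform: the step names, the STEPS table and the record layout are all assumptions, and the fingerprint is a SHA-256 hash of the data serialized as JSON.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

# Hypothetical catalog of named, reusable processing steps.
STEPS: Dict[str, Callable[[List[float]], Any]] = {
    "drop_negatives": lambda values: [v for v in values if v >= 0],
    "mean": lambda values: sum(values) / len(values),
}


def fingerprint(payload: Any) -> str:
    """Content hash used to identify exactly which data was analyzed."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def run_pipeline(raw_data: List[float], step_names: List[str]) -> Dict[str, Any]:
    """Apply the named steps in order and return the result with its lineage."""
    value: Any = raw_data
    for name in step_names:
        value = STEPS[name](value)
    return {
        "data_fingerprint": fingerprint(raw_data),
        "steps": list(step_names),
        "result": value,
    }


def reproduces(record: Dict[str, Any], raw_data: List[float]) -> bool:
    """Check that the same data and steps yield the recorded result again."""
    if fingerprint(raw_data) != record["data_fingerprint"]:
        return False
    return run_pipeline(raw_data, record["steps"])["result"] == record["result"]


raw = [4.0, -1.0, 6.0, 8.0]
record = run_pipeline(raw, ["drop_negatives", "mean"])
same_data_ok = reproduces(record, raw)
other_data_ok = reproduces(record, [4.0, 6.0, 8.0])
```

The second check fails even though the cleaned values are identical, because the record is tied to the raw data it started from rather than to any intermediate result.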
Collaboration and Reuse
We’ve mentioned before that everyone needs to be more data driven, and those with different roles in an organization can contribute in different ways. Citizen Data Scientists frequently collaborate both with people who know more about the processes behind the data (Business Domain Experts) and people who know more about the processing behind the data (Data Scientists).
In the above example, you probably aren’t producing Gizmos by yourself. The people on the front lines may have their own ideas for things they could investigate to create more value. These could be driven by the dashboards you created or smart visualizations and models they create from the data themselves. Similarly, you may want to pull in the big guns and get a data scientist to find new ways to look at your data.
This collaboration revolves around the ability of these individuals not only to share data and ideas but also to build on what each other creates. Anytime someone discovers a relevant new visualization or model, they should be able to make it available to everyone on the team to use and improve upon. And this is true not just of specific visualizations of specific data. Ideally, any user should be able to take a concept and apply it as a template to other datasets or processes.
This expanded Cycle of Discovery makes a versioning and asset management system even more important. At any given time, any number of data explorers could be using or improving different visuals, analyses and models. And through data provenance, each of these people needs to know which version of the data and tools they are using so they can maintain and reproduce their results.