Our first article, Data Matters, focused on “The 4 V’s of Big Data” and how everyone can be more data-driven. We listed three key ways that you can extract value from your data: Data Discovery, Data Analytics and Machine Learning. The first two were addressed in our second article, Data Visualization & Advanced Analytics. Now we will discuss the third key way to extract value: Modeling & Machine Learning.
Extracting Value with Machine Learning
You are probably already aware of what machine learning is, at least at a high level. Instead of software being pre-programmed with the relationships in your dataset, machine learning automatically models these relationships with little to no supervision.
Machine learning allows you to extract value in ways that visualizations and analytics cannot on their own. First, it lets the data drive the analysis. Beyond the relationships you already know in your data, or think you know, letting algorithms model the actual relationships can reveal new, previously unseen patterns. Second, having an accurate model of your data allows you to step beyond the “what happened” questions of descriptive analytics and ask the “what if” questions of predictive analytics, letting you peer into possible futures.
Here are some of the most popular ways machine learning is used to extract value:
Anomaly (outlier) detection is useful for detecting data points that do not fit the patterns in the rest of the data. It works by determining “what is normal” and then highlighting data that is “not normal”. It is great for identifying important things that happen infrequently. For example, readings from a sound sensor attached to machinery could be run through anomaly detection to determine, based on the sounds the machinery is making, whether it is no longer working efficiently, alerting you to a potential problem.
- Fraud detection
- Machine failure avoidance
- Medical diagnosis
- Removing noisy data
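As a minimal sketch of the “determine what is normal, then highlight what is not” idea, the Python snippet below flags readings that sit far from the average of a known-good baseline. The sensor values, the function name find_anomalies and the cut-off of 3 standard deviations are illustrative assumptions; real systems typically use richer models of “normal”.

```python
from statistics import mean, stdev

def find_anomalies(baseline, readings, threshold=3.0):
    """Return readings more than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [r for r in readings if abs(r - mu) > threshold * sigma]

# Hypothetical sound levels (dB) recorded while the machine was known to be healthy.
baseline = [70.1, 69.8, 70.4, 70.0, 69.9, 70.2, 70.3, 69.7]

# New readings to check; 78.5 is far outside the normal range.
new_readings = [70.0, 70.5, 78.5, 69.6]

print(find_anomalies(baseline, new_readings))  # [78.5]
```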
Classification models categorize new data based on previously identified labels. This is particularly useful when identifying the category labels by hand is prohibitively time consuming or expensive. Once trained, a classification model is able to efficiently label new datasets. For example, a medical imaging application could classify aspects of mammogram images to differentiate malignant from benign characteristics, helping determine whether more invasive procedures need to be performed.
- Medical diagnosis from symptoms
- Optical character recognition (OCR)
- Risk classification
- Target marketing
Clustering brings out the hidden similarities in your data. It differs from classification in that the groups are discovered from the data itself rather than taken from predefined categories. It is particularly useful for finding groups of data that have similar characteristics and therefore may behave in the same way. It is also great for getting a handle on big data by reducing it into chunks without arbitrary filtering. For example, an online advertiser could use clustering to partition their web visitors into groups based on demographics, which could then be used to better target the advertising content.
- Data summarization
- Market research/segmentation
- Gene identification
- Geographic hot-spot identification
Regression analysis allows you to create a model of the interactions behind your data. Like Classification, it’s based on previous observations. However, instead of predicting classes, it calculates likely values. Its “put this in, get this out” nature makes it perfect for evaluating new data and for “what-if” scenario testing. Like Classification, it can also be used for prediction. For example, a retail store could use a regression model to forecast daily sales based on factors such as day of week, holidays, the economy and weather. This information could then be used to manage staffing efficiently.
- Stock market prediction
- “What-if” scenario testing
- Sales & marketing analysis
- Complex function approximation
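To make the “put this in, get this out” idea concrete, here is a minimal sketch of a one-input regression fitted by ordinary least squares. The temperature and sales figures are made up for illustration, and the helper names fit_line and predict_sales are assumptions; a real forecast would use many more factors and data points.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: forecast high temperature (°C) and that day's sales.
temperatures = [10, 15, 20, 25, 30]
daily_sales = [1200, 1350, 1500, 1650, 1800]

slope, intercept = fit_line(temperatures, daily_sales)

def predict_sales(temperature):
    """Evaluate the fitted line for a new ("what-if") temperature."""
    return intercept + slope * temperature

print(predict_sales(22))  # 1560.0
```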
Machine learning can also help you go beyond predictions into prescriptive analytics, i.e. the recommendation of actions. Recommendation systems act as virtual assistants to the consumers of your data. Primarily used in end-user applications, they are a great way to share expert insights and highlight opportunities people might have missed. You have probably already encountered recommendations like this on services such as Netflix or Amazon when they suggest additional movies or products, but there are many other uses. For example, driver assistance systems in cars can warn a driver when braking may be needed and even go beyond recommendations to apply the brakes automatically when the situation warrants it.
- Upselling customers
- Highlighting opportunities
- Passive or active warning systems
- Call center dialog prompting
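One simple way to produce “customers like you also bought” suggestions is to score items by how much their buyers overlap with the current customer. The sketch below assumes a small, made-up purchase history and a hypothetical recommend function; production recommendation systems are considerably more sophisticated.

```python
from collections import Counter

def recommend(purchases, customer, top_n=2):
    """Suggest items owned by customers whose purchases overlap with `customer`."""
    owned = purchases[customer]
    scores = Counter()
    for other, items in purchases.items():
        if other == customer:
            continue
        overlap = len(owned & items)
        if overlap == 0:
            continue  # ignore customers with nothing in common
        for item in items - owned:
            scores[item] += overlap  # more shared items means a stronger vote
    return [item for item, _ in scores.most_common(top_n)]

# Hypothetical purchase history.
purchases = {
    "ana": {"tent", "stove"},
    "ben": {"tent", "stove", "lantern"},
    "cy": {"tent", "lantern", "boots"},
    "dee": {"novel"},
}

print(recommend(purchases, "ana"))  # ['lantern', 'boots']
```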
Process optimization is another branch of prescriptive analytics and is arguably the most tangible way to extract value from data. Essentially, it takes a mathematical, regression, or classification model and automates the “what-if” scenario testing process to find the overall best solution. This could be as direct as finding the factors that lead to a single best outcome, e.g. maximizing revenue, or as abstract as finding a solution that best balances positive and negative factors within a set of constraints, e.g. efficient production with minimal waste and maintenance. For huge systems like processing plants and large-scale operations, the overall model can be broken down into a hierarchical mixture of experts or control loops that each optimize a particular part of the process while remaining part of the whole.
- Factory/Production optimization
- Inventory/Ordering optimization
- Equipment/Resource Usage optimization
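Automating the “what-if” process can be as simple as evaluating a model for every candidate setting and keeping the best allowed one. The sketch below assumes a made-up revenue model and a hypothetical optimize helper; real process optimization usually involves many variables and smarter search strategies than trying every option.

```python
def revenue(price):
    """Hypothetical model: every extra dollar in price loses 20 customers."""
    demand = 1000 - 20 * price
    return price * demand

def optimize(model, candidates, is_allowed=lambda candidate: True):
    """Run the what-if test for every allowed candidate and return the best one."""
    allowed = [c for c in candidates if is_allowed(c)]
    return max(allowed, key=model)

prices = range(1, 51)

# Unconstrained: the single best price for revenue.
best_price = optimize(revenue, prices)

# Constrained: the best price that does not exceed a (hypothetical) cap of 20.
best_capped_price = optimize(revenue, prices, is_allowed=lambda p: p <= 20)

print(best_price, best_capped_price)  # 25 20
```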
Smart data visualizations and analytics often lead to questions best answered by machine learning. In turn, the results of machine learning are often presented through smart visualizations, leading to even more insights.
Machine Learning Algorithms
Machine learning and neural networks have been around in one form or another for over 60 years now. In fact, many of the buzzwords and new technologies being used today have been around in the literature for decades. What has changed is the volume of data and availability of processing power. Modern architectures and in-memory cloud processing have made it possible to process huge amounts of data in a multitude of ways to arrive at deeper inferences than ever before.
Here are some of the technologies used to extract value from data as described above:
K-Means Clustering
Efficient Linear Clustering
K-Means clustering partitions observations into groups around the nearest mean of the selected columns. In other words, given a target number of groups, it locates averages in your data so that each observation is as close as possible to the average of its group. As mentioned above, not only is this type of clustering an effective way to visualize similarities in your data, but it’s also a great way to prepare big data for other algorithms: representing each group by its average reduces the number of rows, effectively shrinking the problem size.
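The loop described above (assign each observation to its nearest average, then recompute the averages) can be sketched in a few lines of Python. The points are made up, and using the first k points as starting centers is a simplifying assumption so that the example is repeatable; practical implementations usually pick starting centers more carefully.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        for p in points
    ]

def kmeans(points, k, iterations=10):
    centers = list(points[:k])  # naive, deterministic starting centers
    for _ in range(iterations):
        labels = assign(points, centers)
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old center if a group ends up empty
                centers[i] = mean_point(members)
    return centers, assign(points, centers)

# Hypothetical observations with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centers, labels = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Here six rows are summarized by two centers, which is the row reduction mentioned above.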
Principal Component Analysis (PCA)
Effective Input Reduction
PCA is a powerful tool in machine learning that can be applied to reducing the number of inputs ahead of tasks such as classification. It uses an orthogonal transformation to convert a set of possibly correlated variables into a smaller set of linearly uncorrelated values called principal components. Put simply, it takes a set of columns and reduces them to a smaller number of new columns that keep the directions along which the data varies the most. This is also a great way to prepare big data for other algorithms since it can reduce the dimensionality of the columns, shrinking the problem size even more.
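For just two columns, the first principal component can be found with a closed-form angle, which makes for a compact sketch of turning two correlated columns into one. The data and the function name pca_2d_to_1d are illustrative assumptions, and the angle formula applies to the two-column case only; general-purpose PCA relies on an eigenvalue or singular value decomposition.

```python
import math

def pca_2d_to_1d(points):
    """Project two correlated columns onto their direction of greatest variance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    var_x = sum(x * x for x, _ in centered) / (n - 1)
    var_y = sum(y * y for _, y in centered) / (n - 1)
    cov_xy = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form orientation of the main axis of a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    direction = (math.cos(angle), math.sin(angle))

    # One new column replaces the two original columns.
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, scores

# Hypothetical data where the second column is always twice the first.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

direction, scores = pca_2d_to_1d(points)
print(direction)  # points along the line y = 2x
print(scores)     # a single column that preserves the spread of the data
```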
Logistic Regression
Fast, Simple Classification
Logistic regression is a well-known method in statistics that is used to predict the probability of an outcome and is particularly popular for classification tasks. Like a simple linear regression (slope + intercept) formula, it is essentially a function that fits its parameters to best represent all of the data. It is made more effective for boolean (true, false) values, represented as 1 and 0, by passing that formula through an S-shaped curve that keeps the output between 0 and 1 and describes the transition from false to true. The closer the output is to 1, the higher the probability it is true. This concept can then be extended to any Classification problem by comparing the probabilities for each individual class and choosing the most likely category.
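A minimal sketch of this idea, assuming a single input and plain gradient descent for the fitting step, is shown below. The study-hours data, the learning rate and the number of passes are made-up values chosen for illustration, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """S-shaped curve that squeezes any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit weight and bias by gradient descent on the average log loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(weight * x + bias) - y
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
    return weight, bias

# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

weight, bias = fit_logistic(hours, passed)

def predict_probability(x):
    return sigmoid(weight * x + bias)

print(round(predict_probability(1), 3), round(predict_probability(8), 3))
```

Outputs near 0 mean “probably false” and outputs near 1 mean “probably true”, so a cut-off such as 0.5 turns the probability into a class.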
Artificial Neural Networks
Detailed, Accurate Models
Artificial neural networks are learning algorithms loosely based on the way biological brains solve problems. Like linear and logistic regression, they map a set of inputs to one or more outputs by creating a function that best fits all of the data. However, they greatly expand on this concept by simultaneously training layers of interconnected weights to form vast functions typically far too complex to read and understand. Multilayer Perceptrons (MLPs) are one type of feedforward neural network. MLPs are well-suited to problems that simple linear and logistic regression models cannot fit.
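The classic example of a problem a single linear or logistic model cannot fit is XOR (“one or the other, but not both”). The sketch below shows only the forward pass of a tiny MLP with one hidden layer; the weights are hand-picked assumptions chosen to reproduce XOR, whereas a real network would learn its weights from data during training.

```python
def relu(value):
    return max(0.0, value)

def identity(value):
    return value

def dense(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum per neuron, then an activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hand-picked (not learned) weights for a 2-input, 2-hidden-neuron, 1-output network.
HIDDEN_WEIGHTS = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_BIASES = [0.0, -1.0]
OUTPUT_WEIGHTS = [[1.0, -2.0]]
OUTPUT_BIASES = [0.0]

def mlp_xor(x1, x2):
    hidden = dense([x1, x2], HIDDEN_WEIGHTS, HIDDEN_BIASES, relu)
    (output,) = dense(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES, identity)
    return output

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, mlp_xor(a, b))  # outputs 0.0, 1.0, 1.0, 0.0
```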
Deep Learning
Complex Inferences From Large Datasets
Deep learning is an exciting extension to the multi-layered approach introduced in neural networks which focuses on the learning of representations. Where classical neural networks might have used a few layers of similar processing to create a functional approximation between the inputs and output, deep learning often uses many layers of different algorithms, all adapting at the same time. This allows the network to learn how best to filter the data into representations it can use to map the inputs to the outputs through abstraction. A great example of this is image processing. Operations like convolution and max pooling can be used to create filters which adapt to parts of the image that are consistent between examples, regardless of where in the image those parts are located. These filters then feed into additional layers of adaptive filters and learning algorithms to arrive at a robust mapping from inputs to outputs.
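To illustrate why convolution followed by max pooling responds to a pattern regardless of where it appears, here is a minimal one-dimensional sketch. The signals and the filter values are made up, and the filter is fixed by hand; in a deep learning network the filter values would be learned, and images would use the two-dimensional versions of the same operations.

```python
def convolve1d(signal, kernel):
    """Slide the kernel along the signal and record how well each window matches."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def max_pool(values, size):
    """Keep only the strongest response in each block of `size` values."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]

kernel = [1, 2, 1]  # hand-picked filter that responds to a small "bump"

# The same bump appears early in one signal and late in the other.
early = [0, 1, 2, 1, 0, 0, 0, 0, 0, 0]
late = [0, 0, 0, 0, 0, 0, 1, 2, 1, 0]

early_response = convolve1d(early, kernel)  # [4, 6, 4, 1, 0, 0, 0, 0]
late_response = convolve1d(late, kernel)    # [0, 0, 0, 0, 1, 4, 6, 4]

print(max_pool(early_response, 4))  # [6, 0]
print(max_pool(late_response, 4))   # [0, 6]

# Pooling over the whole signal gives the same answer wherever the bump is.
print(max(early_response), max(late_response))  # 6 6
```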
Random Forests
Fast Consensus Approximation
Random forest algorithms take modeling of complex problems in a different direction. They are based on Decision Trees, which are essentially large graphs of divide-and-conquer questions like you would use when playing 20 Questions. Decision trees can handle both Classification and Regression by repeatedly splitting the data on simple questions and predicting a class or a value at the end of each branch. Random forests extend this concept by creating many different decision trees, each built from a random sample of the data. This collection is then used to generate a consensus result, allowing it to look at a problem from many perspectives at once.
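The sketch below illustrates the consensus idea with a deliberately tiny forest: each “tree” asks only a single question (a one-split decision stump), is trained on a random resample of the data, and looks at one randomly chosen column. The data, the function names and the use of stumps instead of full decision trees are simplifying assumptions for illustration.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature):
    """One-question tree: pick the threshold on `feature` with the fewest errors."""
    majority = Counter(labels).most_common(1)[0][0]
    best = (len(rows) + 1, None, majority, majority)  # errors, threshold, left, right
    values = sorted({row[feature] for row in rows})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
        right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best

    def predict(row):
        if threshold is None:  # the sample had only one distinct value
            return majority
        return left_label if row[feature] <= threshold else right_label

    return predict

def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: resample the rows with replacement.
        picks = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))  # each tree sees one random column
        trees.append(train_stump([rows[i] for i in picks], [labels[i] for i in picks], feature))
    return trees

def forest_predict(trees, row):
    """Consensus: every tree votes and the most common answer wins."""
    return Counter(tree(row) for tree in trees).most_common(1)[0][0]

# Hypothetical two-column data with two well-separated classes.
rows = [(1, 2), (2, 1), (3, 3), (2, 4), (7, 8), (8, 7), (9, 9), (10, 8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = train_forest(rows, labels)
print(forest_predict(forest, (2, 2)), forest_predict(forest, (9, 9)))
```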
Ensemble Models
Robust Consensuses of All Approaches
Like random forests, ensemble models take a group of many different models and form a consensus to arrive at the most robust answer. In fact, random forests are a type of ensemble model. However, the concept can be extended to form a consensus from many different types of models: random forests, deep learning networks, and more, all mixed together. And as with random forests, this lets ensemble models look at a problem from many different perspectives, allowing them to do a better job of balancing bias with variance.
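A consensus across different kinds of models can be sketched as a simple majority vote. The three small models below (a hand-set rule, a nearest-average model and a nearest-neighbor model) and the training values are illustrative assumptions standing in for the much larger models an ensemble would normally combine.

```python
from collections import Counter

# Hypothetical labeled measurements.
training_data = [(1.0, "low"), (2.0, "low"), (3.0, "low"),
                 (7.0, "high"), (8.0, "high"), (9.0, "high")]

def threshold_model(x):
    """A hand-set rule."""
    return "high" if x > 6.0 else "low"

def centroid_model(x):
    """Pick the label whose average training value is closest to x."""
    centroids = {}
    for label in {label for _, label in training_data}:
        values = [value for value, l in training_data if l == label]
        centroids[label] = sum(values) / len(values)
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def neighbor_model(x):
    """Copy the label of the single closest training value."""
    return min(training_data, key=lambda pair: abs(x - pair[0]))[1]

def ensemble_predict(x, models):
    """Each model votes; the most common answer is the consensus."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

models = [threshold_model, centroid_model, neighbor_model]

# At 5.5 the models disagree (low, high, high) and the vote settles it.
print(ensemble_predict(5.5, models))  # high
print(ensemble_predict(2.5, models))  # low
```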
Deep learning and ensemble models have especially benefited from the processing power made available by cloud computing, allowing these more complex and comprehensive algorithms to run in minutes or hours instead of days or weeks. Simultaneously, big data processing technologies have made it possible to apply these methods to much larger and more diverse datasets, providing even more insight.
Data scientists (professional and otherwise) will ideally have as many of these tools at their disposal as possible. Different problem types and different datasets respond better to different kinds of approaches. That’s why an intelligent combination of technology selection along with ensemble modeling is often the best approach.
Meet nD, again.
nD was created as a solution for big data problems. The nD interface and its easy-to-use wizards help you navigate complex concepts like visualizations, analytics and machine learning by reducing them to simple concepts, so you can quickly and intuitively extract the value in your data.
The nD Model Wizard does this by reducing machine learning to a simple question of which type of analysis you would like to perform, like Anomaly Detection, Classification or Optimization. From there, it intelligently selects and distributes the best modeling algorithms for your data and type of analysis across its cloud computing architecture to efficiently find the best ensemble model from multiple algorithms. Then, the results are presented using nD’s powerful smart visualizations so you can intuitively find value in your data no matter how deeply it is hidden.
The nD Pipeline Architecture
Extensible data processing is a core concept in nD. More than just a “black box” or “one size fits all” platform, nD is designed so it can be extended, adapted and customized to meet your specific needs. As with smart visualizations and advanced analytics, nD’s approach to machine learning is built for extensibility using the powerful nD Math Language. This allows data scientists and researchers to extend and adapt the preprocessing and machine learning algorithms to meet their specific needs. But beyond that, the intelligence behind selecting those algorithms is also extensible.
nD Pipelines are processing algorithms built from the nD Math Language to describe how to select and adapt the right preprocessing and machine learning for any given data or analysis. The nD platform has many built-in pipelines to cover a wide variety of questions, and like all other aspects of nD they are designed to be customized and extended. This allows your team to take nD even further, enabling you to apply industry-specific know-how to nD’s intelligence.
Above, we mentioned your team. Sometimes, you’re the only person working with a particular set of data. However, more often than not, you want to collaborate with other people. This could be from the start, somewhere in the middle after some initial investigation has occurred, or closer to the end, taking advantage of decisions and visualizations created during data discovery.
The final article in our blog series addresses this expanded view of data discovery with Collaboration & Deployment in the Cloud.