Buddha once said, "To reach Enlightenment, you must turn data into insight and insight into action". Ok, he didn't say that, but Knowi can help you blend hindsight with foresight and drive actions from your data.
Currently, Knowi supports Classification, Regression and Time-Series Anomaly Detection type Machine Learning use cases, with clustering and deep learning coming soon. We also have a data preparation wizard that will guide you through the steps necessary to clean your data prior to any supervised modeling activities.
Anomaly detection is often used to identify unusual patterns that do not conform to expected behavior (called outliers). There are many applications in business, from intrusion detection to system health monitoring and from fraud detection in credit card transactions to fault detection in operating environments.
For supervised learning, algorithms are selected based on the type of prediction response:
For example, if you are building a model to predict the $ amount by which a person is likely to default on a credit card payment, then it's regression. However, if you just want to know whether they are likely to default or not, then it's classification.
To start the Machine Learning process, simply select the Machine Learning icon, create your workspace and let Knowi guide you through the steps required to create your Machine Learning models!
Triggers and actions can be applied to the results. For example, you can send an alert or a webhook into your application for the users with a high risk of default for the use case above. The process for setting up triggers and alerts on a query with machine learning remains the same as a normal dataset/query. For more details, see Alerts.
The very first thing required when starting a Machine Learning project in Knowi is to create a workspace. A workspace can be thought of as a folder that will contain all your subsequent machine learning models for the particular use case in question.
Once the workspace is created and the required type of modeling determined, the user is then required to either select or upload their training dataset. This dataset should include historical data containing the variable they wish to predict. The example flow below is for supervised learning (classification and regression).
The user can then apply Cloud9QL to the training dataset, select the variable they wish to predict, and analyze their data. Analysis shows the columns present in the training dataset; clicking the icon in each column header also displays statistical information about the data in that column.
Once the data is uploaded and the attribute to be predicted has been selected, the user then selects Prepare Data. This guides the user step by step through tasks designed to clean the data so it is ready for the machine learning algorithms to use.
When entering the Machine Learning module, the user will automatically be taken to a list of their current workspaces and published models. To edit a previously created workspace, simply click on the edit icon next to the workspace name.
Once your data is loaded and your prediction attribute selected, the next step is to ensure that your data is ready for the machine learning algorithms to run against successfully.
Knowi will lead you through a series of data preparation steps (some are mandatory and some are optional) prior to running the algorithms of your choice.
Note that the results of each step are saved. If a user leaves the data preparation area and returns later, the system will direct them to the next step in the process automatically. The user can also view their data at any time by clicking in the top right-hand corner of the box.
First, we need to ensure that all data types are correct. The user has the option to modify the data types if necessary: simply select the correct data type for each column and then select 'Next Step'.
The next step is the identification of potential outliers in your data. Knowi will highlight these values and allow the user to either remove all of them, remove selected values or skip the step completely.
Note that the user also always has the ability to go back to the Cloud9QL processing area and inspect their data again.
It is important that the training dataset does not have any missing values (null values). Rows containing missing values will either need to be removed or imputed (calculated) using the mean of the associated column. The user has the option to remove or impute values. This step is mandatory.
The system will allow the user to:

1. Enter a percentage above which all rows with that percentage of missing values will be removed from the dataset (e.g., remove all rows where more than 25% of the numerical values are missing).
2. Enter a percentage above which all columns with that percentage of missing numerical values will be removed from the dataset (e.g., remove all columns where more than 9% of the values are missing).
3. Impute the remaining missing values on a column-by-column basis (a sketch of this logic appears after the list).
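For readers who prefer to see the equivalent logic in code, here is a hedged pandas sketch of the three options above; the column names and thresholds are made up for illustration and are not part of Knowi's interface.

```python
# Hypothetical sketch of the missing-value rules described above, using pandas.
# Column names and thresholds are illustrative only.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 52, np.nan],
    "income": [48000, 61000, np.nan, 75000, 39000],
    "score":  [0.7, 0.9, 0.4, np.nan, 0.8],
})

row_threshold = 0.25   # 1. drop rows with more than 25% missing values
col_threshold = 0.09   # 2. drop columns with more than 9% missing values

df = df[df.isna().mean(axis=1) <= row_threshold]
df = df.loc[:, df.isna().mean() <= col_threshold]

# 3. impute any remaining missing values with the column mean
df = df.fillna(df.mean(numeric_only=True))
print(df)
```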
If your numerical attributes are measured on different scales (for example, weight, height, and age) then you have the option of rescaling this data. This is not required, but may boost performance. Try creating different models for your non-rescaled, standardized and normalized data and see which ones achieve higher accuracy.
Two methods of rescaling are offered: Normalization (use when you do not know the distribution of your data or the distribution is not Gaussian; this maps all values to the range 0 to 1) and Standardization (use when your data is Gaussian; this transforms the data to have a mean of 0 and a standard deviation of 1).
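As a point of reference, these two methods correspond to what scikit-learn calls MinMaxScaler and StandardScaler; the short sketch below (with made-up data) shows the effect of each. Knowi performs this step for you, so the code is purely illustrative.

```python
# A minimal illustration of the two rescaling methods using scikit-learn
# (one common implementation; the sample data is invented).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[70.0, 180.0, 34.0],    # weight (kg), height (cm), age
              [55.0, 165.0, 28.0],
              [90.0, 175.0, 61.0]])

normalized = MinMaxScaler().fit_transform(X)      # every column mapped to [0, 1]
standardized = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1 per column

print(normalized)
print(standardized)
```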
Simply select the data items to rescale, choose the method, and select 'Next Step'.
This step may be skipped entirely.
Some algorithms, such as Decision Trees, work better with discrete data. This means taking numerical data and converting it into logical, ordered groups or bins of data (ordinal attributes). It is most useful if you believe there are natural groupings within your column data or if your numerical data has a very large range of values (for example, anywhere from -infinity to 7,000,000,000).
This step is optional and can be skipped.
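For context, binning numerical data into ordered groups can be expressed with pandas.cut; the sketch below uses invented bin edges and labels purely to illustrate the idea.

```python
# Hedged sketch of discretization (binning) with pandas.cut; the bin edges
# and labels are made up for illustration.
import pandas as pd

ages = pd.Series([19, 23, 35, 47, 52, 68, 81])

age_group = pd.cut(
    ages,
    bins=[0, 25, 45, 65, 120],
    labels=["young", "adult", "middle_aged", "senior"],
)
print(age_group)
```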
Some algorithms only work with numerical data and do not support nominal or ordinal data. It will therefore be necessary to convert these values into real values. Each category will be transformed into a column (or attribute) and 0 or 1 will be inserted as the value. This is called widening your dataset.
For example, a column called Gender typically has permissible string entries for 'Male', 'Female' and 'Not Specified'. If the value in a particular case is 'Male', then this would become three columns (one for each category): Gender:Male (with a value of 1), Gender:Female (with a value of 0), and Gender:Not Specified (with a value of 0).
The existing column below would become three columns:

| Existing column | Value | New column | Value |
| --- | --- | --- | --- |
| Gender | Male | Gender:Male | 1 |
| | | Gender:Female | 0 |
| | | Gender:Not Specified | 0 |
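Outside of Knowi, the same widening step is commonly done with pandas.get_dummies; the sketch below reproduces the Gender example (the column naming follows pandas conventions rather than the Gender:Male style shown above).

```python
# Widening the Gender column into dummy variables with pandas.get_dummies;
# a sketch of the transformation shown in the table above.
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Not Specified", "Male"]})

widened = pd.get_dummies(df, columns=["Gender"], dtype=int)
print(widened)
# Columns produced: Gender_Male, Gender_Female, Gender_Not Specified,
# each holding 0 or 1 for every row.
```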
This concludes the data preparation activity. Any decisions made along the way have been saved and a user can jump back to any previous step and make changes, if they wish.
The next step in our machine learning journey is to select the model features that will help predict the outcome.
Once all data has been prepared, the user is now asked to select the features (data attributes) to feed into the model creation.
Feature selection is a crucial part of machine learning and a user will typically create many different models using many different combinations of features before finding the best fit.
The user has two options at this point, to either manually select their features or to let Knowi auto-select features for them based upon correlation and information gain algorithms that we run against the dataset.
It is highly recommended to run your model several times with different features selected.
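For readers curious how correlation- and information-gain-based ranking works in general, here is a sketch using scikit-learn's mutual_info_classif on a public dataset; this illustrates the idea, not Knowi's internal implementation.

```python
# One way to rank candidate features, using correlation and mutual
# information from scikit-learn (a sketch, not Knowi's internal method).
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

correlation = X.corrwith(y).abs().sort_values(ascending=False)
info_gain = pd.Series(mutual_info_classif(X, y, random_state=0),
                      index=X.columns).sort_values(ascending=False)

print(correlation.head())   # features most linearly related to the target
print(info_gain.head())     # features carrying the most information about the target
```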
Once the features have been selected, the user then chooses the algorithms they wish to use to train their model(s).
The user can select one or more algorithms and can also repeat using different features and settings each time.
The algorithms displayed depend on whether the user specified Classification or Regression as the workspace type at workspace creation time.
Clicking on the settings cog will allow the user to enter algorithm specific parameters.
Once all required algorithms and their settings have been entered, the user then selects 'Train'.
The models and their corresponding results will then appear in the Results section.
Each model result has 3 icons associated with it. These allow you to inspect the results of the model and also publish the chosen model. Published models can then be used against live Knowi queries to predict against incoming data.
- View the data results of the trained model and see the predicted output against the original predictor input.
- View the statistical results of each model.
- Publish the chosen model and make it available for use against incoming data.
To use the model against a live Knowi query, the user selects the 'Use Model' option corresponding to the model they wish to use. The system will then take them to the query list page where they can select the appropriate query and associate the model to be used at query run time.
In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available.
An algorithm that implements classification, especially in a concrete implementation, is known as a classifier.
Knowi currently supports four different classifiers:
As an example, to predict whether a client will default on their next payment period based on their prior payment behavior:
Download and use the data from the UCI Machine Learning Repository. This dataset contains credit card data for 30,000 clients, with 24 attributes including:
Personal characteristics such as age, education, gender, and marital status
Billing/payment history for the six-month period from April to September 2005
Navigate over to our Workspaces page and you will be led through all the necessary steps to create your model.
Time-series anomaly detection is a feature used to identify unusual patterns that do not conform to expected behavior, called outliers. There are many applications in business, from intrusion detection (identifying strange patterns in network traffic that could signal a hack) to system health monitoring (spotting a malignant tumor in an MRI scan), and from fraud detection in credit card transactions to fault detection in operating environments.
Upon creation of an Anomaly Detection Workspace, the user will be presented with a number of configuration steps.
Olympic Model (Seasonal Naive): the naive seasonal model, where the prediction for the next point is a smoothed average over the previous n periods.
Double and Triple Exponential Smoothing Models: both are popular models used to produce a smoothed time series. These exponential smoothing variants add trend and seasonality into the model. The ETS model used automatically picks the best-fitting exponential smoothing model.
Moving Average Model: the forecast is based on an artificially constructed time series in which the value for a given time period is replaced by the mean of that value and the values for some number of the preceding and succeeding time periods.
Weighted Moving Average and Naive Forecasting Models: the forecast for both of these models is also based on an artificially constructed time series in which the value for a given time period is replaced by the mean of that value and the values for some number of the preceding and succeeding time periods. The Weighted Moving Average is a special case of the moving average model.
Regression Model: models the relationship between x and y using one or more variables.
ARIMA Model: uses the Autoregressive Integrated Moving Average method.
As soon as the above steps have been completed and the Run Analysis option is selected, an anomaly detection model is trained and applied to the data. The precision of the model increases over time as more data is made available.
The anomaly detection visualization itself consists of a configurable blue band range of expected values (acceptable threshold limit) along with the actual metric data points. Any values outside of the blue band range are considered anomalies and will appear in red.
The width of the blue band of the expected values can be configured by setting the threshold attribute explicitly on the settings modal dialog. This Anomaly detection threshold is the mean absolute percentage deviation from the expected value. The default threshold value set is 50% but this can be modified.
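Conceptually, the check works like the sketch below: compute an expected value for each point, then flag points whose absolute percentage deviation from that expected value exceeds the threshold. The moving-average forecast and sample data here are illustrative simplifications of what the workspace does.

```python
# Simplified sketch of the threshold logic described above: predict an expected
# value with a moving average, then flag points whose absolute percentage
# deviation from the expected value exceeds the threshold (50% by default).
import pandas as pd

values = pd.Series([100, 104, 98, 101, 250, 99, 102, 97, 10, 103], dtype=float)

expected = values.rolling(window=3, min_periods=1).mean().shift(1)  # naive expected value
deviation = (values - expected).abs() / expected                    # absolute % deviation

threshold = 0.5   # 50% default threshold
anomalies = values[deviation > threshold]
print(anomalies)  # points that would fall outside the expected band
```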
As an option, you can save the anomaly detection visualization results as a widget that can then be shared on one or more dashboards. To do this, simply select the Save Widget option and enter a widget name. The widget will now appear in the general widget list for subsequent use outside of the Machine Learning module.
However, anomaly-related information shown in the widget settings bar cannot be edited there; all anomaly detection settings have to be changed via the anomaly workspace directly.
One crucial feature of anomaly detection is the ability to configure alerts that provide automatic notifications when new anomalies are detected.
Channels such as email, webhook and Slack can be easily set up by selecting the alerts button from the control list.
By default, the look-back interval is set equal to the alert frequency, so only anomalies within that interval will be communicated. As soon as at least one anomaly is detected, the system will trigger the alert.
There are several fixed email placeholders that may be used in the email template to add additional information:
The workspace can contain one or more anomaly detection models. To add another into the workspace, simply choose the Add Analysis button.
In regression problems, we are trying to predict continuous values as the output. This differs from classification, where the output is a category or class. We support a number of different types of regression problems using the following algorithms:
As an example, we will build a predictive model to predict house price (price is a number from some defined range, so this is a regression task). We will be using linear regression to predict sales price based on multiple attributes.
You can download the house price dataset here.
Let's suppose you want to sell your house and you are wondering what you can get for it. You usually look for other homes similar to yours, in the same area and close to the same age as yours. We will do something similar, but with Linear Regression Machine Learning.
Attribute Information:
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centers
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
PRICE: true value of owner-occupied homes in $1000's
We will be training our model to predict PRICE.
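As a rough idea of what the trained model does under the hood, here is a hedged scikit-learn sketch of the same regression task; it assumes the downloaded dataset has been saved locally as house_prices.csv with the columns listed above.

```python
# Hedged sketch of the house-price regression outside Knowi; the file name
# 'house_prices.csv' is an assumption about where the download was saved.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("house_prices.csv")
X = df.drop(columns=["PRICE"])                # all attributes except the target
y = df["PRICE"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
print("Predicted price for first test house:", model.predict(X_test.iloc[[0]])[0])
```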
Now, navigate over to our Workspaces page and you will be led through all the necessary steps to create your model.
A radial basis function network is an artificial neural network that uses radial basis functions as activation functions. It is a linear combination of radial basis functions. They are used in function approximation, time series prediction, and control.
A radial basis function (RBF) is a real-valued function whose value depends only on the distance from the origin, so that φ(x) = φ(||x||); or alternatively on the distance from some other point c, called a center, so that φ(x, c) = φ(||x - c||). Any function φ that satisfies this property is a radial function. The norm is usually the Euclidean distance, although other distance functions are also possible. For some radial functions, using a probability metric makes it possible to avoid ill-conditioning of the matrix that is solved to determine the coefficients w_i, since ||x|| is always greater than zero.
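A Gaussian radial basis function is one common choice; the sketch below shows such a function and a prediction formed as a linear combination of basis functions, with invented centers, widths, and weights.

```python
# Minimal sketch of a Gaussian radial basis function and an RBF-network
# prediction as a linear combination of such functions; the centers, width
# and weights here are illustrative, not learned.
import numpy as np

def gaussian_rbf(x, c, sigma=1.0):
    """phi(x, c) = exp(-||x - c||^2 / (2 * sigma^2))"""
    return np.exp(-np.linalg.norm(x - c) ** 2 / (2 * sigma ** 2))

centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
weights = [0.6, 1.4]   # the coefficients w_i referred to above

def rbf_network(x):
    return sum(w * gaussian_rbf(x, c) for w, c in zip(weights, centers))

print(rbf_network(np.array([0.5, 0.5])))
```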
In linear regression, the model specification is that the dependent variable is a linear combination of the parameters. The residual is the difference between the value of the dependent variable predicted by the model, and the true value of the dependent variable. Ordinary least squares obtains parameter estimates that minimize the sum of squared residuals, SSE (also denoted RSS).
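The following few lines of NumPy make this concrete: given made-up data, np.linalg.lstsq returns the parameters that minimize the sum of squared residuals.

```python
# Ordinary least squares in a few lines of NumPy: find the parameters that
# minimize the sum of squared residuals (sketch with made-up data).
import numpy as np

X = np.array([[1, 1.0], [1, 2.0], [1, 3.0], [1, 4.0]])   # column of ones = intercept
y = np.array([2.1, 3.9, 6.2, 8.1])

beta, sse, _, _ = np.linalg.lstsq(X, y, rcond=None)
print("intercept and slope:", beta)
print("sum of squared residuals:", sse)
```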
The k-nearest neighbor algorithm (k-NN) is a method for classifying objects by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is typically small). k-NN is a type of instance-based learning, or lazy learning where the function is only approximated locally and all computation is deferred until classification.
The simplest k-NN method takes a data set of feature vectors and labels with Euclidean distance as the similarity measure.
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques, e.g. cross-validation. In binary problems, **it is helpful to choose k to be an odd number as this avoids tied votes**.
The nearest neighbor algorithm has some strong consistency results. As the amount of data approaches infinity, the algorithm is guaranteed to yield an error rate no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data). k-NN is guaranteed to approach the Bayes error rate, for some value of k (where k increases as a function of the number of data points).
The user can also provide a customized distance function.
Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighborhood Components Analysis.
Alternatively, the user may provide a k-nearest neighbor search data structure. Besides the simple linear search, KD-Tree, Cover Tree, and LSH (Locality-Sensitive Hashing) for efficient k-nearest neighbor search are also available.
A KD-tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. A cover tree is a data structure for generic nearest neighbor search (with a metric), which is especially efficient in spaces with small intrinsic dimension. The cover tree has a theoretical bound that is based on the dataset's doubling constant. LSH is an efficient algorithm for approximate nearest neighbor search in high dimensional spaces by performing probabilistic dimension reduction of data.
Nearest neighbor rules in effect compute the decision boundary in an implicit manner. In general, the larger k, the smoother the boundary.
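The sketch below illustrates k-NN classification with scikit-learn, using cross-validation to compare a few odd values of k as suggested earlier; the dataset and candidate values are illustrative.

```python
# Short k-NN illustration with scikit-learn, including cross-validation to
# pick k (dataset and candidate k values are illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

for k in (1, 3, 5, 7, 9):                      # odd values avoid tied votes
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"k={k}: cross-validated accuracy {score:.3f}")
```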
The Naive Bayes Classifier technique is based on the so-called Bayesian theorem and is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods.
To demonstrate the concept of Naive Bayes classification, consider the example displayed in the illustration above. As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., decide to which class label they belong, based on the currently existing objects.
Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and often used to predict outcomes before they actually happen.
The user can change the following settings:

| Setting | Description |
| --- | --- |
| Generation Model | Multinomial or Bernoulli. The multinomial model generates one term in each position of the document. The multivariate Bernoulli model (or Bernoulli model) generates an indicator for each term, indicating either the presence or absence of the term in the document. |
| Add-k Smoothing | By default, we use add-one (Laplace) smoothing, which simply adds one to each count to eliminate zeros. |
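The two generation models and add-one smoothing map naturally onto scikit-learn's MultinomialNB and BernoulliNB; the toy term-count matrix below is invented, and Knowi's own implementation may differ in detail.

```python
# The two generation models and add-one smoothing, expressed with scikit-learn's
# Naive Bayes classifiers (a sketch with a toy term-count matrix).
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# 4 documents x 3 terms, with binary class labels
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 2, 4],
              [0, 1, 3]])
y = np.array([0, 0, 1, 1])

multinomial = MultinomialNB(alpha=1.0).fit(X, y)   # alpha=1.0 is Laplace (add-one) smoothing
bernoulli = BernoulliNB(alpha=1.0).fit(X > 0, y)   # Bernoulli uses presence/absence indicators

print(multinomial.predict([[1, 0, 2]]))
print(bernoulli.predict([[1, 0, 1]]))
```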
Support vector machines can be used as a regression method, maintaining all the main features of the algorithm. In the case of regression, a margin of tolerance ε is set in approximation. The goal of SVR is to find a function that has at most ε deviation from the response variable for all the training data, and at the same time is as flat as possible. In other words, we do not care about errors as long as they are less than ε, but will not accept any deviation larger than this.
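A small scikit-learn sketch of support vector regression with an explicit epsilon tolerance (synthetic data, illustrative parameters):

```python
# Support vector regression with an explicit epsilon tolerance, using
# scikit-learn (synthetic data, illustrative parameters).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

model = SVR(kernel="rbf", epsilon=0.1, C=1.0)   # deviations within 0.1 are not penalized
model.fit(X, y)
print(model.predict([[2.5]]))
```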
A decision tree can be learned by splitting the training set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning.
Classification and Regression Tree techniques have a number of advantages over many alternative techniques.

Simple to understand and interpret. In most cases, the interpretation of results summarized in a tree is very simple. This simplicity is useful not only for purposes of rapid classification of new observations, but can also often yield a much simpler "model" for explaining why observations are classified or predicted in a particular manner.

Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable.

Tree methods are nonparametric and nonlinear. The final results of using tree methods for classification or regression can be summarized in a series of (usually few) logical if-then conditions (tree nodes). Therefore, there is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear, follow some specific non-linear link function, or are even monotonic in nature. Thus, tree methods are particularly well suited for data mining tasks, where there is often little a priori knowledge and no coherent set of theories or predictions regarding which variables are related and how. In those types of data analytics, tree methods can often reveal simple relationships between just a few variables that could easily have gone unnoticed using other analytic techniques.
One major problem with classification and regression trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. Besides, decision-tree learners can create over-complex trees that cause overfitting. Mechanisms such as pruning are necessary to avoid this problem. Another limitation of trees is the lack of smoothness of the prediction surface.
Logistic regression (logit model) is a generalized linear model used for binomial regression. Logistic regression applies maximum likelihood estimation after transforming the dependent into a logit variable. A logit is the natural log of the odds of the dependent equaling a certain value or not (usually 1 in binary logistic models, the highest value in multinomial models). In this way, logistic regression estimates the odds of a certain event (value) occurring.
Logistic regression has many analogies to ordinary least squares (OLS) regression. Unlike OLS regression, however, logistic regression does not assume linearity of relationship between the raw values of the independent variables and the dependent, does not require normally distributed variables, does not assume homoscedasticity, and in general has less stringent requirements.
Compared with linear discriminant analysis, logistic regression has several advantages:
Logistic regression also has strong connections with neural networks and maximum entropy modeling. For example, binary logistic regression is equivalent to a one-layer, single-output neural network with a logistic activation function trained under log loss. Similarly, multinomial logistic regression is equivalent to a one-layer, softmax-output neural network.
Logistic regression estimation also obeys the maximum entropy principle, and thus logistic regression is sometimes called "maximum entropy modeling", and the resulting classifier the "maximum entropy classifier".
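To make the log-odds interpretation concrete, the sketch below fits a logistic regression with scikit-learn on a public dataset and shows how the logit relates to the predicted probability via the sigmoid; it is an illustration, not Knowi's implementation.

```python
# Logistic regression as described above: the model outputs log-odds, and a
# sigmoid converts them to a probability of the event (scikit-learn sketch).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

log_odds = model.decision_function(X[:1])           # logit of the positive class
probability = 1 / (1 + np.exp(-log_odds))           # sigmoid recovers the probability
print(log_odds, probability, model.predict_proba(X[:1])[0, 1])
```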
A decision tree can be learned by splitting the training set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions.
The settings cog allows the user to enter options for the following:
Decision tree techniques have a number of advantages over many alternative techniques.
Simple to understand and interpret: in most cases, the interpretation of results summarized in a tree is very simple. This simplicity is useful not only for purposes of rapid classification of new observations, but can also often yield a much simpler "model" for explaining why observations are classified or predicted in a particular manner.

Able to handle both numerical and categorical data: other techniques are usually specialized in analyzing datasets that have only one type of variable.

Nonparametric and nonlinear: the final results of using tree methods for classification or regression can be summarized in a series of (usually few) logical if-then conditions (tree nodes). Therefore, there is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear, follow some specific non-linear link function, or are even monotonic in nature. Thus, tree methods are particularly well suited for data mining tasks, where there is often little a priori knowledge and no coherent set of theories or predictions regarding which variables are related and how. In those types of data analytics, tree methods can often reveal simple relationships between just a few variables that could easily have gone unnoticed using other analytic techniques.
One major problem with classification and regression trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. Besides, decision-tree learners can create over-complex trees that cause overfitting. Mechanisms such as pruning are necessary to avoid this problem. Another limitation of trees is the lack of smoothness of the prediction surface.
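As an illustration of keeping a tree small to control variance, the sketch below fits a depth-limited, cost-complexity-pruned decision tree with scikit-learn; the parameters are illustrative.

```python
# A decision tree with a depth limit and cost-complexity pruning to curb the
# overfitting and variance discussed above (scikit-learn sketch).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)

print("held-out accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=list(load_breast_cancer().feature_names)))
```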
The UCI Machine Learning Repository contains many full data sets that can be used to test and train machine learning models. One such example is the Breast Cancer Wisconsin (Diagnostic) Data Set which relates whether breast cancer is benign or malignant to 10 specific aspects of the tumor. Based on this dataset, we can develop a model that will be able to determine the likelihood of breast cancer being benign or malignant.
The process of using machine learning to analyze data is made easy with Knowi Adaptive Intelligence. Given a training dataset, Knowi can apply either classification or regression algorithms to build valuable insights from the data.
Here is a step-by-step guide on how to turn that data into a powerful machine learning model using Knowi:
To start the machine learning process, go to www.knowi.com. If you are not already a Knowi user, sign up for a free trial to complete this tutorial. Once in, go to the machine learning section, which can be found on the left-hand side of the screen. From there, start a new workspace; you will be given a choice between a classification or regression model. For the breast cancer example, the workspace will be classification, because the variable we are predicting always falls into one of two categories. Next, upload the Breast Cancer Wisconsin (Diagnostic) Data Set.
After uploading, and possibly manipulating the file, choose the Attribute to Predict from the drop-down list. In the case of the breast cancer data, the attribute being predicted is the class of the tumor. Following the choice of the prediction variable, the initial analysis takes place by using the Analyze Data button. This displays the data on the screen and allows an opportunity to scroll through the data looking for patterns.
Prepare the Data
After analyzing, data preparation begins. Data preparation is an optional, wizard-driven process in which the program confirms the training set data types, identifies outliers and allows them to be removed, reports missing data with the option to remove or impute values, allows rescaling of the data, groups values into discrete bins and, finally, provides the option to create dummy variables. All decisions can be changed by moving backwards and forwards through the steps at any time.
For the Breast Cancer data, a small amount of rescaling and grouping were necessary to increase accuracy.
Whether you came in with prepared data, or just finished the process, the next step is to select which variables will be used in the model. To make this decision, it is essential to look back at the data for patterns and correlations.
At this point you are left with choosing between the available algorithms (i.e. Decision Tree, Logistic Regression, K-Nearest Neighbor, or Naive Bayes). Knowi makes it easy to choose all of them and compare them using metrics such as accuracy or absolute deviation. Pressing the little eye next to a model in the results section will show a preview of the input data along with the program's predictions. Next to the eye there is a plus sign that, when pressed, will display the details of that specific model. It is beneficial to produce many models, tweaking settings each time, to find the best one for the situation. All past models are saved in the history and can be viewed, compared, and even published.
The last step is publication. This step involves the button next to the plus sign. Upon publishing, a prompt to name the model will be displayed. It is possible to publish as many models as needed from the same data. All models that are created can be viewed and compared directly in the 'Published Models' tab within Machine Learning.
How to Apply a Model to a Query
Now you have officially created a machine learning model that can seamlessly be applied to any query. To integrate it into a dataset, simply press 'Apply Model' while performing a query; this adds a field where any of your machine learning models can be selected and used. Pressing the preview button on the screen will show the data along with the predictions made by the model.
Actions from Insight Made Easy
With those six steps you have a machine learning model that can be integrated into any workflow and create new visualizations and insights that will drive downstream actions. The applications of the machine learning model are endless and can be tailored to the individual need. Once a model is made and put in place, there are many actions that can be performed to gain meaning and spark reactions. This is done through trigger notifications, which fire when a specified condition is met. In the scope of the breast cancer machine learning model, an alert can be set to email a doctor the patient's information if the model finds a tumor to be malignant. This enables more than just insights; it generates action.
Summary:
The process of creating a model within Knowi is so easy that anyone can do it, and it starts with simply uploading a dataset. Data can be uploaded from files, SQL and NoSQL sources, and REST APIs. Once a file is uploaded, Knowi offers built-in algorithms, or the option to create your own, along with a designated page to review multiple factors and evaluate the best algorithm for your situation. Using this method, the Breast Cancer training data was loaded from the UCI Machine Learning Repository into a Knowi workspace, then analyzed with the built-in data preparation tools. The resulting model was ready to be integrated into any workflow and autonomously perform actions based on the results, such as sending an alert to a doctor depending on the outcome of the test. Give Knowi a try and see how easy visualizing and learning from your data can be.
References
Dheeru, D., & Karra Taniskidou, E. (2017). UCI Machine Learning Repository. Retrieved from University of California, Irvine, School of Information and Computer Sciences: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Knowi. (2017). Adaptive Intelligence for Modern Data. Retrieved from Knowi Web site: www.knowi.com/