Automatic model development
In order to provide outstanding accuracy, the Predictron Labs framework examines the statistical properties of the dataset and selects the appropriate ML algorithm from the list of available methods.
For optimal data presentation to the selected algorithm, the Predictron Labs framework will perform certain data transformation processes (imputation, outlier detection, normalization, distribution fitting, and category merging) on the incoming data.
In some cases more than one model is trained to predict the target, which enables the framework to select the best performing candidate.
The applied data transformation steps are transparent to the user, and the same transformation processes are used when deploying or evaluating the model.
Binary decision problem - example
Let’s examine a typical campaign optimization problem in which we want to predict whether customers will click on the link we send them via email.
This example assumes that we had previously sent out an email to 1000 customers out of which 150 (SumEvent) clicked on the link.
In this case we predict the probability of the less frequent event and train a predictive model accordingly. As a result, our model predicts the probability of a click for each customer based on the explanatory variables.
The tables below present the results of this prediction using three different approaches.
The first table shows the outcome of random selection. The second table illustrates the perfect solution, in which the model completely identifies the customers who clicked based on the explanatory variables. The third table contains the results of a typical model.
Results by random selection
| Selected % | Selected | SelectedEvent_{RND} | %EventRate_{RND} | Lift_{RND} | C.Gain_{RND} |
|---|---|---|---|---|---|
| 10% | 100 | 15 | 15% | 1 | 10% |
| 20% | 200 | 30 | 15% | 1 | 20% |
| 30% | 300 | 45 | 15% | 1 | 30% |
| 40% | 400 | 60 | 15% | 1 | 40% |
| 50% | 500 | 75 | 15% | 1 | 50% |
| 60% | 600 | 90 | 15% | 1 | 60% |
| 70% | 700 | 105 | 15% | 1 | 70% |
| 80% | 800 | 120 | 15% | 1 | 80% |
| 90% | 900 | 135 | 15% | 1 | 90% |
| 100% | 1000 | 150 | 15% | 1 | 100% |
The perfect solution
| Selected % | Selected | SelectedEvent_{IDEAL} | %EventRate_{IDEAL} | Lift_{IDEAL} | C.Gain_{IDEAL} |
|---|---|---|---|---|---|
| 10% | 100 | 100 | 100% | 6.667 | 67% |
| 20% | 200 | 150 | 75% | 5.000 | 100% |
| 30% | 300 | 150 | 50% | 3.333 | 100% |
| 40% | 400 | 150 | 38% | 2.500 | 100% |
| 50% | 500 | 150 | 30% | 2.000 | 100% |
| 60% | 600 | 150 | 25% | 1.667 | 100% |
| 70% | 700 | 150 | 21% | 1.429 | 100% |
| 80% | 800 | 150 | 19% | 1.250 | 100% |
| 90% | 900 | 150 | 17% | 1.111 | 100% |
| 100% | 1000 | 150 | 15% | 1.000 | 100% |
The typical solution
| Selected % | Selected | SelectedEvent_{Model} | %EventRate_{Model} | Lift_{Model} | C.Gain_{Model} |
|---|---|---|---|---|---|
| 10% | 100 | 50 | 50% | 3.333 | 33% |
| 20% | 200 | 80 | 40% | 2.667 | 53% |
| 30% | 300 | 100 | 33% | 2.222 | 67% |
| 40% | 400 | 115 | 29% | 1.917 | 77% |
| 50% | 500 | 130 | 26% | 1.733 | 87% |
| 60% | 600 | 140 | 23% | 1.556 | 93% |
| 70% | 700 | 150 | 21% | 1.429 | 100% |
| 80% | 800 | 150 | 19% | 1.250 | 100% |
| 90% | 900 | 150 | 17% | 1.111 | 100% |
| 100% | 1000 | 150 | 15% | 1.000 | 100% |
The columns in the above tables were computed as follows:
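The formulas below are reconstructed from the values in the tables, with N = 1000 observations and SumEvent = 150; SelectedEvent is the number of events (clicks) among the selected observations:

$$
\begin{aligned}
\text{Selected} &= \text{Selected\%} \times N \\
\text{\%EventRate} &= \frac{\text{SelectedEvent}}{\text{Selected}} \\
\text{Lift} &= \frac{\text{\%EventRate}}{\text{SumEvent}/N} \\
\text{C.Gain} &= \frac{\text{SelectedEvent}}{\text{SumEvent}}
\end{aligned}
$$

For example, in the typical solution at 20% selected: %EventRate = 80/200 = 40%, Lift = 40%/15% = 2.667, and C.Gain = 80/150 = 53%.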
Confusion or contingency chart
This chart is the visual representation of the contingency table. The bars on the chart are equivalent to the columns of the contingency table.
Contingency table
The contingency table (also referred to as cross tabulation or cross tab or an error matrix in machine learning) is a specific table layout that allows the performance visualization of algorithms with categorical or binary output.
For binary or category targets this table is displayed on the model detail page. The rows of the matrix represent the observations by the value of the prediction, while the columns represent the observations by the value of the target variable.
The number of correct predictions is shown in the diagonal of this table where the predicted value equals the target. The non-diagonal elements in each fact column show the number of false predictions.
The non-diagonal elements in the table make it easy to check that the system does not confuse two classes (i.e. commonly mislabel one as another).
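As a rough illustration, a contingency table for a binary click model can be built with pandas; the labels and data here are made up for the example and do not reflect the framework's API:

```python
# A minimal sketch of a contingency (confusion) table for a binary target.
import pandas as pd

facts       = pd.Series(["click", "no_click", "click", "no_click", "no_click", "click"])
predictions = pd.Series(["click", "no_click", "no_click", "no_click", "click", "click"])

# Rows: observations by predicted value; columns: observations by fact,
# as on the model detail page.
table = pd.crosstab(index=predictions, columns=facts,
                    rownames=["predicted"], colnames=["fact"])
print(table)
# Correct predictions lie on the diagonal; off-diagonal cells in each
# fact column count the false predictions (class confusions).
```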
Dataset detail page
This page provides an overview of the uploaded dataset. Here you can review or download the dataset with the predictions added, and you can also investigate the statistics of the dataset and the distribution of each variable.
Dataset format requirements
The Predictron Labs framework accepts uploaded datasets in either gzip-compressed or uncompressed CSV format. The header of an uploaded dataset must contain the names of the data fields.
The name of a variable cannot exceed 20 characters. The first character of a variable name has to be a letter of the English alphabet (a-z or A-Z).
The remaining characters may form an arbitrary sequence of alphanumeric characters (a-z, A-Z, 0-9) extended with dash (-) and underscore (_).
The following names are reserved for the Predictron Labs framework:
- _id: names that contain the string “_id” are recognised and used as an id variable
- _target: names that contain the string “_target” are recognised and used as a target variable
- _prediction: names that contain this string are reserved by the Predictron Labs framework for publishing the predictions generated by the applied predictive model
- _prob: names that contain this string are reserved by the system for publishing the probability of each category when the target variable is binary or categorical
For an example please download our sample Iris dataset from the download section.
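A sketch of the naming rules above, expressed as a Python check (the framework's own validation may differ in its details):

```python
# Illustrative check of the variable-name rules described above.
import re

# First character: English letter; then up to 19 further characters
# drawn from letters, digits, dash and underscore (20 characters max).
NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_-]{0,19}$")

def variable_role(name: str) -> str:
    """Guess the role the framework would assign from the name."""
    if not NAME_RE.match(name):
        raise ValueError(f"invalid variable name: {name!r}")
    if "_id" in name:
        return "id"
    if "_target" in name:
        return "target"
    if "_prediction" in name or "_prob" in name:
        return "reserved"   # published by the framework itself
    return "explanatory"

print(variable_role("species_target"))  # -> target
print(variable_role("petal_width"))     # -> explanatory
```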
Deferred evaluation
Deferred evaluation in the Predictron Labs terminology refers to the process when the true value of the predicted variable (the fact) becomes available after the model deployment phase.
Deferred evaluation is only available when there is a unique ID in the deployment dataset: the framework uses this specific ID to match predictions with the facts provided when a deferred evaluation dataset is uploaded.
Deferred evaluation essentially closes the fact-feedback loop, which makes the automatic re-training of models possible when a drop in the evaluation metrics triggers it.
Dataset for deferred evaluation
The deferred evaluation dataset format is very simple. It has to meet the general requirements and must contain an ID and a target field.
The name of the target field should be identical to that of the target variable in the trainset, while the ID should have the same name as in the related deployment dataset.
The content of the target field should be identical to the fact.
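For illustration, a deferred-evaluation file for the email campaign example could be written as follows; customer_id and click_target are hypothetical names that must match the ID used at deployment and the target name used in the trainset:

```python
# Illustrative only: writes a minimal deferred-evaluation CSV.
import csv

rows = [
    {"customer_id": "C-0001", "click_target": 1},  # fact: clicked
    {"customer_id": "C-0002", "click_target": 0},  # fact: did not click
]

with open("deferred_eval.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["customer_id", "click_target"])
    writer.writeheader()
    writer.writerows(rows)
```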
Deployment detail page
The deployment detail page provides the UI to review and upload deployment datasets for the linked predictive model. At the top of the page you can see the deployment history chart, which gives an overview of the datasets uploaded for deployment. The bottom of the page shows an overview of the deployment dataset selected in the top right dataset selector.
Deployment history chart
This chart gives an overview of the deployment of the linked predictive model. Each dataset submitted for deployment generates a new item on the X axis.
The bars show the volume of predictions generated during deployment (left Y axis), while the line shows the DS Index of each deployment set (right Y axis).
Deployment dataset
The deployment dataset is used for model deployment in the predictive analytics workflow.
The dataset uploaded for deployment has to comply with the general requirements of the framework. Moreover, it has to comprise all the explanatory variables used by the model to be deployed.
During deployment, the predictions are generated by the deployed predictive model.
Adding an id variable to the deployment dataset enables the deferred evaluation of the predictive model.
Descriptive statistics table
The descriptive statistics table in the Predictron Labs framework refers to the table displayed on the dataset detail page that gives an overview of the variables found in the uploaded dataset.
Clicking on the rows in the table will display the distribution of the selected variable in the above distribution chart.
When working with deployment or evaluation datasets, the rows may be coloured depending on the statistical difference between the variable in the dataset and its equivalent in the trainset that was used to train the model.
Red indicates a significant difference between the distributions compared with the trainset. This is usually caused by systematic changes in the surrounding conditions.
Such a change might cause the DS Index to decrease when the variable has a high importance in the model. The DS Index is an indirect method of monitoring model performance, and its drop may signal a decrease in model performance.
Yellow indicates a noticeable difference between the distributions compared with the trainset. This is likely to be caused by some systematic change in the surrounding conditions. If the variable has a high importance in the model, the difference in distribution might cause the DS Index to decrease.
Rows with grey text indicate that there is no significant difference between the distribution of the variable and the distribution of that variable in the trainset.
The columns in the table have the following meaning:
- name: refers to the name contained in the header of the uploaded dataset
- role: refers to the role of the field in the analysis; it can be id, target, explanatory, unknown or prediction
- type: shows the type of the variable; it can be binary, categorical or numeric
- count: the number of non-NULL values stored in the field
- min: for numeric variables, the minimum value stored
- max: for numeric variables, the maximum value stored
- mode: the most frequent value of the given variable
- Std.Dev.: for numeric variables, the standard deviation
- skew: a measure of the asymmetry of the values stored
- unique: the number of unique values stored
- valid%: the percentage of non-NULL values
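For a numeric column, most of these statistics can be reproduced with pandas; a rough sketch follows (the file and column names are made up, and the framework's exact definitions may differ):

```python
# Illustrative reproduction of the descriptive statistics for one column.
import pandas as pd

df = pd.read_csv("dataset.csv")     # hypothetical dataset file
col = df["petal_width"]             # hypothetical numeric variable

stats = {
    "count":    col.count(),                     # non-NULL values
    "min":      col.min(),
    "max":      col.max(),
    "mode":     col.mode().iloc[0],              # most frequent value
    "Std.Dev.": col.std(),
    "skew":     col.skew(),                      # asymmetry measure
    "unique":   col.nunique(),
    "valid%":   100 * col.count() / len(col),    # share of non-NULL values
}
print(stats)
```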
Distribution chart
This chart shows the distribution of the selected variable's values. Clicking on the appropriate row in the descriptive statistics table displays the selected variable in the chart.
If a dataset is uploaded for evaluation or deployment, the chart also shows the distribution of the corresponding variable in the training dataset.
Data Stationary Index (DSI)
The Data Stationary Index measures the similarity of a dataset's statistical properties to those of the trainset.
For evaluation and deployment datasets, the similarity to the trainset is measured by weighting each variable according to its importance in the applied predictive model.
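The exact DSI formula is not given here; a plausible sketch of the idea (an assumption, not the framework's actual computation) scores each variable's distributional similarity to the trainset and weights it by the variable's importance:

```python
# Assumed sketch only: per-variable similarity (here 1 minus the
# two-sample Kolmogorov-Smirnov statistic) weighted by variable
# importance and scaled to a 0-100 index.
from scipy.stats import ks_2samp

def ds_index(train_df, new_df, importances):
    """importances: {variable_name: weight}, weights summing to 1."""
    score = 0.0
    for name, weight in importances.items():
        ks_stat, _ = ks_2samp(train_df[name].dropna(), new_df[name].dropna())
        score += weight * (1.0 - ks_stat)   # 1.0 = identical distributions
    return 100 * score
```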
Explanatory or Independent variables
The explanatory variables are the ones used to predict the value of the target variable.
Explanatory variables are often called independent variables or predictors.
For example, if we want to predict the average daily temperature based on the month and the latitude, the month and the latitude are the explanatory variables, while the target variable is the average daily temperature.
Fact
A fact is a true and veritable piece of information. In the prediction context it is equal to the value of the target variable. The fact is the desired outcome of the prediction.
When a predictive model resolves a specific task perfectly, the prediction equals the fact; far more commonly, however, the match is only as close as the model accuracy describes.
Gain chart
The cumulative gain chart is a visual aid for measuring the performance of models with binary output. The population is sorted by decreasing predicted probability of the less likely output. The Y axis shows the portion of the less likely events captured when selecting the percentage of predictions displayed on the X axis. The baseline shows the cumulative gain under random selection: in that case the less likely events are evenly distributed, and their percentage is a linear function of the selection size. The chart below also shows how the gain chart looks for a perfect model. For more details on the cumulative gain please look at this example.
Id variable
The Predictron Labs framework uses the ID variable to uniquely identify each observation during model deployment. When the framework finds a variable in the dataset that contains the string _id in its name, the system automatically recognises and uses it as an ID variable. Having an ID variable is optional: only deferred evaluation and online learning require the inclusion of the _id variable in the deployment dataset.
Lift chart
The lift chart is a visual aid for measuring the performance of models with binary output. It shows the improvement achieved by the model compared with random selection. The lift is a measure of model effectiveness calculated and displayed as the ratio between the results obtained with and without the model. A lift value of 2.5 means that when the observations are sorted in descending order of predicted event probability, the number of events in the selected dataset is 2.5 times higher than it would be under random selection.
The Y axis shows the lift obtained when the given percentage of observations with the highest predicted event probability is selected; the X axis shows that percentage of the whole dataset. The lift values normally form a decreasing curve, because as more observations are selected, the average probability of the event in the selection decreases until it reaches the event rate of the whole dataset. Normally the lift curve reaches its minimum value of 1 when 100% of the observations are selected, since the event probability in the whole dataset matches the event probability achieved by random selection.
Local maxima or peaks in the lift curve (except at the left hand side) indicate that the model has lost its best efficiency: the probability of events is higher among observations with lower predicted probability. A lift value below 1 indicates that the model has lost its predictive power and performs worse than random selection.
The baseline shows the lift when random selection is applied to the dataset. In that case the chance of selecting an event equals the average probability of the event in the whole dataset, thus the lift is equal to 1. For more details on the lift please review the example below.
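A sketch of how lift and cumulative gain can be computed from predicted probabilities; the names and the generated data are illustrative only:

```python
# Illustrative computation of lift and cumulative gain for a binary model.
import numpy as np

def lift_and_gain(y_true, y_prob, selected_pct):
    """y_true: 0/1 facts; y_prob: predicted event probabilities."""
    order = np.argsort(-y_prob)               # sort by descending probability
    n_sel = int(round(selected_pct * len(y_true)))
    selected_events = y_true[order][:n_sel].sum()
    event_rate = selected_events / n_sel      # %EventRate in the selection
    lift = event_rate / y_true.mean()         # improvement over random selection
    gain = selected_events / y_true.sum()     # share of all events captured
    return lift, gain

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.15, size=1000)     # ~15% event rate, as in the example
y_prob = np.clip(0.3 * y_true + 0.5 * rng.random(1000), 0, 1)  # informative scores
print(lift_and_gain(y_true, y_prob, 0.10))    # lift and gain for the top 10%
```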
Model deployment or scoring
Model deployment or scoring is the phase of predictive analytics in which predictions are generated with a previously trained model using the deployment dataset. The predictions can be downloaded from the deployment detail page or retrieved through the Predictron Labs API. During deployment the DS Index provides a priori information about the expected validity of the predictions based on the similarity of the deployment dataset to the trainset.
Model detail page
This page gives an overview about the predictive model and its performance on different datasets.
A common part of this page is the variable importance chart. The variable importance is a constant property of the model that is determined in the training phase; it is an attribute of the model itself, independent of the datasets used later for evaluation or deployment.
Other evaluation measures shown on this page depend on the dataset on which the models were evaluated.
The prediction accuracy is a common measure for every model type, regardless of the type of the target variable.
The other measures displayed depend on the type of the predicted target.
Binary models are evaluated with the confusion (contingency) chart, the lift chart, the gain chart and the contingency table.
Models predicting category targets with more than two values are evaluated with the confusion (contingency) chart and the contingency table.
Numeric targets with more than 30 different values are considered continuous targets and are predicted accordingly.
Continuous targets are evaluated with the prediction & fact scatter plot and the prediction error distribution chart.
Model evaluation
The evaluation of a predictive model means the comparison of the predictions with the facts. Accordingly, the model accuracy is computed based on this comparison.
When the model accuracy decreases compared with its train-time value, the model is likely losing its predictive power and needs to be re-trained in order to model the hidden relation between the explanatory variables and the target more accurately.
The Predictron Labs framework will provide the option to automatically re-train the models when the evaluation process indicates that the existing model is outdated and a better performing model is available.
Model accuracy is not the only way to evaluate model performance: additional evaluation methods are available on the model detail page.
Model monitoring
In predictive analytics, the objective of model monitoring is to keep track of predictive model performance. There are two approaches to monitoring model performance.
The direct method compares the predictions with the facts; this is called model evaluation. In a predictive modelling project, when the facts are available and their relative independence from the model predictions is ensured (for example by holding out a random sample of observations), the models can be evaluated regularly, so their performance can easily be tracked with direct measurement.
Nevertheless, in some applications holding out a random sample of observations is not feasible, or the facts are simply not available (e.g. gathering them takes a long time or is costly). In these cases model performance monitoring requires an indirect method: we need to make sure that the actual model environment remains similar to the circumstances present during model training. When the circumstances are unchanged, it is highly probable that the model is still valid and performs similarly to its train-time performance.
A way to ensure the consistency of circumstances is to check the statistical properties of the explanatory variables. In the Predictron Labs framework, the similarity of the explanatory variables to their train-time condition is measured with the Data Stationary Index (DSI), taking into account the importance of the variables used by the predictive model.
Model training
Model training is the process of searching for meaningful patterns across the explanatory variables that influence the target variable. When a dataset is uploaded as a trainset, the framework automatically attempts to train a model on the uploaded dataset. In order to build a predictive model successfully, the uploaded dataset has to comply with the requirements of the trainset. When the uploaded dataset is suitable for model development, the Predictron Labs framework automatically develops, based on the statistical properties of the dataset, the model with the highest expected performance. The final step of model training is model evaluation, in which the predictions generated by the model are matched with the facts stored in the target variable in order to determine the accuracy of the model.
Model Validity Index (MVI)
The Model Validity Index is a measure of model efficiency compared with the prediction accuracy achieved during training. It helps us decide whether the model has lost performance compared with training, gained performance, or remained stable. The index can be below or above 100, and it is measured through the prediction accuracy.
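A plausible formulation consistent with this description (an assumption, since the exact formula is not given here) is:

$$
MVI = 100 \times \frac{\text{current prediction accuracy}}{\text{train-time prediction accuracy}}
$$

Values below 100 indicate that the model performs worse than at train time; values above 100 indicate improved performance.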
Online learning
Online learning in predictive analytics or data science refers to the process in which the predictive model is automatically re-trained or updated.
The trigger for automated online training is the fact that becomes available after model deployment through deferred evaluation.
Prediction & Fact scatter plot
For numeric target variables, this chart gives you an overview of the predictions plotted against the values of the target variable on a scatter plot.
Prediction
A prediction is the assumed value of the target variable generated by the deployed predictive model. For a perfect model the predictions equal the facts; this is not a typical situation, however, since the explanatory variables usually do not contain all the information needed to explain the target perfectly. The model accuracy provides an explicit prediction/fact comparison.
Prediction Accuracy or model accuracy
Prediction accuracy is the result of the comparison measurement used in model evaluation: the prediction is compared with the fact.
The actual implementation of this measure depends on the type of the target variable.
For binary and category targets, the accuracy is the percentage of predictions that equal the facts, i.e. the percentage of correct predictions:
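$$
\text{Accuracy} = 100\% \times \frac{\#\{\text{predictions equal to the fact}\}}{\#\{\text{all predictions}\}}
$$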
For numeric targets, the model accuracy is equal to the percentage of the target variance explained by the model.
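One common formulation of the explained-variance percentage (an assumption; the framework may use an equivalent definition) is:

$$
\text{Accuracy} = 100\% \times \left(1 - \frac{\operatorname{Var}(y - \hat{y})}{\operatorname{Var}(y)}\right)
$$

where $y$ is the fact and $\hat{y}$ is the prediction.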
Prediction error chart
This chart shows the distribution of the differences between the numeric target variable and the predictions. The model performs best when this distribution is symmetric.
Predictive analytics
Predictive analytics or data mining is the collection of processes and methods that use the knowledge stored in historical data to predict future, otherwise unknown events. The main components of predictive analytics are the workflow and the predictive model.
The most widely referenced predictive analytics workflow is the CRISP-DM process.
The Predictron Labs framework covers four components of this process: model training (development), evaluation, deployment and monitoring.
Predictive model
The predictive model is the key component of predictive analytics. A predictive model can be a statistical model or a machine learning algorithm that is used during model training to find patterns among the explanatory variables that have an impact on the target variable. During the model deployment phase, the trained predictive model can be used to determine, based on the explanatory variables, the most likely outcome of the unknown target variable.
Predictron Labs Framework
Every model and dataset in the Predictron Labs framework is organized in specific branches which connect to their root project.
When a new trainset is uploaded, the branches fork and connect the dataset with the model that was trained on the respective trainset.
The respective evaluation and deployment datasets can be regarded as “leaves” connected to the model.
A project might have several development branches and a unique production branch.
The diagram below shows an example of the system of branches with datasets and models.
Production & development branch
In each project a model is nominated to be the production model.
At the top of the model detail page the model can be selected and marked as the production model for the project.
By default, a dataset submitted to the project is evaluated or deployed with the model selected for production. Datasets submitted to the project become elements of the production branch. An overview of the production branch is displayed on the Observatory dashboard.
Project detail page
This page contains the overview of the predictive analytics project. In the top left box you can review the set of explanatory variables used and the target variable that is predicted by the previously trained predictive model.
The API URL shown here can be used to train a new model by uploading a trainset programmatically, while the API URLs for evaluation and deployment provide programmatic access to the model in the production branch.
The prediction accuracy of the model in production is displayed in the top right corner. Buttons below this chart provide shortcuts for reviewing, evaluating and deploying the model in the project’s production branch.
This page also provides a UI for uploading new trainsets and also for reviewing trainsets that were uploaded earlier.
Clicking on an uploaded trainset at the bottom of the page takes you to the trainset detail page and the model detail page.
Target or dependent variable
In business applications of predictive analytics, predicting the value of the target variable in advance carries business value. The target variable is the one that is predicted during the model deployment stage; it is also the basis for finding patterns across the explanatory variables during model training.
The content of the target variable differs in each business application. When the aim is to detect fraud in a transactional system, the target variable indicates the fraudulent transactions, while in a campaign optimization application the value of the target variable reflects the interest of a given customer.
Based on their names, the Predictron Labs framework automatically identifies the target variable in each uploaded trainset.
The role of a variable will be defined as target when its name contains the string "_target". If more than one variable has "_target" in the name, the framework will automatically assign the optimal variable to the specific role.
Trainset or evaluation set
The trainset and the evaluation dataset have identical structures; nevertheless they have different roles in the workflow.
The trainset is used for predictive model training, while the dataset uploaded for model evaluation is used to evaluate the performance of a previously trained model.
The dataset uploaded for training or evaluation has to meet the general requirements of the framework. Furthermore, it has to contain a target variable and at least one explanatory variable. In the trainset and in the evaluation set the value of the target variable equals the fact.
Variable importance
Variable importance is a measure that describes the weight of a variable when determining the prediction with a predictive model.
The higher the importance of a variable is, the more sensitive the prediction is to the incoming values of that variable.
The highly important variables predominantly define the lifetime of the model.
If the distribution of a highly important variable changes, the DS Index will drop and the model is likely to lose its validity; its performance is likely to change significantly compared with its train-time performance.
Variable role & type
A variable is any characteristic, number, or quantity that can be measured or counted.
In predictive analytics, two main properties of a variable are distinguished: its role and its type.
The role of a variable describes how the variable is used in the predictive analytics process. The role of a variable in a predictive analytics project can be explanatory, target or id. The Predictron Labs framework determines the role of the variables based on their names: the variable containing _target in its name is used as the target variable, while the one containing _id is used as the id variable. Every other variable present in the dataset is used as an explanatory variable.
The other property of a variable is its type, which can be either numeric or categorical. Numeric variables have values that describe a measurable quantity as a number, like 'how many' or 'how much'; they are therefore quantitative variables. Numeric variables may be further described as either continuous (e.g. height, weight, temperature, age) or discrete (e.g. number of children, number of previous accidents, days since the last visit).
Categorical variables have values that describe a 'quality' or 'characteristic' of a data unit, like 'what type' or 'which category'. Categorical values are mutually exclusive (each unit falls in one category or another), so categorical variables are qualitative variables and tend to be represented by non-numeric values. Categorical variables may be further described as ordinal (e.g. clothing size: small, medium, large; or attitudes: strongly agree, agree, disagree, strongly disagree) or nominal (e.g. sex, business type, color, religion).
Using the Predictron Labs framework you can freely use any type of variable as an explanatory, target or id variable.