Prediction
R caret train glmnet final model lambda values not as specified
Rules
- What it takes to develop a successful (clinical) prediction model. TEN important things to avoid
Stability
Stability of clinical prediction models developed using statistical or machine learning methods 2023, R package pminternal
Feature selection
- https://topepo.github.io/caret/recursive-feature-elimination.html
- Feature Selection Approaches from http://r-statistics.co
- Leave-one-out feature importance (LOFO) + LightGBM
Recursive feature elimination
- Recursive Feature Elimination (RFE) is a feature selection algorithm developed by Isabelle Guyon and her colleagues. It is a recursive algorithm used to select a subset of relevant features from a large pool of candidate features in a dataset.
- The basic idea behind RFE is to recursively eliminate the features that are least informative or least relevant to the prediction task. At each iteration, the algorithm removes the feature with the lowest contribution to prediction performance, as measured by a metric such as accuracy or F1-score. The algorithm continues to eliminate features until a stopping criterion is met, such as a fixed number of features, a minimum prediction performance, or a user-defined threshold. So RFE depends on
- Prediction algorithm (e.g., randomForest)
- Number of desired features (a single number or a set of candidates)
- If the number of desired features is unknown, we can estimate the performance of each candidate size by CV. So we need to specify the number of folds and whether to use repeated CV (and, if so, how many repeats).
- RFE has been applied to a variety of machine learning problems, including classification, regression, and clustering. It is often used in combination with other feature selection approaches, such as wrapper or filter methods, to provide a more comprehensive and robust approach to feature selection.
- One advantage of RFE is that it provides a simple and straightforward way to select the features most relevant to the prediction task. It also allows the importance of individual features to be evaluated and provides a way to visualize the relationship between the features and prediction performance. RFE is also computationally efficient and can handle large datasets with many features.
- R packages
- caret: ?rfe
- Boruta: ?Boruta
- mlr: Feature Selection
- In mathematical terms, RFE can be formulated as follows:
- Initialize the feature set F to contain all features in the dataset X.
- Fit a model, such as a Support Vector Machine (SVM), to the data X using the current feature set F.
- Rank the features in F based on their importance, as determined by the model coefficients or other feature importance measures.
- Remove the feature with the lowest importance from the feature set F.
- Repeat steps 2-4 until a stopping criterion is reached.
At each iteration of the RFE process, the model is refitted with the remaining features, and the importance of each feature is re-evaluated. By removing the least important features at each iteration, RFE can identify a subset of the most important features that contribute to the prediction performance of the model.
- Backward elimination is not a special case of Recursive Feature Elimination (RFE). While both methods aim to select the most relevant features for a model, they use different approaches. Backward elimination is specific to linear regression models and relies on p-values to decide which features to remove, while RFE can be used with any model that assigns weights to features or has a feature importance attribute, and removes the weakest features based on their importance.
- When using Recursive Feature Elimination (RFE) with the Random Forest algorithm, RFE starts by training a Random Forest model on the entire set of features and computing the importance of each feature. The least important features are then removed from the current set of features, and the process is repeated on the pruned set until the desired number of features is reached. The importance of each feature can be determined by the Random Forest algorithm’s internal method for measuring feature importance.
- The number of features to select in Recursive Feature Elimination (RFE) is a hyperparameter that can be chosen based on the specific problem and dataset. There are several approaches to determining the optimal number of features:
- Domain knowledge: If you have prior knowledge about the problem domain, you may have an idea of how many features are relevant and should be included in the model.
- Cross-validation: You can use cross-validation to evaluate the performance of the model with different numbers of features and choose the number that results in the best performance.
- RFECV: In scikit-learn, you can use the RFECV class, which performs RFE with cross-validation to automatically find the optimal number of features.
SVM RFE
- Support Vector Machines (SVMs) can be used to perform Recursive Feature Elimination (RFE). RFE is a feature selection method that involves iteratively removing the least important features from a dataset and re-fitting a model until a stopping criterion is reached. The goal of RFE is to identify a subset of the most important features that can be used to build a predictive model with good accuracy.
- SVMs are a type of machine learning algorithm that can be used for classification and regression tasks. They work by finding the hyperplane that maximizes the margin between the classes in a dataset. The hyperplane is determined by a subset of the training points, called support vectors, that have the largest influence on the classification decision.
- To perform RFE with SVMs, one can use the support vectors as a measure of feature importance and remove the features with the smallest magnitude of coefficients in the SVM model. At each iteration, the SVM model is refitted with the remaining features and the process is repeated until a stopping criterion is reached.
- In this way, RFE with SVMs can be used to identify a subset of the most important features that contribute to the prediction performance of the SVM model. RFE with SVMs can also be used to handle high-dimensional datasets with many features, as it can help reduce the dimensionality of the data and improve the interpretability of the model.
- Common stopping criteria used in RFE with SVMs include:
- Number of Features: One common stopping criterion is to specify a fixed number of features to be included in the final model. For example, one can specify that the RFE process should stop when the number of features is reduced to a certain value, such as 10 or 50.
- Cross-Validation Accuracy: Another common stopping criterion is to use cross-validation accuracy as a measure of performance. The RFE process can be stopped when the cross-validation accuracy reaches a certain threshold, or when it starts to decrease, indicating that further feature elimination is not beneficial for improving performance.
- Classification Error: A third common stopping criterion is to use the classification error, or misclassification rate, as a measure of performance. The RFE process can be stopped when the classification error reaches a certain threshold, or when it starts to increase, indicating that further feature elimination is not beneficial for improving performance.
- The choice of stopping criterion will depend on the specific requirements of the user and the characteristics of the dataset. It is important to select a stopping criterion that balances the need for feature reduction with the need to preserve performance and avoid overfitting.
- One example:
library(caret)
library(mlbench)
library(randomForest)
data(Sonar)
X <- Sonar[, 1:60]
y <- Sonar[, 61]
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)  # 10-fold CV
# For SVM-based RFE, use functions = caretFuncs and pass method = "svmLinear" to rfe()
rfe_result <- rfe(x = X, y = y, sizes = 1:10, rfeControl = ctrl)  # candidate feature subset sizes
print(rfe_result)
plot(rfe_result)
- Another example:
library(caret)
data(iris)
# Split the data into training and testing sets
set.seed(123)
train_index <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_data <- iris[train_index, ]
test_data  <- iris[-train_index, ]
# Preprocess the data
preProcess_obj <- preProcess(train_data[, -5], method = c("center", "scale"))
train_data_preprocessed <- predict(preProcess_obj, train_data[, -5])
train_data_preprocessed$Species <- train_data$Species
# Recursive feature elimination with random-forest importance (rfFuncs).
# For a true SVM-RFE, use functions = caretFuncs and pass method = "svmLinear" to rfe().
ctrl <- rfeControl(functions = rfFuncs, method = "repeatedcv", repeats = 3, verbose = FALSE)
rfe_model <- rfe(x = train_data_preprocessed[, -5],
                 y = train_data_preprocessed$Species,
                 sizes = 1:4, rfeControl = ctrl)
print(rfe_model)
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold, repeated 3 times)
#
# Resampling performance over subset size:
#
#  Variables Accuracy  Kappa AccuracySD KappaSD Selected
#          1   0.9583 0.9375    0.06092 0.09139
#          2   0.9611 0.9417    0.06086 0.09129        *
#          3   0.9583 0.9375    0.06092 0.09139
#          4   0.9583 0.9375    0.06092 0.09139
#
# The top 2 variables (out of 2):
#    Petal.Width, Petal.Length
- The rfe function in the caret package in R can be used with many different classifier methods (Full list), including:
- svmLinear: Linear Support Vector Machine (SVM)
- svmRadial: Radial Support Vector Machine (SVM)
- knn: k-Nearest Neighbors (k-NN)
- rpart: Recursive Partitioning (RPART)
- glm: Generalized Linear Model (GLM)
- glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models (GLMNET)
- xgbLinear: Extreme Gradient Boosting (XGBoost) with a linear objective
- xgbTree: Extreme Gradient Boosting (XGBoost) with a tree-based objective
- randomForest: Random Forest
- gbm: Gradient Boosting Machines (GBM)
- ctree: Conditional Inference Trees (CTREE)
Boruta
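- A minimal sketch of running Boruta on the Sonar data (the dataset and maxRuns value are illustrative choices, not from the original notes):

```r
# All-relevant feature selection with Boruta on the Sonar data
library(Boruta)
library(mlbench)

data(Sonar)
set.seed(1)
b <- Boruta(Class ~ ., data = Sonar, maxRuns = 100)
print(b)                                         # counts of Confirmed/Tentative/Rejected features
getSelectedAttributes(b, withTentative = FALSE)  # names of the confirmed features
attStats(b)                                      # per-feature importance statistics
plot(b, las = 2, cex.axis = 0.5)                 # importance boxplots per feature
```

TentativeRoughFix() can be applied afterwards to force a decision on features left Tentative.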
knn
- How to code kNN algorithm in R from scratch
- Find K nearest neighbors, starting from a distance matrix
- K Nearest Neighbor : Step by Step Tutorial
- Assumptions of KNN
- KNN vs. K-mean
- kNN cons
- A small value of k will lead to a large variance in predictions. Alternatively, setting k to a large value may lead to a large model bias.
- How to find best K value?
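- A common approach is to cross-validate over a grid of k values; a sketch with caret on iris (the dataset and grid are illustrative):

```r
# Choose k for kNN by cross-validated accuracy
library(caret)

data(iris)
set.seed(42)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
fit <- train(Species ~ ., data = iris,
             method     = "knn",
             preProcess = c("center", "scale"),        # kNN is distance-based, so scale first
             tuneGrid   = data.frame(k = seq(1, 21, 2)),
             trControl  = ctrl)
fit$bestTune   # the k with the highest CV accuracy
plot(fit)      # accuracy vs. number of neighbors
```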
Random forest
- randomForest package
- Error: protect(): protection stack overflow. The trick works on my data with 26 obs and 28110 variables.
- A complete guide to Random Forest in R. n=1000, p=21. The ROC part is wrong since the prediction is not done on test data. There are two parameters:
- ntree (number of trees grown in the forest; default is 500) and
- mtry (number of randomly sampled candidate variables at each split; e.g., the square root of the number of predictors for classification). tuneRF() can be used to determine the best mtry.
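- A minimal tuneRF() sketch on the iris predictors (the dataset and the stepFactor/improve values are illustrative; see ?tuneRF):

```r
# Tune mtry via the out-of-bag (OOB) error with randomForest::tuneRF
library(randomForest)

data(iris)
set.seed(1)
tuned <- tuneRF(x = iris[, 1:4], y = iris$Species,
                ntreeTry   = 500,   # trees grown for each mtry tried
                stepFactor = 1.5,   # mtry is inflated/deflated by this factor each step
                improve    = 0.01,  # minimum relative OOB improvement to keep searching
                trace      = TRUE)
tuned  # matrix of mtry values and their OOB errors; pick the row with the smallest error
```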
- 8.5 Permutation Feature Importance from Interpretable Machine Learning by Christoph Molnar. No code.
- 15 Variable Importance from caret. varImp() function.
- How to find the most important variables in R
- ?randomForest::importance
- Random Forest Variable Importance and a simple simulated data with 2 correlated predictors (X1~N(0,1), X2~N(X1,1), y=2*X1+N(0,.5^2)). We can evaluate the variable importance from this toy example. The result is satisfying.
- If you just print the importance object from the model, you get the raw importance values. However, when you use the importance() function, the default for the scale argument is TRUE, which returns the importance values divided by their standard errors.
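- A small sketch contrasting the raw and scaled permutation importance (iris is an illustrative choice):

```r
# Raw vs. scaled permutation importance in randomForest
library(randomForest)

data(iris)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
rf$importance                            # raw importance values stored on the model
importance(rf, type = 1, scale = FALSE)  # raw mean decrease in accuracy
importance(rf, type = 1, scale = TRUE)   # default: divided by its standard error
```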
- Measures of variable importance in random forests
- Difference between varImp (caret) and importance (randomForest) for Random Forest
- Utilizing Machine Learning algorithms (GLMnet and Random Forest models) for Genomic Prediction of a Quantitative trait
- Gene selection and classification of microarray data using random forest 2006, and the R package varSelRF (2017-07-10)
- Result - Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.
- Gene selection - The most reliable measure is based on the decrease of classification accuracy when values of a variable in a node of a tree are permuted randomly
- (this measure of variable importance is not the same as a non-parametric statistic of difference between groups, such as could be obtained with a Kruskal-Wallis test)
- Feature Importance in Random Forest
- Question: if a predictor's importance score is not zero, does it mean that predictor has been used in the random forest?
- No, a non-zero importance score does not necessarily mean that a predictor has been used in the random forest. The importance score is a measure of how much the prediction accuracy decreases when the values of that predictor variable are permuted. A non-zero importance score indicates that the predictor variable has some effect on the prediction accuracy, but it does not necessarily mean that the predictor has been used in all trees of the random forest. In fact, it is possible for a predictor variable to have a non-zero importance score even if it is not used in any of the trees in the random forest. This can happen if the predictor variable is correlated with other predictor variables that are used in the model.
Gradient boost
GBDT: Gradient Boosting Decision Trees
rpart and decision tree
- https://cran.r-project.org/web/packages/rpart/index.html and two vignettes.
- How to Fit Classification and Regression Trees in R
- Decision Trees in R using rpart
- Decision Trees and Pruning in R
- R for Statistical Learning
- Prune
- How to prune a tree in R? A common rule of thumb (the 1-SE rule) is to choose the smallest tree whose xerror is within one standard error of the minimum, i.e. xerror < min(xerror) + xstd.
- x-val relative error. What is the difference between rel error and x error in a rpart decision tree?
- Complexity Parameter in Decision Tree
- rpart.plot package
- vignette (why is it not shown on CRAN). Page 3 shows what is included in a node depending on whether the response is binary, continuous or multi-class.
- Binary response: the predicted class, the predicted probability of an event, the percentage of observations in the node.
- Continuous response: the predicted value, the percentage of observations in the node.
- Multi-class: the predicted class, the predicted probability of each class, the percentage of observations in the node.
- Plotting Decision Trees in R with rpart and rpart.plot
- Plotting rpart trees with the rpart.plot package
- Predicting with decision trees using rpart
- Why are the cp values in the plotcp() chart modified from the original table? printcp() gives the minimal cp at which each pruning happens, while plotcp() plots against the geometric mean of successive cp values. (So the cp from plotcp() is convenient to pass to prune(); see also the example below if we want to select the cp that gives the minimum xerror.)
- rattle package
- tree package. cv.tree() function.
- R example. Assume bin=resistant & sensitive.
library(rpart)
library(rpart.plot)
tree <- rpart(bin ~ mgmt + mcm2, data = data,
              control = rpart.control(minsplit = 2, minbucket = 1))
tree
plotcp(tree)     # plot the cost-complexity vs tree size
printcp(tree)    # CP nsplit rel error xerror xstd
plot(tree)       # base R
text(tree, use.n = TRUE, cex = 1, xpd = NA)  # show n(resistant)/n(sensitive) at each terminal
rpart.plot(tree) # show predicted prob of sensitive and percentage of obs in the node
cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(tree, cp = cp)
pruned_tree
rpart.plot(pruned_tree)
parttree package
https://github.com/grantmcdermott/parttree
caret
- https://cran.r-project.org/web/packages/caret/. In the vignette, the pls method was used (so the pls package needs to be installed in order to run the example).
- The tuneLength argument controls how many candidate values per tuning parameter are evaluated,
- while the tuneGrid argument is used when specific values are desired.
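- For example (a sketch with method = "glmnet" on mtcars; the grid values are illustrative):

```r
# tuneLength vs. tuneGrid in caret::train
library(caret)

data(mtcars)
ctrl <- trainControl(method = "cv", number = 5)

# tuneLength: caret generates tuneLength candidate values per tuning parameter
set.seed(1)
m1 <- train(mpg ~ ., data = mtcars, method = "glmnet",
            tuneLength = 5, trControl = ctrl)        # 5 alphas x 5 lambdas

# tuneGrid: evaluate exactly the (alpha, lambda) pairs we specify
set.seed(1)
grid <- expand.grid(alpha  = c(0, 0.5, 1),
                    lambda = 10 ^ seq(-3, 0, length = 10))
m2 <- train(mpg ~ ., data = mtcars, method = "glmnet",
            tuneGrid = grid, trControl = ctrl)

m1$bestTune
m2$bestTune
```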
- caret ebook
- method = 'glmnet' only works for regression and classification problems; see train models by tag
- The caret Package -> Model Training and Tuning
- Predictive Modeling with R and the caret Package (useR! 2013)
- Caret R Package for Applied Predictive Modeling
- coefficients from glmnet and caret are different for the same lambda.
- The exact lambda you specified was not used by caret; the coefficients are interpolated from the coefficients actually calculated.
- When you provide lambda directly to the glmnet call, the model returns exact coefficients for that lambda.
library(caret)
set.seed(0)
train_control <- trainControl(method = 'cv', number = 10)
grid <- 10 ^ seq(5, -2, length = 100)
tune.grid <- expand.grid(lambda = grid, alpha = 0)
ridge.caret <- train(x[train, ], y[train],
                     method = 'glmnet',
                     trControl = train_control,
                     tuneGrid = tune.grid)
ridge.caret$bestTune
ridge_full <- train(x, y,
                    method = 'glmnet',
                    trControl = trainControl(method = 'none'),
                    tuneGrid = expand.grid(lambda = ridge.caret$bestTune$lambda,
                                           alpha = 0))
coef(ridge_full$finalModel, s = ridge.caret$bestTune$lambda)
- trainControl()
- method - "boot", "boot632", "optimism_boot", "boot_all", "cv", "repeatedcv", "LOOCV", "LGOCV", ...
- number - Either the number of folds or number of resampling iterations
- repeats - For repeated k-fold cross-validation only: the number of complete sets of folds to compute
- search - Either "grid" or "random", describing how the tuning parameter grid is determined.
- seeds - an optional set of integers that will be used to set the seed at each resampling iteration. This is useful when the models are run in parallel.
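- Per ?trainControl, seeds should be a list of length B + 1, where B is the number of resamples: the first B elements are integer vectors whose length is at least the number of tuning parameter combinations, and the last element is a single integer for the final model fit. A sketch (the seed values themselves are arbitrary):

```r
# Constructing the seeds argument for reproducible (possibly parallel) resampling
library(caret)

B <- 10   # number of resamples, e.g. 10-fold CV
M <- 25   # number of tuning parameter combinations evaluated per resample

set.seed(123)
seeds <- vector("list", length = B + 1)
for (i in 1:B) seeds[[i]] <- sample.int(10000, M)  # one vector of M seeds per resample
seeds[[B + 1]] <- sample.int(10000, 1)             # single seed for the final model fit

ctrl <- trainControl(method = "cv", number = B, seeds = seeds)
```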
- train(). A long printout comes from the foreach loop when the nominalTrainWorkflow() function is called.
- method - A string specifying which classification or regression model to use. Some examples are knn, lm, nnet, rpart, glmboost, ... Possible values are found using names(getModelInfo()).
- metric - ifelse(is.factor(y_dat), "Accuracy", "RMSE"). By default, possible values are "RMSE" and "Rsquared" for regression and "Accuracy" and "Kappa" for classification. If custom performance metrics are used (via the summaryFunction argument in trainControl), the value of metric should match one of the names that function computes.
- maximize - ifelse(metric %in% c("RMSE", "logLoss", "MAE"), FALSE, TRUE)
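- For example, with twoClassSummary as the summaryFunction, metric must name one of its outputs ("ROC", "Sens", "Spec"); a sketch on Sonar (dataset and method are illustrative):

```r
# Selecting the tuning parameters by AUC via a custom summaryFunction
library(caret)
library(mlbench)

data(Sonar)
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,                # class probabilities needed for ROC
                     summaryFunction = twoClassSummary)
set.seed(1)
fit <- train(Class ~ ., data = Sonar, method = "glmnet",
             metric = "ROC",                           # must match a twoClassSummary output
             tuneLength = 3,
             trControl = ctrl)
fit$bestTune
fit$results[, c("alpha", "lambda", "ROC", "Sens", "Spec")]
```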
- Return object from train()
- results: a data frame with one row per tuning parameter combination (alpha, lambda)
- finalModel$lambda: the full lambda path that glmnet computed for the final fit (often near 100 values), not the values we specified; coefficients at other lambdas are interpolated from this path.
- finalModel$lambdaOpt
- finalModel$tuneValue$alpha, finalModel$tuneValue$lambda
- finalModel$a0
- finalModel$beta
Regularization: Ridge, Lasso and Elastic Net. Notice the repeatedcv method. If repeatedcv is selected, the performance RMSE for each (alpha, lambda) is computed as the average of the (# CV folds * # repeats) RMSEs. The best tuning parameter (alpha, lambda) is then the one with the minimum RMSE. See the related code at Line 679, Line 744, Line 759, where trControl$selectionFunction() = best().
library(glmnet)  # for ridge regression
library(dplyr)   # for data cleaning
data("mtcars")
# Center y; X will be standardized in the modelling function
y <- mtcars %>% select(mpg) %>% scale(center = TRUE, scale = FALSE) %>% as.matrix()
X <- mtcars %>% select(-mpg) %>% as.matrix()
dim(X)
# [1] 32 10

library(caret)
# Set training control
train_control <- trainControl(method = "repeatedcv",
                              number = 3,        # 3-fold CV,
                              repeats = 2,       # repeated 2 times
                              search = "grid",   # or "random"
                              verboseIter = TRUE)
# Train the model
set.seed(123)  # seed for reproducibility
elastic_net_model <- train(mpg ~ .,
                           data = cbind(y, X),
                           method = "glmnet",
                           preProcess = c("center", "scale"),
                           tuneLength = 4,       # 4 alphas x 4 lambdas
                           trControl = train_control)

# Check multiple R-squared
y_hat_enet <- predict(elastic_net_model, X)
cor(y, y_hat_enet)^2

names(elastic_net_model)
#  [1] "method"      "modelInfo"   "modelType"    "results"    "pred"
#  [6] "bestTune"    "call"        "dots"         "metric"     "control"
# [11] "finalModel"  "preProcess"  "trainingData" "resample"   "resampledCM"
# [16] "perfNames"   "maximize"    "yLimits"      "times"      "levels"
# [21] "terms"       "coefnames"   "xlevels"
elastic_net_model$bestTune  # alpha and lambda
names(elastic_net_model$finalModel)
#  [1] "a0"        "beta"      "df"        "dim"       "lambda"
#  [6] "dev.ratio" "nulldev"   "npasses"   "jerr"      "offset"
# [11] "call"      "nobs"      "lambdaOpt" "xNames"    "problemType"
# [16] "tuneValue" "obsLevels" "param"
length(elastic_net_model$finalModel$lambda)
# [1] 95
length(elastic_net_model$finalModel$lambdaOpt)
# [1] 1
dim(elastic_net_model$finalModel$beta)
# [1] 10 95
which(elastic_net_model$finalModel$lambda == elastic_net_model$finalModel$lambdaOpt)
# integer(0)  <== lambdaOpt came from the tuning grid; the final glmnet fit computes
#                 its own lambda path, so coefficients at lambdaOpt are interpolated
- For the Repeated k-fold Cross Validation. The final model accuracy is taken as the mean from the number of repeats.
- Penalized Regression Essentials: Ridge, Lasso & Elastic Net -> Using caret package. tuneLength was used. Note that the results element gives the details for each candidate of parameter combination.
# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data  <- Boston[-training.samples, ]
set.seed(123)  # affects the CV samples, not the tuning parameter set
cv_5 <- trainControl(method = "cv", number = 5)  # number of folds
model <- train(medv ~ ., data = train.data,
               method = "glmnet",
               trControl = cv_5,
               tuneLength = 10)  # 10 alphas x 10 lambdas
model          # one row per parameter combination
model$results
# Best tuning parameter
model$bestTune
seeds
CV
Purpose: estimate model accuracy
- K-Fold Cross Validation in R (Step-by-Step)
ctrl <- trainControl(method = "LOOCV")            # leave-one-out CV
ctrl <- trainControl(method = "cv", number = 5)   # k-fold CV
model <- train(y ~ x1 + x2, data = df, method = "lm", trControl = ctrl)  # train()'s default method is 'rf'
print(model)        # RMSE, Rsquared, MAE
model$finalModel    # coefficients
model$resample      # performance (RMSE, Rsquared, MAE) for each fold; 5 rows

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model <- train(y ~ x1 + x2, data = df, method = "lm", trControl = ctrl)
print(model)
model$finalModel
model$resample      # performance for each fold; 5*3 rows

ctrl <- trainControl(method = "boot", number = 100)
model <- train(y ~ x1 + x2, data = df, method = "lm", trControl = ctrl)
print(model)
model$finalModel
model$resample      # performance for each resample; 100 rows
- A basic tutorial of caret: the machine learning package in R
Question: Does the train() function from the caret package re-select features during cross-validation? Ans: The train() function in the caret package does not perform feature selection during cross-validation by default. However, you can use the rfe() (recursive feature elimination) function in the caret package to perform recursive feature selection with cross-validation.
Compare models
- Bland–Altman plot. It appears in the caret vignette. If we have paired data (e.g. one represents the AUC from the model A, and the other represents the AUC from the model B), we draw a scatterplot that X-axis = average, Y-axis = difference. M vs A plot.
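- A base-R sketch of such a plot; the paired AUC values below are made up for illustration:

```r
# Bland-Altman (mean-difference) plot for paired model performance
# auc_a / auc_b: hypothetical AUCs of models A and B on the same datasets
auc_a <- c(0.71, 0.78, 0.80, 0.74, 0.69, 0.83)
auc_b <- c(0.70, 0.75, 0.82, 0.71, 0.70, 0.80)

avg <- (auc_a + auc_b) / 2   # X-axis: average of each pair
dif <- auc_a - auc_b         # Y-axis: difference of each pair

plot(avg, dif, xlab = "Average AUC", ylab = "Difference in AUC (A - B)",
     main = "Bland-Altman plot")
abline(h = mean(dif), lty = 2)                             # mean difference (bias)
abline(h = mean(dif) + c(-1.96, 1.96) * sd(dif), lty = 3)  # 95% limits of agreement
```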
Resources
- Caret Package – A Practical Guide to Machine Learning in R. Multivariate Adaptive Regression Splines (MARS) was used (method='earth'). Binary response.
- 3.1. How to split the dataset into training and validation?
- 3.2. Descriptive statistics (skimr package)
- 3.3. How to impute missing values using preProcess()?
- 4. How to visualize the importance of variables using featurePlot()
- 5. How to do feature selection using recursive feature elimination (rfe)?
- 6.1. How to train() the model and interpret the results?
- 6.3. Prepare the test dataset and predict
- 7.1. Setting up the trainControl()
- 7.2 Hyper Parameter Tuning using tuneLength
- 7.3. Hyper Parameter Tuning using tuneGrid
- 8.5. Run resamples() to compare the models (adaboost, random forest, xgBoost, SVM)
- A basic tutorial of caret: the machine learning package in R. Random forest (ranger package) was used. Binary response.
- A simple view of caret: the default train(), predict(), confusionMatrix().
- Pre-processing preProcess()
- Data splitting createDataPartition() and groupKFold()
- Resampling options trainControl()
- Model parameter tuning options tuneGrid
- https://www.stat.colostate.edu/~jah/talks_public_html/isec2020/caret_package.html by Jennifer Hoeting. It has a link to Max Kuhn's talk from 2013.
- Machine learning in R with caret. diamonds data, n=53940, p=10.
- http://rafalab.dfci.harvard.edu/dsbook/caret.html
- knn and plot of accuracy vs neighbors from CV
- knn, gamLoess
- Beginner's Guide to Caret. Regression problem.
- Machine Learning in R with caret. n= 932, p=9.
- rf, knn, glmnet
- Predictive modeling and machine learning in R with the caret package. n=63, p=12.
- lm, rf, glmnet
- R Tutorial Series: Machine Learning with caret. GermanCredit data, n=1000, p=62.
- rf
- Videos
- An introduction to caret: A machine learning library in R from UseR Oslo
- The Caret package: your new best friend
- Machine Learning in R with caret : A tutorial for building and validating statistical models
- Modeling and Probability Analysis with GBM, GLMNET and CARET: ML with R
- STAT 432 /// The {caret} Package
- Intro to Machine Learning with R & caret from Data Science Dojo
- Part 3: Training and Tuning ML Models in R with "caret" | R Tutorial (2021), part 2 Feature Elimination and Variable Importance in R with "caret" (2021) by RichardOnData
Clinical prediction
- Evaluation of clinical prediction models (part 1): from development to external validation 2024
- Evaluation of clinical prediction models (part 2): how to undertake an external validation study
- Evaluation of clinical prediction models (part 3): calculating the sample size required for an external validation study
Books
- Supervised Machine Learning for Science How to stop worrying and love your black box