Prediction: Difference between revisions

From 太極
= Rules =
* What it takes to develop a successful (clinical) prediction model. [https://twitter.com/MaartenvSmeden/status/1655920315350884357?s=20 TEN important things to avoid]
= Feature selection =
* [http://r-statistics.co/Variable-Selection-and-Importance-With-R.html Feature Selection Approaches] from http://r-statistics.co

Revision as of 14:46, 9 May 2023

Rules

Feature selection

Recursive feature elimination

  • Recursive Feature Elimination (RFE) is a feature selection algorithm introduced by Isabelle Guyon and her colleagues. It recursively selects a subset of relevant features from a large pool of candidate features in a dataset.
  • The basic idea behind RFE is to recursively eliminate the features that are least informative for the prediction task. At each iteration, the algorithm fits a model, ranks the features by importance (for example, by the magnitude of the model coefficients), and removes the lowest-ranked feature. Elimination continues until a stopping criterion is met, such as a fixed number of features, a minimum prediction performance (measured by a metric such as accuracy or F1-score), or a user-defined threshold.
  • RFE has been applied to a variety of machine learning problems, including classification, regression, and clustering. It is itself a wrapper method, and it is often combined with other feature selection approaches, such as filter methods, to make the selection more robust.
  • One advantage of RFE is that it provides a simple way to select the features that are most relevant to the prediction task. It also yields a ranking of the individual features and makes it possible to plot prediction performance against the number of features retained. RFE can handle datasets with many features, although refitting the model at every iteration can be computationally expensive; removing several features per iteration reduces this cost.
  • R packages
  • In mathematical terms, RFE can be formulated as follows:
    1. Initialize the feature set F to contain all features in the dataset X.
    2. Fit a model, such as a Support Vector Machine (SVM), to the data X using the current feature set F.
    3. Rank the features in F based on their importance, as determined by the model coefficients or other feature importance measures.
    4. Remove the feature with the lowest importance from the feature set F.
    5. Repeat steps 2-4 until a stopping criterion is reached.
    The stopping criterion can be defined as a fixed number of features to be included in the final model, a certain threshold of cross-validation accuracy, or a threshold of classification error, among others.
    At each iteration of the RFE process, the model is refitted with the remaining features, and the importance of each feature is re-evaluated. By removing the least important features at each iteration, the RFE process can identify a subset of the most important features that contribute to the prediction performance of the model.
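
The five steps above can be sketched in a few lines of base R. This is only a minimal illustration under stated assumptions, not a reference implementation: it swaps in an ordinary linear model (`lm`) for the SVM, ranks features by the absolute value of their coefficients on standardized predictors, and stops at a fixed number of features. The function name `rfe_sketch` and the argument `n_keep` are hypothetical.

```r
# Minimal RFE sketch (assumption: linear model, importance = |coefficient|)
rfe_sketch <- function(x, y, n_keep = 2) {
  x <- as.data.frame(scale(x))           # standardize so coefficients are comparable
  features <- colnames(x)                # step 1: start with all features
  eliminated <- character(0)
  while (length(features) > n_keep) {    # step 5: stop at a fixed number of features
    fit <- lm(y ~ ., data = x[, features, drop = FALSE])  # step 2: refit the model
    importance <- abs(coef(fit)[-1])     # step 3: rank features (intercept dropped)
    worst <- names(which.min(importance))
    features <- setdiff(features, worst) # step 4: remove the least important feature
    eliminated <- c(eliminated, worst)
  }
  list(selected = features, eliminated = eliminated)
}

# Usage on a built-in dataset: predict mpg from the other mtcars columns
res <- rfe_sketch(mtcars[, -1], mtcars$mpg, n_keep = 3)
print(res$selected)    # features that survive elimination
print(res$eliminated)  # features in the order they were removed
```

The order of `eliminated` gives a full ranking of the discarded features, with the first entry being the least important under this criterion.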

SVM RFE

  • Support Vector Machines (SVMs) can be used to perform Recursive Feature Elimination (RFE). RFE is a feature selection method that involves iteratively removing the least important features from a dataset and re-fitting a model until a stopping criterion is reached. The goal of RFE is to identify a subset of the most important features that can be used to build a predictive model with good accuracy.
  • SVMs are a type of machine learning algorithm that can be used for classification and regression tasks. They work by finding the hyperplane that maximizes the margin between the classes in a dataset. The hyperplane is determined by a subset of the training observations, called support vectors, that lie closest to the decision boundary and have the largest influence on the classification decision.
  • To perform RFE with a linear SVM, one can use the weight vector of the fitted hyperplane as a measure of feature importance and remove the features whose coefficients have the smallest magnitude. At each iteration, the SVM model is refitted with the remaining features and the process is repeated until a stopping criterion is reached.
  • In this way, RFE with SVMs can be used to identify a subset of the most important features that contribute to the prediction performance of the SVM model. RFE with SVMs can also be used to handle high-dimensional datasets with many features, as it can help reduce the dimensionality of the data and improve the interpretability of the model.
  • Common stopping criteria used in RFE with SVMs include:
    • Number of Features: One common stopping criterion is to specify a fixed number of features to be included in the final model. For example, one can specify that the RFE process should stop when the number of features is reduced to a certain value, such as 10 or 50.
    • Cross-Validation Accuracy: Another common stopping criterion is to use cross-validation accuracy as a measure of performance. The RFE process can be stopped when the cross-validation accuracy reaches a certain threshold, or when it starts to decrease, indicating that further feature elimination is not beneficial for improving performance.
    • Classification Error: A third common stopping criterion is to use the classification error, or misclassification rate, as a measure of performance. The RFE process can be stopped when the classification error reaches a certain threshold, or when it starts to increase, indicating that further feature elimination is not beneficial for improving performance.
    • The choice of stopping criterion will depend on the specific requirements of the user and the characteristics of the dataset. It is important to select a stopping criterion that balances the need for feature reduction with the need to preserve performance and avoid overfitting.
  • An example:
    library(caret)
    data(iris)
    
    # Split the data into training and testing sets
    set.seed(123)
    train_index <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
    train_data <- iris[train_index, ]
    test_data <- iris[-train_index, ]
    
    # Preprocess the data
    preProcess_obj <- preProcess(train_data[, -5], method = c("center", "scale"))
    train_data_preprocessed <- predict(preProcess_obj, train_data[, -5])
    train_data_preprocessed$Species <- train_data$Species
    
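    # Perform Recursive Feature Elimination with an SVM classifier.
    # caretFuncs makes rfe() fit the model through train(), so method = "svmLinear"
    # is used (requires the kernlab package); rfFuncs would fit a random forest instead.
    ctrl <- rfeControl(functions = caretFuncs, method = "repeatedcv", repeats = 3, verbose = FALSE)
    svm_rfe_model <- rfe(x = train_data_preprocessed[, -5],
                         y = train_data_preprocessed$Species,
                         sizes = 1:4,
                         rfeControl = ctrl,
                         method = "svmLinear")
    print(svm_rfe_model)
    # Output as recorded in the original example (exact values depend on the
    # ranking functions, random seed, and package versions):
    # Recursive feature selection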
    #
    # Outer resampling method: Cross-Validated (10 fold, repeated 3 times) 
    # 
    # Resampling performance over subset size:
    #
    #  Variables Accuracy  Kappa AccuracySD KappaSD Selected
    #          1   0.9583 0.9375    0.06092 0.09139         
    #          2   0.9611 0.9417    0.06086 0.09129        *
    #          3   0.9583 0.9375    0.06092 0.09139         
    #          4   0.9583 0.9375    0.06092 0.09139         
    #
    # The top 2 variables (out of 2):
    #    Petal.Width, Petal.Length
    
  • With caretFuncs, the rfe function in the caret package can use many of the model methods available to train (Full list), including:
    • svmLinear: Linear Support Vector Machine (SVM)
    • svmRadial: Radial Support Vector Machine (SVM)
    • knn: k-Nearest Neighbors (k-NN)
    • rpart: Recursive Partitioning (RPART)
    • glm: Generalized Linear Model (GLM)
    • glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models (GLMNET)
    • xgbLinear: Extreme Gradient Boosting (XGBoost) with a linear booster
    • xgbTree: Extreme Gradient Boosting (XGBoost) with a tree booster
    • rf: Random Forest (randomForest package)
    • gbm: Gradient Boosting Machines (GBM)
    • ctree: Conditional Inference Trees (CTREE)
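
The stopping criteria described earlier in this section can be illustrated with the resampling table printed in the example above. The following is a minimal base R sketch, not caret's own selection logic: the variable names, the fixed target of 2 features, and the absolute tolerance `tol` are assumptions made for illustration, while the accuracy values are copied from that printed output.

```r
# Resampling results copied from the printed rfe output above
cv_results <- data.frame(
  n_features = 1:4,
  accuracy   = c(0.9583, 0.9611, 0.9583, 0.9583)
)

# Criterion 1: a fixed number of features (hypothetical target)
fixed_size <- 2

# Criterion 2: the subset size with the highest cross-validated accuracy
best_size <- cv_results$n_features[which.max(cv_results$accuracy)]

# Criterion 3: the smallest subset whose classification error is within an
# absolute tolerance of the lowest error (tol is an assumed value)
error <- 1 - cv_results$accuracy
tol <- 0.01
tolerant_size <- min(cv_results$n_features[error <= min(error) + tol])

print(c(fixed = fixed_size, best = best_size, tolerant = tolerant_size))
```

With these values the accuracy rule picks two variables, matching the size marked as selected in the output, while the tolerance rule accepts a single variable because the differences in accuracy are small. caret provides comparable helpers, `pickSizeBest` and `pickSizeTolerance`; the latter expresses its tolerance as a percentage rather than an absolute difference.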

Boruta

knn

Random forest

Gradient boost

GBDT: Gradient Boosting Decision Trees