The linear_model.stepwise module
Both Method Variable Selection
- both_selection(formula, data, model, max_iter, formula_kwargs, fit_kwargs)
Both Forward and Backward Variable Selection for GLMs
This function performs both forward and backward variable selection using the Akaike Information Criterion (AIC).
- param formula:
A string representing the initial model formula.
- type formula:
str
- param data:
A Pandas DataFrame containing the data to be used for model fitting.
- type data:
DataFrame
- param model:
A statsmodels model class, such as GLM, OLS, or Logit, that determines the type of model to be fit.
- type model:
Union[GLM, OLS, Logit]
- param max_iter:
The maximum number of iterations to perform.
- type max_iter:
int
- param formula_kwargs:
Additional keyword arguments to be passed to the model.from_formula() method.
- type formula_kwargs:
dict
- param fit_kwargs:
Additional keyword arguments to be passed to the fit() method. Defaults to {"disp": 0}.
- return:
A string representing the final model formula.
- rtype:
str
import statsmodels.api as sm
import pandas as pd
from estyp.linear_model.stepwise import both_selection

data = pd.DataFrame({
    "y": [1, 2, 3, 4, 5],
    "x1": [1, 2, 3, 4, 5],
    "x2": [6, 7, 8, 9, 10],
})
formula = "y ~ x1 + x2"
model = sm.OLS
final_formula = both_selection(formula=formula, data=data, model=model)
print(final_formula)
Made by Esteban Rucán. Contact me in LinkedIn: https://www.linkedin.com/in/estebanrucan/
y ~ x1
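The combined forward and backward search by AIC described above can be sketched as follows. This is a minimal illustration, not the estyp implementation: it assumes a Gaussian OLS model fitted with NumPy instead of a statsmodels class, and the names ols_aic and stepwise_aic, the start argument, and the sample data are all hypothetical.

```python
import numpy as np


def ols_aic(y, columns):
    """AIC of a Gaussian OLS fit with an intercept plus the given columns."""
    n = len(y)
    X = np.column_stack([np.ones(n)] + [np.asarray(c, dtype=float) for c in columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Guard against log(0) when the fit is perfect.
    rss = max(float(np.sum((y - X @ beta) ** 2)), 1e-12)
    llf = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    # AIC = 2k - 2 * log-likelihood, with k counting the intercept.
    return 2 * X.shape[1] - 2 * llf


def stepwise_aic(data, target, start=(), max_iter=100):
    """Hypothetical helper: add or drop one predictor per step while AIC improves."""
    y = np.asarray(data[target], dtype=float)
    candidates = [name for name in data if name != target]
    selected = list(start)
    best = ols_aic(y, [data[c] for c in selected])
    for _ in range(max_iter):
        moves = []
        for name in candidates:
            # Backward move if the term is in the model, forward move otherwise.
            if name in selected:
                trial = [c for c in selected if c != name]
            else:
                trial = selected + [name]
            moves.append((ols_aic(y, [data[c] for c in trial]), trial))
        if not moves:
            break
        aic, trial = min(moves, key=lambda move: move[0])
        if aic >= best - 1e-9:  # no single move lowers the AIC: stop
            break
        best, selected = aic, trial
    rhs = " + ".join(selected) if selected else "1"
    return f"{target} ~ {rhs}", float(best)


# Illustrative data: y depends on x1 only; x2 is unrelated to y by construction.
data = {
    "y": [3.5, 4.5, 6.5, 9.5, 11.5, 12.5, 14.5, 17.5],
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, 1, -1, -1, -1, -1, 1, 1],
}
formula, aic = stepwise_aic(data, "y")
print(formula)
```

Considering both kinds of move at every step lets the search undo an earlier addition, which a purely forward search cannot do. Starting from the full model with start=("x1", "x2") exercises the backward moves first.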
Forward Variable Selection
- forward_selection(y, data, model, alpha, formula_kwargs, fit_kwargs)
Forward Variable Selection for GLMs
This function performs forward variable selection using p-values from tests that compare nested models.
- param y:
A string containing the name of the dependent variable (target) to be predicted.
- type y:
str
- param data:
The pandas DataFrame containing both the target variable ‘y’ and the predictor variables for model training.
- type data:
DataFrame
- param model:
A statsmodels model class. The statistical model to be used for model fitting and evaluation. Defaults to sm.OLS.
- type model:
Union[GLM, OLS, Logit, LogisticRegression]
- param alpha:
A number between 0 and 1. The significance level for feature selection. A feature is added to the model if its p-value is less than this alpha value. Defaults to 0.05.
- type alpha:
float
- param formula_kwargs:
Additional keyword arguments to be passed to the model.from_formula() method. Defaults to dict().
- type formula_kwargs:
dict
- param fit_kwargs:
Additional keyword arguments to be passed to the fit() method. Defaults to {"disp": 0}.
- type fit_kwargs:
dict
- return:
A string representing the final model formula.
- rtype:
str
import pandas as pd
import statsmodels.api as sm
from estyp.linear_model.stepwise import forward_selection

# Create sample DataFrame
data = pd.DataFrame({
    "y": [1, 2, 3, 4, 5],
    "X1": [2, 4, 5, 7, 9],
    "X2": [3, 1, 6, 8, 4],
    "X3": [1, 5, 9, 2, 3],
})

# Perform the forward variable selection
formula = forward_selection(
    y="y",
    data=data,
    model=sm.OLS,
    alpha=0.05,
)

# Fit the model using the selected formula
selected_model = sm.OLS.from_formula(formula, data).fit()
print(selected_model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.990
Model:                            OLS   Adj. R-squared:                  0.986
Method:                 Least Squares   F-statistic:                     289.0
Date:                Tue, 08 Aug 2023   Prob (F-statistic):           0.000443
Time:                        00:04:48   Log-Likelihood:                 2.6178
No. Observations:                   5   AIC:                            -1.236
Df Residuals:                       3   BIC:                            -2.017
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.1438      0.203     -0.710      0.529      -0.789       0.501
X1             0.5822      0.034     17.000      0.000       0.473       0.691
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.488
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.336
Skew:                           0.389   Prob(JB):                        0.845
Kurtosis:                       1.998   Cond. No.                         14.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
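Forward selection driven by nested-model p-values can be sketched as follows. This is an illustration under stated assumptions, not the estyp implementation: it assumes a Gaussian OLS model fitted with NumPy, and it compares each candidate model with the current one through a likelihood-ratio test on one degree of freedom, which may differ from the test the library actually uses. The names gaussian_loglik and forward_by_pvalue and the sample data are hypothetical.

```python
import math

import numpy as np


def gaussian_loglik(y, columns):
    """Maximized Gaussian log-likelihood of an OLS fit with an intercept."""
    n = len(y)
    X = np.column_stack([np.ones(n)] + [np.asarray(c, dtype=float) for c in columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Guard against log(0) when the fit is perfect.
    rss = max(float(np.sum((y - X @ beta) ** 2)), 1e-12)
    return -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)


def forward_by_pvalue(data, target, alpha=0.05):
    """Hypothetical helper: add the most significant predictor until none passes alpha."""
    y = np.asarray(data[target], dtype=float)
    remaining = [name for name in data if name != target]
    selected = []
    current = gaussian_loglik(y, [])  # intercept-only model
    while remaining:
        scored = []
        for name in remaining:
            llf = gaussian_loglik(y, [data[c] for c in selected + [name]])
            # Likelihood-ratio statistic for adding one term (assumed test).
            lr = max(2 * (llf - current), 0.0)
            # Survival function of a chi-square variable with 1 degree of freedom.
            p_value = math.erfc(math.sqrt(lr / 2))
            scored.append((p_value, name, llf))
        p_value, name, llf = min(scored)
        if p_value >= alpha:  # no remaining predictor is significant: stop
            break
        selected.append(name)
        remaining.remove(name)
        current = llf
    rhs = " + ".join(selected) if selected else "1"
    return f"{target} ~ {rhs}"


# Illustrative data: y depends on x1 only; x2 is unrelated to y by construction.
data = {
    "y": [3.5, 4.5, 6.5, 9.5, 11.5, 12.5, 14.5, 17.5],
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, 1, -1, -1, -1, -1, 1, 1],
}
print(forward_by_pvalue(data, "y", alpha=0.05))
```

A smaller alpha makes the search more conservative: each candidate needs stronger evidence before it enters the model, so fewer predictors are selected.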