Getting Started
First, install the ESTYP library if you haven't already:
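The import paths in the examples below use the package name estyp, so a standard pip install should work:

pip install estyp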
Here are some examples of how to use the library:
Model Selection
We will now select a logistic regression model that best classifies the versicolor category.
First, we load the data:
from sklearn.datasets import load_iris
import pandas as pd
content = load_iris()
data = pd.DataFrame(content.data, columns=[f"x{i+1}" for i in range(content.data.shape[1])])
data["y"] = (content.target == 1).astype(int)
print(data.head())
    x1   x2   x3   x4  y
0  5.1  3.5  1.4  0.2  0
1  4.9  3.0  1.4  0.2  0
2  4.7  3.2  1.3  0.2  0
3  4.6  3.1  1.5  0.2  0
4  5.0  3.6  1.4  0.2  0
Then, we run a model selection process with forward and both (forward and backward) steps:
Review LogisticRegression(), forward_selection() and both_selection() documentation for more information and other parameters.
from estyp.linear_model.stepwise import forward_selection, both_selection
from estyp.linear_model import LogisticRegression
formula = "y ~ x1 + x2 + x3 + x4"
ff1 = forward_selection(
y = "y",
data = data,
model = LogisticRegression,
verbose = False,
)
ff2 = both_selection(
formula = formula,
data = data,
model = LogisticRegression,
verbose = False
)
print("- Forward result:", ff1)
print("- Both result :", ff2)
- Forward result: y ~ x2
- Both result : y ~ x1 + x2 + x3 + x4
Now we choose between the two resulting models using a nested-models test:
Review nested_models_test() documentation for more information and other parameters.
from estyp.testing import nested_models_test
model1 = LogisticRegression.from_formula(ff1, data).fit()
model2 = LogisticRegression.from_formula(ff2, data).fit()
nested_models_test(model1, model2) # First model is nested in the second one
Nested models F-test
F = 2.2843 | df: {'df_num': 3, 'df_den': 145} | p-value = 0.0814
alternative hypothesis: big model is true
sample estimates:
Difference in deviances between models: 6.856231
With \(\alpha=0.05\), the null hypothesis is not rejected: model2 is not significantly better than model1.
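As a quick sanity check (not part of ESTYP's API), the reported p-value follows from the F statistic and its degrees of freedom via SciPy's F distribution:

from scipy import stats

# Upper-tail probability of F(3, 145) at the reported statistic;
# prints roughly 0.0814, matching the test output above.
print(stats.f.sf(2.2843, dfn=3, dfd=145))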
Equality of means of two samples
We will now test whether the means of the x1 and x4 columns are equal.
View the details of the t-test for more information.
Review t_test() documentation for more information and other parameters.
from estyp.testing import t_test
x = data["x1"]
y = data["x4"]
test_result = t_test(x, y)
print(test_result)
Welch's Two Sample t-test
T = 50.5360 | df: 295.98 | p-value = <0.0001
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.463150 4.824850
sample estimates:
[mean of x, mean of y]: [5.843333, 1.199333]
With \(\alpha=0.05\), the null hypothesis is rejected: mean of x is significantly different from the mean of y.
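If you want an independent cross-check, SciPy implements the same Welch variant (equal_var=False disables the pooled-variance assumption):

from scipy import stats

# Welch's two-sample t-test on the same columns;
# the statistic and p-value should match the ESTYP output above.
res = stats.ttest_ind(x, y, equal_var=False)
print(res.statistic, res.pvalue)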
Equality of variances of two samples
We will now test whether the variances of the x1 and x4 columns are equal.
View the details of the variance test for more information.
Review var_test() documentation for more information and other parameters.
from estyp.testing import var_test
test_result = var_test(x, y)
print(test_result)
F test to compare two variances
F = 1.1802 | df: {'x': 149, 'y': 149} | p-value = 0.3130
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.854964 1.629111
sample estimates:
ratio of variances: 1.180183
With \(\alpha=0.05\), the null hypothesis is not rejected: the variance of x is not significantly different from the variance of y.
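SciPy has no direct counterpart to this F test, but the statistic and two-sided p-value are easy to reconstruct by hand; a minimal sketch:

import numpy as np
from scipy import stats

# Ratio of unbiased sample variances and its F-distribution tails.
f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)
df1, df2 = len(x) - 1, len(y) - 1
# Two-sided p-value: twice the smaller tail probability.
p_value = 2 * min(stats.f.sf(f_stat, df1, df2), stats.f.cdf(f_stat, df1, df2))
print(f_stat, p_value)  # roughly 1.1802 and 0.3130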
Correlation between two samples
We will now test if the correlation between x1 and x4 is greater than 0.
Review cor_test() documentation for more information and other parameters.
from estyp.testing import cor_test
test_result = cor_test(x, y, alternative="greater", method="spearman")
print(test_result)
Spearman's rank correlation rho
S = 93208.4208 | p-value = <0.0001
alternative hypothesis: true rho is greater than 0
sample estimates:
rho: 0.834289
With \(\alpha=0.05\), the null hypothesis is rejected: Spearman correlation between x and y is significantly greater than 0.
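The same one-sided Spearman test is available in SciPy (the alternative argument requires SciPy >= 1.7) and should report the same rho:

from scipy import stats

# One-sided Spearman rank correlation test (H1: rho > 0).
rho, p_value = stats.spearmanr(x, y, alternative="greater")
print(rho, p_value)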
Proportions testing
We will now test whether the proportion of non-versicolor flowers is equal to 0.75.
Review prop_test() documentation for more information and other parameters.
from estyp.testing import prop_test
counts = data["y"].value_counts()  # majority class first: [100 non-versicolor, 50 versicolor]
test_result = prop_test(counts, p=0.75)
print(test_result)
1-sample test for equality of proportions with continuity correction
X-squared = 5.1200 | df: 1 | p-value = 0.0237
alternative hypothesis: the true proportion is not equal to 0.7500
95 percent confidence interval:
0.584468 0.740179
sample estimates:
proportion(s): 0.666667
With \(\alpha=0.05\), the null hypothesis is rejected: the proportion of non-versicolor flowers is significantly different from 0.75.
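The X-squared value above matches the Yates continuity-corrected formula used by R's prop.test(); reconstructing it by hand shows where 5.12 comes from:

from scipy import stats

count, n, p0 = 100, 150, 0.75  # 100 of 150 flowers are non-versicolor
# Continuity-corrected chi-squared statistic for a single proportion.
x_squared = (abs(count - n * p0) - 0.5) ** 2 / (n * p0 * (1 - p0))
p_value = stats.chi2.sf(x_squared, df=1)
print(x_squared, p_value)  # 5.12 and roughly 0.0237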
Searching for the Optimal Number of Clusters
We will now search for the optimal number of clusters in the iris dataset using the elbow method.
Review NClusterSearch() documentation for more information and other parameters.
from estyp.cluster import NClusterSearch
from sklearn.cluster import KMeans
# Standardize the four features (zero mean, unit variance) before clustering
X = data.iloc[:, :-1].apply(lambda x: (x - x.mean()) / x.std())
searcher = NClusterSearch(
estimator = KMeans(n_init="auto"),
method = "elbow",
random_state = 2023
)
searcher.fit(X)
print("- Clusters suggested: ", searcher.optimal_clusters_)
print("- Best estimator : ", searcher.best_estimator_)
searcher.plot()
- Clusters suggested: 3
- Best estimator : KMeans(n_clusters=3, n_init='auto', random_state=2023)
[Elbow plot: within-cluster inertia versus number of clusters]
The number of clusters suggested is 3.
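For intuition, here is a minimal manual version of what the elbow method does (an illustration, not NClusterSearch's actual implementation): fit KMeans for a range of k and watch where the inertia curve flattens.

from sklearn.cluster import KMeans

# Within-cluster sum of squares (inertia) for k = 1..10;
# the "elbow" is where adding a cluster stops paying off.
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init="auto", random_state=2023).fit(X)
    print(k, round(km.inertia_, 1))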
Linear Regression Model Assumptions
We will now test the assumptions of a linear regression model.
Review CheckModel() documentation for more information and other parameters.
from estyp.testing import CheckModel
import statsmodels.api as sm
model = sm.OLS.from_formula('x4 ~ x1 + x2 + x3', data=data).fit()
checker = CheckModel(model)
checker.check_all()
Normality tests results:
- Residuals appear as normally distributed according to KS test (p-value = 0.545).
- Residuals appear as normally distributed according to Shapiro-Wilk test (p-value = 0.088).
- Residuals don't appear as normally distributed according to Jarque-Bera test (p-value = 0.033).
- Residuals appear as normally distributed according to Omni test (p-value = 0.061).

Homocedasticity tests results:
- Heteroscedasticity (non-constant error variance) detected according to Breusch-Pagan test (p-value = 0.000).
- Heteroscedasticity (non-constant error variance) detected according to White test (p-value = 0.004).
- Heteroscedasticity (non-constant error variance) detected according to Goldfeld-Quandt test (p-value = 0.000).

Independence tests results:
- Residuals appear to be independent and not autocorrelated according to DW test (DW-Statistic = 1.573)
- Autocorrelated residuals detected according to Box-Pierce test (p-value = 0.008).
- Autocorrelated residuals detected according to Breusch-Godfrey test (p-value = 0.038).

Multicollinearity test results:
- The model may have multicollinearity problems (condition number = 90.12).

Only the normality assumption appears to hold; the tests flag heteroscedasticity, autocorrelated residuals, and possible multicollinearity.
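The diagnostics that CheckModel aggregates also exist as standalone functions; for instance, a cross-check of three of them with statsmodels and SciPy (a sketch, reusing the fitted model object from above):

from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Breusch-Pagan LM test: H0 is constant error variance.
bp_stat, bp_pvalue, _, _ = het_breuschpagan(model.resid, model.model.exog)
# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation.
dw = durbin_watson(model.resid)
# Shapiro-Wilk: H0 is normally distributed residuals.
sw_stat, sw_pvalue = stats.shapiro(model.resid)
print(bp_pvalue, dw, sw_pvalue)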