Getting Started

First, install the ESTYP library if you haven't already (assuming the PyPI package name matches the import name, estyp):
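
pip install estyp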

Here are some examples of how to use the library:

Model Selection

We will now select a logistic regression model that best classifies the versicolor category.

First, we load the data:

from sklearn.datasets import load_iris
import pandas as pd

content = load_iris()

data = pd.DataFrame(content.data, columns=[f"x{i+1}" for i in range(content.data.shape[1])])
data["y"] = (content.target == 1).astype(int)
print(data.head())
    x1   x2   x3   x4  y
0  5.1  3.5  1.4  0.2  0
1  4.9  3.0  1.4  0.2  0
2  4.7  3.2  1.3  0.2  0
3  4.6  3.1  1.5  0.2  0
4  5.0  3.6  1.4  0.2  0

Then, we run two model selection processes: one with forward steps only, and one with both directions (forward and backward):

Review LogisticRegression(), forward_selection() and both_selection() documentation for more information and other parameters.

from estyp.linear_model.stepwise import forward_selection, both_selection
from estyp.linear_model import LogisticRegression

formula = "y ~ x1 + x2 + x3 + x4"

ff1 = forward_selection(
    y       = "y",
    data    = data,
    model   = LogisticRegression,
    verbose = False,
)
ff2 = both_selection(
    formula = formula,
    data    = data,
    model   = LogisticRegression,
    verbose = False
)
print("- Forward result:", ff1)
print("- Both result   :", ff2)
- Forward result: y ~ x2
- Both result   : y ~ x1 + x2 + x3 + x4

Now we choose between the two resulting models using a nested models test:

View nested_models_test() documentation for more information and other parameters.

from estyp.testing import nested_models_test

model1 = LogisticRegression.from_formula(ff1, data).fit()
model2 = LogisticRegression.from_formula(ff2, data).fit()

nested_models_test(model1, model2) # First model is nested in the second one

    Nested models F-test
    F = 2.2843 | df: {'df_num': 3, 'df_den': 145} | p-value = 0.0814
    alternative hypothesis: big model is true
    sample estimates:
      Difference in deviances between models: 6.856231
    

With \(\alpha=0.05\), the null hypothesis is not rejected: model2 is not significantly better than model1, so the simpler model1 is preferred.
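
As a quick sanity check, the reported p-value can be reproduced from the F statistic and its degrees of freedom (a sketch using scipy, which is not part of ESTYP; the numbers are taken from the output above):

from scipy.stats import f

# Survival function: P(F > 2.2843) for an F(3, 145) distribution
p_value = f.sf(2.2843, 3, 145)  # dfn=3, dfd=145
print(round(p_value, 4))  # ~0.0814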

Equality of means of two samples

We will now test whether the means of the x1 and x4 columns are equal.

View the details of the t-test for more information.

Review t_test() documentation for more information and other parameters.

from estyp.testing import t_test

x = data["x1"]
y = data["x4"]

test_result = t_test(x, y)
print(test_result)

    Welch's Two Sample t-test
    T = 50.5360 | df: 295.98 | p-value = <0.0001
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     4.463150 4.824850
    sample estimates:
      [mean of x, mean of y]: [5.843333, 1.199333]
    

With \(\alpha=0.05\), the null hypothesis is rejected: the mean of x is significantly different from the mean of y.
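
For reference, the result can be cross-checked against scipy (a sketch, assuming scipy is installed; equal_var=False selects Welch's unequal-variances t-test):

from scipy.stats import ttest_ind

stat, pval = ttest_ind(x, y, equal_var=False)  # Welch's t-test
print(round(stat, 4), pval)  # the statistic should match T above (~50.5360)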

Equality of variances of two samples

We will now test whether the variances of the x1 and x4 columns are equal.

View the details of the variance test for more information.

Review var_test() documentation for more information and other parameters.

from estyp.testing import var_test

test_result = var_test(x, y)
print(test_result)

    F test to compare two variances
    F = 1.1802 | df: {'x': 149, 'y': 149} | p-value = 0.3130
    alternative hypothesis: true ratio of variances is not equal to 1
    95 percent confidence interval:
     0.854964 1.629111
    sample estimates:
      ratio of variances: 1.180183
    

With \(\alpha=0.05\), the null hypothesis is not rejected: the variance of x is not significantly different from the variance of y.
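
Under the hood, the F statistic here is simply the ratio of the two sample variances. A minimal sketch of the two-sided computation (assuming scipy; this appears to mirror the logic of R's var.test):

from scipy.stats import f

ratio = x.var(ddof=1) / y.var(ddof=1)      # ~1.180183
df_x, df_y = len(x) - 1, len(y) - 1        # 149 and 149
# Two-sided p-value: twice the smaller tail probability
p_value = 2 * min(f.cdf(ratio, df_x, df_y), f.sf(ratio, df_x, df_y))
print(round(ratio, 6), round(p_value, 4))  # ~1.180183, ~0.3130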

Correlation between two samples

We will now test whether the correlation between x1 and x4 is greater than 0.

Review cor_test() documentation for more information and other parameters.

from estyp.testing import cor_test

test_result = cor_test(x, y, alternative="greater", method="spearman")
print(test_result)

    Spearman's rank correlation rho
    S = 93208.4208 | p-value = <0.0001
    alternative hypothesis: true rho is greater than 0
    sample estimates:
      rho: 0.834289
    

With \(\alpha=0.05\), the null hypothesis is rejected: the Spearman correlation between x and y is significantly greater than 0.
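
The estimate can be cross-checked against scipy's implementation (a sketch; the alternative keyword requires scipy >= 1.7):

from scipy.stats import spearmanr

rho, pval = spearmanr(x, y, alternative="greater")
print(round(rho, 6))  # should match the rho above (~0.834289)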

Proportions testing

We will now test whether the proportion of non-versicolor flowers is equal to 0.75.

Review prop_test() documentation for more information and other parameters.

from estyp.testing import prop_test

counts = data["y"].value_counts()  # y = 0 (non-versicolor, 100) listed first, then y = 1 (50)

test_result = prop_test(counts, p=0.75)
print(test_result)

    1-sample test for equality of proportions with continuity correction
    X-squared = 5.1200 | df: 1 | p-value = 0.0237
    alternative hypothesis: the true proportion is not equal to 0.7500
    95 percent confidence interval:
     0.584468 0.740179
    sample estimates:
      proportion(s): 0.666667
    

With \(\alpha=0.05\), the null hypothesis is rejected: the proportion of non-versicolor flowers is significantly different from 0.75.
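
For intuition, the reported statistic can be reproduced by hand. With Yates' continuity correction, the one-sample chi-squared statistic is \(X^2 = (|x - n p_0| - 0.5)^2 / (n p_0 (1 - p_0))\), as in R's prop.test, which this output resembles. A sketch assuming scipy:

from scipy.stats import chi2

count, n, p0 = 100, 150, 0.75  # 100 of 150 flowers are non-versicolor
x2 = (abs(count - n * p0) - 0.5) ** 2 / (n * p0 * (1 - p0))
p_value = chi2.sf(x2, 1)  # df = 1
print(round(x2, 4), round(p_value, 4))  # 5.12, ~0.0237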

Searching Optimal Number of Clusters

We will now search for the optimal number of clusters in the iris dataset using the elbow method.

Review NClusterSearch() documentation for more information and other parameters.

from estyp.cluster import NClusterSearch
from sklearn.cluster import KMeans

X = data.iloc[:, :-1].apply(lambda x: (x - x.mean()) / x.std())  # standardize each feature

searcher = NClusterSearch(
    estimator    = KMeans(n_init="auto"),
    method       = "elbow",
    random_state = 2023
)
searcher.fit(X)

print("- Clusters suggested: ", searcher.optimal_clusters_)
print("- Best estimator    : ", searcher.best_estimator_)
searcher.plot()
- Clusters suggested:  3
- Best estimator    :  KMeans(n_clusters=3, n_init='auto', random_state=2023)
[Figure: elbow plot produced by searcher.plot()]

The number of clusters suggested is 3.
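
To see what the elbow method is doing, here is a minimal sketch using scikit-learn directly: fit KMeans over a range of k and look for the point where the inertia curve stops dropping sharply.

from sklearn.cluster import KMeans

inertias = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init="auto", random_state=2023).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squares
print(inertias)  # the decrease slows markedly after k = 3, hence the suggestion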

Linear Regression Model Assumptions

We will now test the assumptions of a linear regression model.

Review CheckModel() documentation for more information and other parameters.

from estyp.testing import CheckModel
import statsmodels.api as sm

model = sm.OLS.from_formula('x4 ~ x1 + x2 + x3', data=data).fit()
checker = CheckModel(model)
checker.check_all()
Normality tests results:
- Residuals appear as normally distributed according to KS test (p-value = 0.545).
- Residuals appear as normally distributed according to Shapiro-Wilk test (p-value = 0.088).
- Residuals don't appear as normally distributed according to Jarque-Bera test (p-value = 0.033).
- Residuals appear as normally distributed according to Omni test (p-value = 0.061).
[Figure: residual normality diagnostic plot]
Homocedasticity tests results:
- Heteroscedasticity (non-constant error variance) detected according to Breusch-Pagan test (p-value = 0.000).
- Heteroscedasticity (non-constant error variance) detected according to White test (p-value = 0.004).
- Heteroscedasticity (non-constant error variance) detected according to Goldfeld-Quandt test (p-value = 0.000).
[Figure: homoscedasticity diagnostic plot]
Independence tests results:
- Residuals appear to be independent and not autocorrelated according to DW test (DW-Statistic = 1.573)
- Autocorrelated residuals detected according to Box-Pierce test (p-value = 0.008).
- Autocorrelated residuals detected according to Breusch-Godfrey test (p-value = 0.038).
[Figure: residual independence diagnostic plot]
Multicollinearity test results:
- The model may have multicollinearity problems (condition number = 90.12).
[Figure: multicollinearity diagnostic plot]

Based on these results, only the residual normality assumption appears reasonably supported (three of the four normality tests do not reject it); the model shows heteroscedasticity, some evidence of autocorrelated residuals, and possible multicollinearity.
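
If you want to run any of these diagnostics individually, the underlying tests are available in statsmodels (a sketch; CheckModel's exact calls and options may differ):

from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson, jarque_bera

resid = model.resid
bp_stat, bp_pval, _, _ = het_breuschpagan(resid, model.model.exog)
print(f"Breusch-Pagan p-value: {bp_pval:.3f}")                 # small -> heteroscedasticity
print(f"Durbin-Watson statistic: {durbin_watson(resid):.3f}")  # ~2 -> no autocorrelation
jb_stat, jb_pval, _, _ = jarque_bera(resid)
print(f"Jarque-Bera p-value: {jb_pval:.3f}")                   # small -> non-normal residuals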