2. SVM_performance.py

2.1. Description

Calculates the following performance metrics using K-fold cross-validation:

  • F1_micro

  • F1_macro

  • Accuracy

  • Precision

  • Recall
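These five scores correspond to standard scikit-learn scoring strings. A minimal sketch of evaluating them with cross-validation (the toy data and variable names here are illustrative, not taken from SVM_performance.py itself):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Toy data: two features per sample, binary labels (0 or 1).
X = np.array([[1560, 795], [784, 219], [2661, 2268],
              [643, 198], [534, 87], [332, 75]], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])

clf = SVC(kernel="linear", C=1.0)
for metric in ("f1_micro", "f1_macro", "accuracy", "precision", "recall"):
    scores = cross_val_score(clf, X, y, cv=3, scoring=metric)
    print(metric, scores.mean())
```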

2.2. Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input_file=INPUT_FILE

Tab- or space-separated file. The first column contains sample IDs; the second column contains sample labels as integers (must be 0 or 1); the third column contains sample label names (strings, must be consistent with column 2). The remaining columns contain features used to build the SVM model.

-n N_FOLD, --nfold=N_FOLD

The original sample is randomly partitioned into n equal-sized subsamples (2 <= n <= 10). Of the n subsamples, a single subsample is retained as the validation data for testing the model, and the remaining n - 1 subsamples are used as training data. default=5
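The partitioning scheme described above can be sketched with scikit-learn's KFold (an assumption about the underlying mechanics; the script may equally use StratifiedKFold or cross_val_score internally):

```python
from sklearn.model_selection import KFold

samples = list(range(10))                      # ten sample indices
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for i, (train_idx, test_idx) in enumerate(kf.split(samples), start=1):
    # Each iteration holds out 1/n of the samples as validation data
    # and trains on the remaining n-1 subsamples.
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"fold {i}: train={len(train_idx)} validate={len(test_idx)}")
```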

-p N_THREAD, --nthread=N_THREAD

Number of threads to use. default=2

-C C_VALUE, --cvalue=C_VALUE

C value. default=1.0

-k S_KERNEL, --kernel=S_KERNEL

Specifies the kernel type to be used in the algorithm. Must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’ or ‘precomputed’. default=linear
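The -C and -k options presumably map straight onto scikit-learn's SVC constructor; a hedged sketch of that mapping (the exact wiring inside SVM_performance.py is an assumption):

```python
from sklearn.svm import SVC

# e.g. running the script with `-C 10 -k linear` would build something like:
clf = SVC(C=10.0, kernel="linear")
print(clf.C, clf.kernel)
```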

2.3. Input file format

ID        Label  Label_name  feature_1  feature_2  feature_3  ...  feature_n
sample_1  1      WT          1560       795        0.9716     ...  ...
sample_2  1      WT          784        219        0.4087     ...  ...
sample_3  1      WT          2661       2268       1.1691     ...  ...
sample_4  0      Mut         643        198        0.5458     ...  ...
sample_5  0      Mut         534        87         1.0545     ...  ...
sample_6  0      Mut         332        75         0.5115     ...  ...

2.4. Example of input file

$ cat lung_CES_5features.tsv
TCGA_ID Label   Group   gsva_p53_activated      gsva_p53_repressed      ssGSEA_p53_activated    ssGSEA_p53_repressed    PC1
TCGA-22-4593-11A        0       Normal  0.97337963      -0.965872505    0.446594884     -0.332230329    10.12036762
TCGA-22-4609-11A        0       Normal  0.974507532     -0.971830001    0.480743696     -0.373937866    12.57932272
TCGA-22-5471-11A        0       Normal  0.981934732     -0.991054313    0.465087717     -0.354705367    11.50908022
TCGA-22-5472-11A        0       Normal  0.914660832     -0.889643616    0.433541263     -0.316566781    7.96785884
TCGA-22-5478-11A        0       Normal  0.983080513     -0.989789407    0.478239013     -0.370840097    11.81998124
TCGA-22-5481-11A        0       Normal  0.958950969     -0.973021839    0.441116626     -0.325822867    10.62201083
TCGA-22-5482-11A        0       Normal  0.97113164      -0.976324136    0.471515295     -0.362373723    10.78576876
TCGA-22-5483-11A        0       Normal  0.957377049     -0.986013986    0.378674475     -0.253223408    7.487083257
TCGA-22-5489-11A        0       Normal  0.963911525     -0.982725528    0.45219094      -0.339061168    9.49806089
TCGA-22-5491-11A        0       Normal  0.981934732     -0.991054313    0.475345705     -0.367218333    12.2813137
TCGA-33-4587-11A        0       Normal  0.90739615      -0.930774072    0.403446401     -0.281428331    9.368460346
TCGA-33-6737-11A        0       Normal  0.962025316     -0.957522049    0.495340808     -0.391557543    10.79155095
TCGA-34-7107-11A        0       Normal  0.949717514     -0.934120795    0.451010344     -0.337452999    10.04177079
TCGA-34-8454-11A        0       Normal  0.992397661     -0.987269255    0.480060883     -0.372603029    10.6050578
...

2.5. Command

$ python3 SVM_performance.py -i lung_CES_5features.tsv -C 10

Note

There is no rule of thumb for choosing a C value; try a range of different C values and pick the one that gives the best performance scores.
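One simple way to run such a scan, sketched with scikit-learn's cross_val_score on made-up toy data (GridSearchCV would be the more idiomatic choice for larger grids):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Toy two-feature data set; replace with the real feature matrix and labels.
X = np.array([[0.97, -0.97], [0.95, -0.96], [0.91, -0.89],
              [0.45, -0.35], [0.40, -0.28], [0.43, -0.32]])
y = np.array([0, 0, 0, 1, 1, 1])

results = {}
for C in (0.1, 1.0, 10.0, 100.0):
    scores = cross_val_score(SVC(kernel="linear", C=C), X, y,
                             cv=3, scoring="f1_macro")
    results[C] = scores.mean()
    print(f"C={C}: F1_macro={results[C]:.4f}")

best_C = max(results, key=results.get)
print("best C:", best_C)
```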

2.6. Output to screen

Preprocessing data ...
Evaluate metric(s) by cross-validation ...
F1 score is the weighted average of the precision and recall. F1 = 2 * (precision * recall) / (precision + recall)


F1_macro calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
       Iteration 1: 1.000000
       Iteration 2: 0.983518
       Iteration 3: 1.000000
       Iteration 4: 1.000000
       Iteration 5: 0.967273
F1-macro: 0.9902 (+/- 0.0262)


F1_micro calculate metrics globally by counting the total true positives, false negatives and false positives.
       Iteration 1: 1.000000
       Iteration 2: 0.986301
       Iteration 3: 1.000000
       Iteration 4: 1.000000
       Iteration 5: 0.972222
F1-micro: 0.9917 (+/- 0.0222)


accuracy is equal to F1_micro for binary classification problem
       Iteration 1: 1.000000
       Iteration 2: 0.986301
       Iteration 3: 1.000000
       Iteration 4: 1.000000
       Iteration 5: 0.972222
Accuracy: 0.9917 (+/- 0.0222)


Precision = tp / (tp + fp). It measures "out of all *predictive positives*, how many are correctly predicted?"
       Iteration 1: 1.000000
       Iteration 2: 1.000000
       Iteration 3: 1.000000
       Iteration 4: 1.000000
       Iteration 5: 1.000000
Precision: 1.0000 (+/- 0.0000)


Recall = tp / (tp + fn). Recall (i.e. sensitivity) measures "out of all  *positives*, how many are correctly predicted?"
       Iteration 1: 1.000000
       Iteration 2: 0.980769
       Iteration 3: 1.000000
       Iteration 4: 1.000000
       Iteration 5: 0.960784
Recall: 0.9883 (+/- 0.0313)
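The precision, recall, and F1 formulas printed above can be checked directly from raw confusion-matrix counts (the tp/fp/fn values here are made up for illustration):

```python
# Hypothetical counts: 50 true positives, 0 false positives, 1 false negative.
tp, fp, fn = 50, 0, 1

precision = tp / (tp + fp)                          # of all predicted positives
recall = tp / (tp + fn)                             # of all actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 4))  # 1.0
print(round(recall, 4))     # 0.9804
print(round(f1, 4))         # 0.9901
```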