2. SVM_performance.py
2.1. Description
Calculates performance metrics using K-fold cross-validation:
- F1_micro
- F1_macro
- Accuracy
- Precision
- Recall
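The metrics above can all be computed with scikit-learn's cross-validation helpers. A minimal sketch, assuming scikit-learn is installed and using synthetic data in place of a real input file (the variable names here are illustrative, not from SVM_performance.py):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + rng.normal(scale=0.1, size=60) > 0).astype(int)

clf = SVC(kernel="linear", C=1.0)
for metric in ["f1_micro", "f1_macro", "accuracy", "precision", "recall"]:
    # cross_val_score returns one score per fold (here, 5 folds).
    scores = cross_val_score(clf, X, y, cv=5, scoring=metric)
    print(f"{metric}: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
```

Each `scores` array holds one value per fold; the script's "Iteration" lines in the output below correspond to these per-fold scores.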
2.2. Options
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -i INPUT_FILE, --input_file=INPUT_FILE
Tab- or space-separated file. The first column contains sample IDs; the second column contains integer sample labels (must be 0 or 1); the third column contains sample label names (strings; must be consistent with column 2). The remaining columns contain features used to build the SVM model.
- -n N_FOLD, --nfold=N_FOLD
The original sample is randomly partitioned into n equal-sized subsamples (2 <= n <= 10). Of the n subsamples, a single subsample is retained as the validation data for testing the model, and the remaining n − 1 subsamples are used as training data. default=5
- -p N_THREAD, --nthread=N_THREAD
Number of threads to use. default=2
- -C C_VALUE, --cvalue=C_VALUE
C value. default=1.0
- -k S_KERNEL, --kernel=S_KERNEL
Specifies the kernel type to be used in the algorithm. Must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable. default=linear
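The option style shown above suggests the script parses its command line and passes `C` and the kernel name straight to the classifier constructor. A rough sketch of that mapping, written here with `argparse` and scikit-learn's `SVC` (the real script's internals and option variable names may differ):

```python
import argparse
from sklearn.svm import SVC

def build_parser():
    # Option names mirror the documented flags; defaults match the docs.
    p = argparse.ArgumentParser(prog="SVM_performance.py")
    p.add_argument("-i", "--input_file", dest="input_file")
    p.add_argument("-n", "--nfold", dest="n_fold", type=int, default=5)
    p.add_argument("-p", "--nthread", dest="n_thread", type=int, default=2)
    p.add_argument("-C", "--cvalue", dest="c_value", type=float, default=1.0)
    p.add_argument("-k", "--kernel", dest="s_kernel", default="linear")
    return p

# Example: equivalent of `SVM_performance.py -i data.tsv -C 10`
args = build_parser().parse_args(["-i", "data.tsv", "-C", "10"])
clf = SVC(C=args.c_value, kernel=args.s_kernel)
```

Unspecified options fall back to their documented defaults (n=5 folds, linear kernel).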
2.3. Input file format

ID       | Label | Label_name | feature_1 | feature_2 | feature_3 | … | feature_n
sample_1 | 1     | WT         | 1560      | 795       | 0.9716    | … | …
sample_2 | 1     | WT         | 784       | 219       | 0.4087    | … | …
sample_3 | 1     | WT         | 2661      | 2268      | 1.1691    | … | …
sample_4 | 0     | Mut        | 643       | 198       | 0.5458    | … | …
sample_5 | 0     | Mut        | 534       | 87        | 1.0545    | … | …
sample_6 | 0     | Mut        | 332       | 75        | 0.5115    | … | …
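A file in this layout can be loaded with pandas: the first column is the sample ID, the second is the integer label, the third is the label name, and everything from the fourth column onward is a feature. A sketch (the in-memory TSV and variable names are illustrative, not from the script):

```python
import io
import pandas as pd

# Two-row sample mimicking the documented layout (tab-separated).
tsv = (
    "ID\tLabel\tLabel_name\tfeature_1\tfeature_2\tfeature_3\n"
    "sample_1\t1\tWT\t1560\t795\t0.9716\n"
    "sample_4\t0\tMut\t643\t198\t0.5458\n"
)

# sep=r"\s+" accepts both tab- and space-separated input.
df = pd.read_csv(io.StringIO(tsv), sep=r"\s+")
X = df.iloc[:, 3:].to_numpy()   # feature matrix: columns 4..n
y = df["Label"].to_numpy()      # integer labels (0 or 1)
```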
2.4. Example of input file
$ cat lung_CES_5features.tsv
TCGA_ID Label Group gsva_p53_activated gsva_p53_repressed ssGSEA_p53_activated ssGSEA_p53_repressed PC1
TCGA-22-4593-11A 0 Normal 0.97337963 -0.965872505 0.446594884 -0.332230329 10.12036762
TCGA-22-4609-11A 0 Normal 0.974507532 -0.971830001 0.480743696 -0.373937866 12.57932272
TCGA-22-5471-11A 0 Normal 0.981934732 -0.991054313 0.465087717 -0.354705367 11.50908022
TCGA-22-5472-11A 0 Normal 0.914660832 -0.889643616 0.433541263 -0.316566781 7.96785884
TCGA-22-5478-11A 0 Normal 0.983080513 -0.989789407 0.478239013 -0.370840097 11.81998124
TCGA-22-5481-11A 0 Normal 0.958950969 -0.973021839 0.441116626 -0.325822867 10.62201083
TCGA-22-5482-11A 0 Normal 0.97113164 -0.976324136 0.471515295 -0.362373723 10.78576876
TCGA-22-5483-11A 0 Normal 0.957377049 -0.986013986 0.378674475 -0.253223408 7.487083257
TCGA-22-5489-11A 0 Normal 0.963911525 -0.982725528 0.45219094 -0.339061168 9.49806089
TCGA-22-5491-11A 0 Normal 0.981934732 -0.991054313 0.475345705 -0.367218333 12.2813137
TCGA-33-4587-11A 0 Normal 0.90739615 -0.930774072 0.403446401 -0.281428331 9.368460346
TCGA-33-6737-11A 0 Normal 0.962025316 -0.957522049 0.495340808 -0.391557543 10.79155095
TCGA-34-7107-11A 0 Normal 0.949717514 -0.934120795 0.451010344 -0.337452999 10.04177079
TCGA-34-8454-11A 0 Normal 0.992397661 -0.987269255 0.480060883 -0.372603029 10.6050578
...
2.5. Command
$ python3 SVM_performance.py -i lung_CES_5features.tsv -C 10
Note
There is no rule of thumb for choosing a C value; try a range of different C values and pick the one that gives the best performance scores.
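The "try a range of C values" advice above can be sketched as a simple loop that cross-validates each candidate and keeps the best one (synthetic data and variable names are for illustration; scikit-learn's `GridSearchCV` automates the same idea):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = (X[:, 0] > 0).astype(int)

results = {}
for c in [0.01, 0.1, 1.0, 10.0, 100.0]:
    scores = cross_val_score(SVC(kernel="linear", C=c), X, y,
                             cv=5, scoring="f1_macro")
    results[c] = scores.mean()

# C with the highest mean cross-validated F1_macro.
best_c = max(results, key=results.get)
```

Because each candidate is scored on held-out folds, this avoids picking a C that merely overfits the training data.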
2.6. Output to screen
Preprocessing data ...
Evaluate metric(s) by cross-validation ...
F1 score is the harmonic mean of precision and recall. F1 = 2 * (precision * recall) / (precision + recall)
F1_macro calculates metrics for each label and finds their unweighted mean. This does not take label imbalance into account.
Iteration 1: 1.000000
Iteration 2: 0.983518
Iteration 3: 1.000000
Iteration 4: 1.000000
Iteration 5: 0.967273
F1-macro: 0.9902 (+/- 0.0262)
F1_micro calculates metrics globally by counting the total true positives, false negatives and false positives.
Iteration 1: 1.000000
Iteration 2: 0.986301
Iteration 3: 1.000000
Iteration 4: 1.000000
Iteration 5: 0.972222
F1-micro: 0.9917 (+/- 0.0222)
Accuracy is equal to F1_micro for binary classification problems.
Iteration 1: 1.000000
Iteration 2: 0.986301
Iteration 3: 1.000000
Iteration 4: 1.000000
Iteration 5: 0.972222
Accuracy: 0.9917 (+/- 0.0222)
Precision = tp / (tp + fp). It measures "out of all *predicted positives*, how many are correctly predicted?"
Iteration 1: 1.000000
Iteration 2: 1.000000
Iteration 3: 1.000000
Iteration 4: 1.000000
Iteration 5: 1.000000
Precision: 1.0000 (+/- 0.0000)
Recall = tp / (tp + fn). Recall (i.e. sensitivity) measures "out of all *actual positives*, how many are correctly predicted?"
Iteration 1: 1.000000
Iteration 2: 0.980769
Iteration 3: 1.000000
Iteration 4: 1.000000
Iteration 5: 0.960784
Recall: 0.9883 (+/- 0.0313)
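The precision and recall formulas printed above can be checked directly against confusion-matrix counts. A small self-contained sketch (the example labels are made up, not from the script's output):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# ravel() flattens the 2x2 confusion matrix to (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)   # out of all predicted positives
recall = tp / (tp + fn)      # out of all actual positives

# The hand-computed values match scikit-learn's metrics.
assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
```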