K-Nearest Neighbors (KNN)#
pimmslearn - INFO Experiment 03 - Analysis of latent spaces and performance comparisions
Papermill script parameters:
# files and folders
folder_experiment: str = 'runs/example' # Datasplit folder with data for experiment
folder_data: str = '' # specify data directory if needed
file_format: str = 'csv' # file format of created splits, default pickle (pkl)
# Machine parsed metadata from rawfile workflow
fn_rawfile_metadata: str = 'data/dev_datasets/HeLa_6070/files_selected_metadata_N50.csv'
# training
epochs_max: int = 50 # Maximum number of epochs
# early_stopping: bool = True # Whether to use early stopping or not
batch_size: int = 64 # Batch size for training (and evaluation)
cuda: bool = True # Whether to use a GPU for training
# model
neighbors: int = 3 # number of nearest neighbors to use
force_train: bool = True # Force training even if a saved model could be used. By default, the model is re-trained
sample_idx_position: int = 0 # position of index which is sample ID
model: str = 'KNN' # model name
model_key: str = 'KNN' # potentially alternative key for model (grid search)
save_pred_real_na: bool = True # Save all predictions for missing values
# metadata -> defaults for metadata extracted from machine data
meta_date_col: str = None # date column in meta data
meta_cat_col: str = None # category column in meta data
# Parameters
model = "KNN"
neighbors = 3
file_format = "csv"
fn_rawfile_metadata = "https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv"
folder_experiment = "runs/alzheimer_study"
model_key = "KNN"
Some argument transformations
{'batch_size': 64,
'cuda': True,
'data': Path('runs/alzheimer_study/data'),
'epochs_max': 50,
'file_format': 'csv',
'fn_rawfile_metadata': 'https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv',
'folder_data': '',
'folder_experiment': Path('runs/alzheimer_study'),
'force_train': True,
'meta_cat_col': None,
'meta_date_col': None,
'model': 'KNN',
'model_key': 'KNN',
'neighbors': 3,
'out_figures': Path('runs/alzheimer_study/figures'),
'out_folder': Path('runs/alzheimer_study'),
'out_metrics': Path('runs/alzheimer_study'),
'out_models': Path('runs/alzheimer_study'),
'out_preds': Path('runs/alzheimer_study/preds'),
'sample_idx_position': 0,
'save_pred_real_na': True}
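The derived paths in the dictionary above all hang off `folder_experiment`. A minimal sketch of such a transformation, assuming a hypothetical helper `transform_args` (not part of pimmslearn as far as the output shows), could look like this:

```python
from pathlib import Path


def transform_args(params: dict) -> dict:
    """Derive output folders from the experiment folder (hypothetical helper)."""
    args = dict(params)  # copy, so the raw parameters stay untouched
    folder = Path(args["folder_experiment"])
    args["folder_experiment"] = folder
    args["out_folder"] = folder
    args["data"] = folder / "data"
    args["out_figures"] = folder / "figures"
    args["out_preds"] = folder / "preds"
    # metrics and models are stored directly in the experiment folder
    args["out_metrics"] = folder
    args["out_models"] = folder
    return args


args = transform_args(
    {"folder_experiment": "runs/alzheimer_study", "model": "KNN", "neighbors": 3}
)
print(args["data"])
```

Only the folder layout is taken from the output above; the function name and signature are illustrative.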
Some naming conventions
Load data in long format#
pimmslearn.io.datasplits - INFO Loaded 'train_X' from file: runs/alzheimer_study/data/train_X.csv
pimmslearn.io.datasplits - INFO Loaded 'val_y' from file: runs/alzheimer_study/data/val_y.csv
pimmslearn.io.datasplits - INFO Loaded 'test_y' from file: runs/alzheimer_study/data/test_y.csv
data is loaded in long format
Sample ID protein groups
Sample_186 Q8NCC3 15.861
Sample_131 A0A087WX80;P24043 16.178
Sample_167 Q15375;Q15375-4 20.863
Sample_086 Q02818 17.039
Sample_066 B4E1Z4 18.368
Name: intensity, dtype: float64
load meta data for splits
| Sample ID | _collection site | _age at CSF collection | _gender | _t-tau [ng/L] | _p-tau [ng/L] | _Abeta-42 [ng/L] | _Abeta-40 [ng/L] | _Abeta-42/Abeta-40 ratio | _primary biochemical AD classification | _clinical AD diagnosis | _MMSE score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample_000 | Sweden | 71.000 | f | 703.000 | 85.000 | 562.000 | NaN | NaN | biochemical control | NaN | NaN |
| Sample_001 | Sweden | 77.000 | m | 518.000 | 91.000 | 334.000 | NaN | NaN | biochemical AD | NaN | NaN |
| Sample_002 | Sweden | 75.000 | m | 974.000 | 87.000 | 515.000 | NaN | NaN | biochemical AD | NaN | NaN |
| Sample_003 | Sweden | 72.000 | f | 950.000 | 109.000 | 394.000 | NaN | NaN | biochemical AD | NaN | NaN |
| Sample_004 | Sweden | 63.000 | f | 873.000 | 88.000 | 234.000 | NaN | NaN | biochemical AD | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sample_205 | Berlin | 69.000 | f | 1,945.000 | NaN | 699.000 | 12,140.000 | 0.058 | biochemical AD | AD | 17.000 |
| Sample_206 | Berlin | 73.000 | m | 299.000 | NaN | 1,420.000 | 16,571.000 | 0.086 | biochemical control | non-AD | 28.000 |
| Sample_207 | Berlin | 71.000 | f | 262.000 | NaN | 639.000 | 9,663.000 | 0.066 | biochemical control | non-AD | 28.000 |
| Sample_208 | Berlin | 83.000 | m | 289.000 | NaN | 1,436.000 | 11,285.000 | 0.127 | biochemical control | non-AD | 24.000 |
| Sample_209 | Berlin | 63.000 | f | 591.000 | NaN | 1,299.000 | 11,232.000 | 0.116 | biochemical control | non-AD | 29.000 |
210 rows × 11 columns
Initialize Comparison#
protein groups
A0A024QZX5;A0A087X1N8;P35237 180
A0A024R0T9;K7ER74;P02655 196
A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 170
A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 194
A0A075B6H7 87
Name: intensity, dtype: int64
Simulated missing values#
The validation fake NAs are used by all models to evaluate training performance.
| Sample ID | protein groups | observed |
|---|---|---|
| Sample_158 | Q9UN70;Q9UN70-2 | 14.630 |
| Sample_050 | Q9Y287 | 15.755 |
| Sample_107 | Q8N475;Q8N475-2 | 15.029 |
| Sample_199 | P06307 | 19.376 |
| Sample_067 | Q5VUB5 | 15.309 |
| ... | ... | ... |
| Sample_111 | F6SYF8;Q9UBP4 | 22.822 |
| Sample_002 | A0A0A0MT36 | 18.165 |
| Sample_049 | Q8WY21;Q8WY21-2;Q8WY21-3;Q8WY21-4 | 15.525 |
| Sample_182 | Q8NFT8 | 14.379 |
| Sample_123 | Q16853;Q16853-2 | 14.504 |
12600 rows × 1 columns
| | observed |
|---|---|
| count | 12,600.000 |
| mean | 16.339 |
| std | 2.741 |
| min | 7.209 |
| 25% | 14.412 |
| 50% | 15.935 |
| 75% | 17.910 |
| max | 30.140 |
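How the simulated missing values were drawn is not shown in this notebook, since the splits are loaded from files. As a rough sketch only, assuming fake NAs are sampled uniformly at random from the observed long-format intensities (the fraction, seed, and toy data below are made up), the idea can be illustrated as:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy long-format data: 20 samples x 10 protein groups (hypothetical names)
index = pd.MultiIndex.from_product(
    [[f"Sample_{i:03d}" for i in range(20)], [f"PG{j}" for j in range(10)]],
    names=["Sample ID", "protein groups"],
)
observed = pd.Series(rng.normal(16, 2.7, size=len(index)), index=index, name="intensity")

# Assumption: 10% of observed values are held out as fake NAs,
# split evenly into a validation and a test part.
fake_na = observed.sample(frac=0.1, random_state=42)
half = len(fake_na) // 2
val_y = fake_na.iloc[:half]
test_y = fake_na.iloc[half:]

# The training data no longer contains the held-out entries.
train_X = observed.loc[~observed.index.isin(fake_na.index)]
print(len(train_X), len(val_y), len(test_y))
```

The held-out values keep their true intensity, so a model's reconstruction can later be compared against them.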
Data in wide format#
| Sample ID | A0A024QZX5;A0A087X1N8;P35237 | A0A024R0T9;K7ER74;P02655 | A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 | A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 | A0A075B6H7 | A0A075B6H9 | A0A075B6I0 | A0A075B6I1 | A0A075B6I6 | A0A075B6I9 | ... | Q9Y653;Q9Y653-2;Q9Y653-3 | Q9Y696 | Q9Y6C2 | Q9Y6N6 | Q9Y6N7;Q9Y6N7-2;Q9Y6N7-4 | Q9Y6R7 | Q9Y6X5 | Q9Y6Y8;Q9Y6Y8-2 | Q9Y6Y9 | S4R3U6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample_000 | 15.912 | 16.852 | 15.570 | 16.481 | 17.301 | 20.246 | 16.764 | 17.584 | 16.988 | 20.054 | ... | 16.012 | 15.178 | NaN | 15.050 | 16.842 | NaN | NaN | 19.563 | NaN | 12.805 |
| Sample_001 | NaN | 16.874 | 15.519 | 16.387 | NaN | 19.941 | 18.786 | 17.144 | NaN | 19.067 | ... | 15.528 | 15.576 | NaN | 14.833 | 16.597 | 20.299 | 15.556 | 19.386 | 13.970 | 12.442 |
| Sample_002 | 16.111 | NaN | 15.935 | 16.416 | 18.175 | 19.251 | 16.832 | 15.671 | 17.012 | 18.569 | ... | 15.229 | 14.728 | 13.757 | 15.118 | 17.440 | 19.598 | 15.735 | 20.447 | 12.636 | 12.505 |
| Sample_003 | 16.107 | 17.032 | 15.802 | 16.979 | 15.963 | 19.628 | 17.852 | 18.877 | 14.182 | 18.985 | ... | 15.495 | 14.590 | 14.682 | 15.140 | 17.356 | 19.429 | NaN | 20.216 | NaN | 12.445 |
| Sample_004 | 15.603 | 15.331 | 15.375 | 16.679 | NaN | 20.450 | 18.682 | 17.081 | 14.140 | 19.686 | ... | 14.757 | NaN | NaN | 15.256 | 17.075 | 19.582 | 15.328 | NaN | 13.145 | NaN |
5 rows × 1421 columns
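The step from long to wide format can be sketched with plain pandas. The example below assumes the long data is a `Series` indexed by (`Sample ID`, `protein groups`), as in the output earlier on this page, and uses three values from the table above; the notebook's own helper may differ.

```python
import pandas as pd

# A few long-format intensities, indexed by sample and protein group
long = pd.Series(
    [15.912, 16.852, 16.874, 16.111],
    index=pd.MultiIndex.from_tuples(
        [
            ("Sample_000", "A0A024QZX5;A0A087X1N8;P35237"),
            ("Sample_000", "A0A024R0T9;K7ER74;P02655"),
            ("Sample_001", "A0A024R0T9;K7ER74;P02655"),
            ("Sample_002", "A0A024QZX5;A0A087X1N8;P35237"),
        ],
        names=["Sample ID", "protein groups"],
    ),
    name="intensity",
)

# Samples become rows, protein groups become columns;
# combinations that were never measured show up as NaN.
wide = long.unstack("protein groups")
print(wide)
```

Entries absent from the long format become `NaN` in the wide table, which is what the imputation model later fills in.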
Train#
model = 'sklearn_knn'
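The output only names the model as `sklearn_knn` with `neighbors = 3`. Assuming this refers to scikit-learn's `KNNImputer` (an assumption, the notebook code is not shown here), imputing a wide table could be sketched as follows, with made-up protein group names and rounded intensities:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy wide-format data: rows are samples, columns are protein groups
wide = pd.DataFrame(
    [
        [15.9, 16.9, 15.6],
        [np.nan, 16.9, 15.5],
        [16.1, np.nan, 15.9],
        [16.1, 17.0, 15.8],
        [15.6, 15.3, np.nan],
    ],
    index=pd.Index([f"Sample_{i:03d}" for i in range(5)], name="Sample ID"),
    columns=pd.Index(["PG_A", "PG_B", "PG_C"], name="protein groups"),
)

# Each missing entry is replaced by the mean of that feature over the
# 3 nearest samples (distance computed on the features both samples share).
imputer = KNNImputer(n_neighbors=3)
imputed = pd.DataFrame(
    imputer.fit_transform(wide), index=wide.index, columns=wide.columns
)
print(imputed.round(3))
```

`KNNImputer` leaves observed values untouched and only fills the gaps, so the result has the same shape as the input as long as no column is entirely missing.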
Predictions#
The data used to create predictions for the training and validation datasets is the same as the training data.
predictions include missing values (which are not further compared)
create predictions and select for split entries
Sample ID protein groups
Sample_000 A0A024QZX5;A0A087X1N8;P35237 15.912
A0A024R0T9;K7ER74;P02655 16.852
A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 15.570
A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 16.481
A0A075B6H7 17.301
...
Sample_209 Q9Y6R7 19.275
Q9Y6X5 15.732
Q9Y6Y8;Q9Y6Y8-2 19.577
Q9Y6Y9 11.042
S4R3U6 11.791
Length: 298410, dtype: float64
| Sample ID | protein groups | observed | KNN |
|---|---|---|---|
| Sample_158 | Q9UN70;Q9UN70-2 | 14.630 | 15.427 |
| Sample_050 | Q9Y287 | 15.755 | 17.776 |
| Sample_107 | Q8N475;Q8N475-2 | 15.029 | 14.150 |
| Sample_199 | P06307 | 19.376 | 19.247 |
| Sample_067 | Q5VUB5 | 15.309 | 15.232 |
| ... | ... | ... | ... |
| Sample_111 | F6SYF8;Q9UBP4 | 22.822 | 22.884 |
| Sample_002 | A0A0A0MT36 | 18.165 | 16.857 |
| Sample_049 | Q8WY21;Q8WY21-2;Q8WY21-3;Q8WY21-4 | 15.525 | 15.840 |
| Sample_182 | Q8NFT8 | 14.379 | 13.685 |
| Sample_123 | Q16853;Q16853-2 | 14.504 | 14.612 |
12600 rows × 2 columns
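Selecting the predictions for the entries of one split can be sketched by stacking the imputed wide table back to long format and reindexing with the split's index. The names `pred_wide` and `val_y` and the toy values are assumptions for illustration:

```python
import pandas as pd

# Imputed values in wide format (toy values, hypothetical protein groups)
pred_wide = pd.DataFrame(
    {"PG_A": [15.4, 17.8], "PG_B": [14.2, 19.2]},
    index=pd.Index(["Sample_158", "Sample_050"], name="Sample ID"),
)
pred_wide.columns.name = "protein groups"

# Back to long format: one value per (sample, protein group)
pred_long = pred_wide.stack()

# Held-out validation values with their true intensity
val_y = pd.Series(
    [14.6, 19.4],
    index=pd.MultiIndex.from_tuples(
        [("Sample_158", "PG_A"), ("Sample_050", "PG_B")],
        names=["Sample ID", "protein groups"],
    ),
    name="observed",
)

# Keep only the predictions for entries that belong to the split
val_pred = val_y.to_frame()
val_pred["KNN"] = pred_long.reindex(val_y.index)
print(val_pred)
```

The result has one row per held-out entry with the observed value next to the model's prediction, matching the layout of the table above.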
| Sample ID | protein groups | observed | KNN |
|---|---|---|---|
| Sample_000 | A0A075B6P5;P01615 | 17.016 | 17.190 |
| | A0A087X089;Q16627;Q16627-2 | 18.280 | 18.293 |
| | A0A0B4J2B5;S4R460 | 21.735 | 21.835 |
| | A0A140T971;O95865;Q5SRR8;Q5SSV3 | 14.603 | 15.172 |
| | A0A140TA33;A0A140TA41;A0A140TA52;P22105;P22105-3;P22105-4 | 16.143 | 16.625 |
| ... | ... | ... | ... |
| Sample_209 | Q96ID5 | 16.074 | 15.909 |
| | Q9H492;Q9H492-2 | 13.173 | 13.669 |
| | Q9HC57 | 14.207 | 13.962 |
| | Q9NPH3;Q9NPH3-2;Q9NPH3-5 | 14.962 | 15.094 |
| | Q9UGM5;Q9UGM5-2 | 16.871 | 16.255 |
12600 rows × 2 columns
save missing values predictions
Sample ID protein groups
Sample_000 A0A075B6J9 15.570
A0A075B6Q5 15.867
A0A075B6R2 16.478
A0A075B6S5 16.079
A0A087WSY4 16.367
...
Sample_209 Q9P1W8;Q9P1W8-2;Q9P1W8-4 15.928
Q9UI40;Q9UI40-2 16.640
Q9UIW2 17.463
Q9UMX0;Q9UMX0-2;Q9UMX0-4 13.007
Q9UP79 15.924
Name: intensity, Length: 46401, dtype: float64
Plots#
validation data
Comparisons#
Validation data#
all measured (identified, observed) peptides in validation data
It does not make too much sense to compare collab and AEs, as the setup of training and validation data differs.
The fake NAs for the validation step are real test data (used neither for training nor for early stopping).
Selected as truth to compare to: observed
{'KNN': {'MSE': 0.5506284398184521,
'MAE': 0.4813403662629595,
'N': 12600,
'prop': 1.0}}
Test Datasplit#
Fake NAs: Artificially created NAs. Some data was sampled and explicitly set to missing before it was fed to the model for reconstruction.
Selected as truth to compare to: observed
{'KNN': {'MSE': 0.5474248505726204,
'MAE': 0.4817449433647998,
'N': 12600,
'prop': 1.0}}
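The reported metrics compare the observed value of each fake NA with the model's prediction. A minimal sketch, assuming `prop` denotes the share of fake NAs for which a prediction is available (the function name is hypothetical), applied to the first five validation rows shown earlier:

```python
import numpy as np


def fake_na_metrics(observed, predicted) -> dict:
    """MSE and MAE over entries with a prediction (illustrative helper)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = ~np.isnan(predicted)  # entries the model could predict
    err = predicted[mask] - observed[mask]
    return {
        "MSE": float(np.mean(err**2)),
        "MAE": float(np.mean(np.abs(err))),
        "N": int(mask.sum()),
        "prop": float(mask.mean()),  # assumed meaning: share of entries predicted
    }


# First five validation entries from the table above (observed vs. KNN)
observed = [14.630, 15.755, 15.029, 19.376, 15.309]
knn = [15.427, 17.776, 14.150, 19.247, 15.232]
metrics = fake_na_metrics(observed, knn)
print(metrics)
```

Five rows are far too few to reproduce the reported values; the numbers in the output above are computed over all 12,600 entries of each split.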
Save all metrics as json
{ 'test_fake_na': { 'KNN': { 'MAE': 0.4817449433647998,
'MSE': 0.5474248505726204,
'N': 12600,
'prop': 1.0}},
'valid_fake_na': { 'KNN': { 'MAE': 0.4813403662629595,
'MSE': 0.5506284398184521,
'N': 12600,
'prop': 1.0}}}
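Writing the nested metrics dictionary to disk needs nothing beyond the standard library. The sketch below uses a temporary folder and a hypothetical file name; the notebook's actual target is the `out_metrics` folder, and its file name is not shown in the output.

```python
import json
import tempfile
from pathlib import Path

# Metrics as reported above
all_metrics = {
    "test_fake_na": {
        "KNN": {"MAE": 0.4817449433647998, "MSE": 0.5474248505726204, "N": 12600, "prop": 1.0}
    },
    "valid_fake_na": {
        "KNN": {"MAE": 0.4813403662629595, "MSE": 0.5506284398184521, "N": 12600, "prop": 1.0}
    },
}

out_metrics = Path(tempfile.mkdtemp())  # stand-in for the experiment folder
fname = out_metrics / "metrics_KNN.json"  # hypothetical file name
fname.write_text(json.dumps(all_metrics, indent=2))

# Reading the file back returns the same nested structure
loaded = json.loads(fname.read_text())
print(loaded["valid_fake_na"]["KNN"]["MSE"])
```

Keeping one file per model makes it straightforward to collect the metrics of several models into a single comparison table later.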
| model | metric_name | valid_fake_na | test_fake_na |
|---|---|---|---|
| KNN | MSE | 0.551 | 0.547 |
| | MAE | 0.481 | 0.482 |
| | N | 12,600.000 | 12,600.000 |
| | prop | 1.000 | 1.000 |
Save predictions#
Config#
{}
{'M': 1421,
'batch_size': 64,
'cuda': True,
'data': Path('runs/alzheimer_study/data'),
'epochs_max': 50,
'file_format': 'csv',
'fn_rawfile_metadata': 'https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv',
'folder_data': '',
'folder_experiment': Path('runs/alzheimer_study'),
'force_train': True,
'meta_cat_col': None,
'meta_date_col': None,
'model': 'KNN',
'model_key': 'KNN',
'n_params': 1,
'neighbors': 3,
'out_figures': Path('runs/alzheimer_study/figures'),
'out_folder': Path('runs/alzheimer_study'),
'out_metrics': Path('runs/alzheimer_study'),
'out_models': Path('runs/alzheimer_study'),
'out_preds': Path('runs/alzheimer_study/preds'),
'sample_idx_position': 0,
'save_pred_real_na': True}