K-Nearest Neighbors (KNN)#
pimmslearn - INFO Experiment 03 - Analysis of latent spaces and performance comparisions
Papermill script parameters:
# files and folders
folder_experiment: str = 'runs/example' # Datasplit folder with data for experiment
folder_data: str = '' # specify data directory if needed
file_format: str = 'csv' # file format of the created splits, e.g. csv or pickle (pkl)
# Machine parsed metadata from rawfile workflow
fn_rawfile_metadata: str = 'data/dev_datasets/HeLa_6070/files_selected_metadata_N50.csv'
# training
epochs_max: int = 50 # Maximum number of epochs
# early_stopping:bool = True # Whether to use early stopping or not
batch_size: int = 64 # Batch size for training (and evaluation)
cuda: bool = True # Whether to use a GPU for training
# model
neighbors: int = 3 # number of nearest neighbors to use
force_train: bool = True # Force training even when a saved model could be used. By default the model is re-trained
sample_idx_position: int = 0 # position of index which is sample ID
model: str = 'KNN' # model name
model_key: str = 'KNN' # potentially alternative key for model (grid search)
save_pred_real_na: bool = True # Save all predictions for missing values
# metadata -> defaults for metadata extracted from machine data
meta_date_col: str = None # date column in meta data
meta_cat_col: str = None # category column in meta data
# Parameters
model = "KNN"
neighbors = 5
file_format = "csv"
fn_rawfile_metadata = "https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv"
folder_experiment = "runs/alzheimer_study"
model_key = "KNN5"
Some argument transformations
{'batch_size': 64,
'cuda': True,
'data': Path('runs/alzheimer_study/data'),
'epochs_max': 50,
'file_format': 'csv',
'fn_rawfile_metadata': 'https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv',
'folder_data': '',
'folder_experiment': Path('runs/alzheimer_study'),
'force_train': True,
'meta_cat_col': None,
'meta_date_col': None,
'model': 'KNN',
'model_key': 'KNN5',
'neighbors': 5,
'out_figures': Path('runs/alzheimer_study/figures'),
'out_folder': Path('runs/alzheimer_study'),
'out_metrics': Path('runs/alzheimer_study'),
'out_models': Path('runs/alzheimer_study'),
'out_preds': Path('runs/alzheimer_study/preds'),
'sample_idx_position': 0,
'save_pred_real_na': True}
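The derived entries in the dictionary above (`data`, `out_figures`, `out_preds`, ...) all hang off `folder_experiment`. A minimal sketch of such a transformation, assuming a hypothetical helper `derive_paths` (this is not the pimmslearn API, only an illustration consistent with the printed values):

```python
from pathlib import Path


def derive_paths(folder_experiment: str, folder_data: str = "") -> dict:
    """Hypothetical helper: derive output folders from the experiment folder."""
    root = Path(folder_experiment)
    return {
        "folder_experiment": root,
        # use an explicit data folder if given, otherwise <experiment>/data
        "data": Path(folder_data) if folder_data else root / "data",
        "out_folder": root,
        "out_figures": root / "figures",
        "out_metrics": root,
        "out_models": root,
        "out_preds": root / "preds",
    }


args = derive_paths("runs/alzheimer_study")
print(args["data"])
```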
Some naming conventions
Load data in long format#
pimmslearn.io.datasplits - INFO Loaded 'train_X' from file: runs/alzheimer_study/data/train_X.csv
pimmslearn.io.datasplits - INFO Loaded 'val_y' from file: runs/alzheimer_study/data/val_y.csv
pimmslearn.io.datasplits - INFO Loaded 'test_y' from file: runs/alzheimer_study/data/test_y.csv
data is loaded in long format
Sample ID protein groups
Sample_049 O95467 14.593
Sample_093 A2A2V1;P04156;P04156-2 21.897
Sample_151 A0A1B0GVB9;A0A1C7CYW4;O75787;O75787-2 19.224
Sample_060 Q969P0;Q969P0-3 18.653
Sample_148 Q8IZS8 14.033
Name: intensity, dtype: float64
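For readers who want to reproduce the long format outside of pimmslearn, the following sketch reads a small CSV into a pandas Series indexed by sample and protein group. The column layout of the CSV (`Sample ID`, `protein groups`, `intensity`) is an assumption based on the output above, and the inline text stands in for a file such as `train_X.csv`.

```python
import io

import pandas as pd

# inline stand-in for a split file; the column layout is assumed
csv_text = """Sample ID,protein groups,intensity
Sample_049,O95467,14.593
Sample_093,A2A2V1;P04156;P04156-2,21.897
Sample_148,Q8IZS8,14.033
"""

# the first two columns form the (sample, feature) index; one value column remains
train_X = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1]).squeeze("columns")
print(train_X)
```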
load meta data for splits
| Sample ID | _collection site | _age at CSF collection | _gender | _t-tau [ng/L] | _p-tau [ng/L] | _Abeta-42 [ng/L] | _Abeta-40 [ng/L] | _Abeta-42/Abeta-40 ratio | _primary biochemical AD classification | _clinical AD diagnosis | _MMSE score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample_000 | Sweden | 71.000 | f | 703.000 | 85.000 | 562.000 | NaN | NaN | biochemical control | NaN | NaN |
| Sample_001 | Sweden | 77.000 | m | 518.000 | 91.000 | 334.000 | NaN | NaN | biochemical AD | NaN | NaN |
| Sample_002 | Sweden | 75.000 | m | 974.000 | 87.000 | 515.000 | NaN | NaN | biochemical AD | NaN | NaN |
| Sample_003 | Sweden | 72.000 | f | 950.000 | 109.000 | 394.000 | NaN | NaN | biochemical AD | NaN | NaN |
| Sample_004 | Sweden | 63.000 | f | 873.000 | 88.000 | 234.000 | NaN | NaN | biochemical AD | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sample_205 | Berlin | 69.000 | f | 1,945.000 | NaN | 699.000 | 12,140.000 | 0.058 | biochemical AD | AD | 17.000 |
| Sample_206 | Berlin | 73.000 | m | 299.000 | NaN | 1,420.000 | 16,571.000 | 0.086 | biochemical control | non-AD | 28.000 |
| Sample_207 | Berlin | 71.000 | f | 262.000 | NaN | 639.000 | 9,663.000 | 0.066 | biochemical control | non-AD | 28.000 |
| Sample_208 | Berlin | 83.000 | m | 289.000 | NaN | 1,436.000 | 11,285.000 | 0.127 | biochemical control | non-AD | 24.000 |
| Sample_209 | Berlin | 63.000 | f | 591.000 | NaN | 1,299.000 | 11,232.000 | 0.116 | biochemical control | non-AD | 29.000 |
210 rows × 11 columns
Initialize Comparison#
protein groups
A0A024QZX5;A0A087X1N8;P35237 180
A0A024R0T9;K7ER74;P02655 196
A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 170
A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 194
A0A075B6H7 87
Name: intensity, dtype: int64
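The series above appears to hold, for each protein group, the number of samples in which it was observed in the training data. Assuming that interpretation, such counts can be obtained from long-format data with a group-by on the feature level (toy data, variable names are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("Sample_000", "P1"), ("Sample_001", "P1"), ("Sample_000", "P2")],
    names=["Sample ID", "protein groups"],
)
train_X = pd.Series([15.9, 16.1, 16.8], index=index, name="intensity")

# number of observed (non-missing) intensities per protein group
freq_feat = train_X.groupby(level="protein groups").count()
print(freq_feat)
```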
Simulated missing values#
The validation fake NAs are used by all models to evaluate training performance.
| Sample ID | protein groups | observed |
|---|---|---|
| Sample_158 | Q9UN70;Q9UN70-2 | 14.630 |
| Sample_050 | Q9Y287 | 15.755 |
| Sample_107 | Q8N475;Q8N475-2 | 15.029 |
| Sample_199 | P06307 | 19.376 |
| Sample_067 | Q5VUB5 | 15.309 |
| ... | ... | ... |
| Sample_111 | F6SYF8;Q9UBP4 | 22.822 |
| Sample_002 | A0A0A0MT36 | 18.165 |
| Sample_049 | Q8WY21;Q8WY21-2;Q8WY21-3;Q8WY21-4 | 15.525 |
| Sample_182 | Q8NFT8 | 14.379 |
| Sample_123 | Q16853;Q16853-2 | 14.504 |
12600 rows × 1 columns
| | observed |
|---|---|
| count | 12,600.000 |
| mean | 16.339 |
| std | 2.741 |
| min | 7.209 |
| 25% | 14.412 |
| 50% | 15.935 |
| 75% | 17.910 |
| max | 30.140 |
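To illustrate what simulated missing values are, the sketch below samples a fraction of the observed intensities, removes them from the training data, and splits them into a validation and a test part. The sampling fraction, the 50/50 split and the random seed are assumptions for illustration, not the settings used to create the splits shown here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
index = pd.MultiIndex.from_product(
    [[f"Sample_{i:03d}" for i in range(10)], [f"P{j:02d}" for j in range(20)]],
    names=["Sample ID", "protein groups"],
)
data = pd.Series(rng.normal(16.0, 2.7, size=len(index)), index=index, name="intensity")

# sample 10% of the observed values and treat them as missing ("fake NAs")
fake_na = data.sample(frac=0.1, random_state=0)
half = len(fake_na) // 2
val_y, test_y = fake_na.iloc[:half], fake_na.iloc[half:]

# the training data no longer contains the sampled entries
train_X = data.loc[~data.index.isin(fake_na.index)]
print(len(train_X), len(val_y), len(test_y))
```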
Data in wide format#
| Sample ID | A0A024QZX5;A0A087X1N8;P35237 | A0A024R0T9;K7ER74;P02655 | A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 | A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 | A0A075B6H7 | A0A075B6H9 | A0A075B6I0 | A0A075B6I1 | A0A075B6I6 | A0A075B6I9 | ... | Q9Y653;Q9Y653-2;Q9Y653-3 | Q9Y696 | Q9Y6C2 | Q9Y6N6 | Q9Y6N7;Q9Y6N7-2;Q9Y6N7-4 | Q9Y6R7 | Q9Y6X5 | Q9Y6Y8;Q9Y6Y8-2 | Q9Y6Y9 | S4R3U6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample_000 | 15.912 | 16.852 | 15.570 | 16.481 | 17.301 | 20.246 | 16.764 | 17.584 | 16.988 | 20.054 | ... | 16.012 | 15.178 | NaN | 15.050 | 16.842 | NaN | NaN | 19.563 | NaN | 12.805 |
| Sample_001 | NaN | 16.874 | 15.519 | 16.387 | NaN | 19.941 | 18.786 | 17.144 | NaN | 19.067 | ... | 15.528 | 15.576 | NaN | 14.833 | 16.597 | 20.299 | 15.556 | 19.386 | 13.970 | 12.442 |
| Sample_002 | 16.111 | NaN | 15.935 | 16.416 | 18.175 | 19.251 | 16.832 | 15.671 | 17.012 | 18.569 | ... | 15.229 | 14.728 | 13.757 | 15.118 | 17.440 | 19.598 | 15.735 | 20.447 | 12.636 | 12.505 |
| Sample_003 | 16.107 | 17.032 | 15.802 | 16.979 | 15.963 | 19.628 | 17.852 | 18.877 | 14.182 | 18.985 | ... | 15.495 | 14.590 | 14.682 | 15.140 | 17.356 | 19.429 | NaN | 20.216 | NaN | 12.445 |
| Sample_004 | 15.603 | 15.331 | 15.375 | 16.679 | NaN | 20.450 | 18.682 | 17.081 | 14.140 | 19.686 | ... | 14.757 | NaN | NaN | 15.256 | 17.075 | 19.582 | 15.328 | NaN | 13.145 | NaN |
5 rows × 1421 columns
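Going from the long format to the wide (samples × protein groups) format is a pivot on the feature level of the index. A small sketch with pandas on toy values; combinations that were not measured become `NaN`:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("Sample_000", "P1"), ("Sample_000", "P2"), ("Sample_001", "P2")],
    names=["Sample ID", "protein groups"],
)
long = pd.Series([15.912, 16.852, 16.874], index=index, name="intensity")

# one row per sample, one column per protein group
wide = long.unstack("protein groups")
print(wide)
```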
Train#
model = 'sklearn_knn'
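The model name suggests a scikit-learn based KNN imputation. As a sketch under that assumption (the notebook's actual training code is not shown here), `sklearn.impute.KNNImputer` fills each missing value with the mean of that feature over the nearest samples that have it observed. The toy data and `n_neighbors=2` are illustrative; the run above uses 5 neighbors.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

wide = pd.DataFrame(
    {
        "P1": [15.0, 15.2, np.nan, 18.0],
        "P2": [16.0, 16.1, 16.2, 19.0],
        "P3": [14.0, np.nan, 14.1, 17.0],
    },
    index=pd.Index([f"Sample_{i:03d}" for i in range(4)], name="Sample ID"),
)

# distances are computed on the features observed in both samples
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(wide), index=wide.index, columns=wide.columns
)
print(imputed)
```

Observed values are passed through unchanged; only the `NaN` entries are replaced.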
Predictions#
Predictions are created on the same data that was used for training (training and validation datasets).
Predictions include the missing values (which are not compared further).
Create predictions and select the entries belonging to each split.
Sample ID protein groups
Sample_000 A0A024QZX5;A0A087X1N8;P35237 15.912
A0A024R0T9;K7ER74;P02655 16.852
A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 15.570
A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 16.481
A0A075B6H7 17.301
...
Sample_209 Q9Y6R7 19.275
Q9Y6X5 15.732
Q9Y6Y8;Q9Y6Y8-2 19.577
Q9Y6Y9 11.042
S4R3U6 11.791
Length: 298410, dtype: float64
| Sample ID | protein groups | observed | KNN5 |
|---|---|---|---|
| Sample_158 | Q9UN70;Q9UN70-2 | 14.630 | 15.449 |
| Sample_050 | Q9Y287 | 15.755 | 17.314 |
| Sample_107 | Q8N475;Q8N475-2 | 15.029 | 14.355 |
| Sample_199 | P06307 | 19.376 | 19.385 |
| Sample_067 | Q5VUB5 | 15.309 | 15.040 |
| ... | ... | ... | ... |
| Sample_111 | F6SYF8;Q9UBP4 | 22.822 | 22.899 |
| Sample_002 | A0A0A0MT36 | 18.165 | 16.142 |
| Sample_049 | Q8WY21;Q8WY21-2;Q8WY21-3;Q8WY21-4 | 15.525 | 15.574 |
| Sample_182 | Q8NFT8 | 14.379 | 13.480 |
| Sample_123 | Q16853;Q16853-2 | 14.504 | 14.627 |
12600 rows × 2 columns
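Selecting the predictions for a split amounts to stacking the imputed wide table back into long format and looking up the index of the held-out values. A minimal sketch with toy data; the column name `KNN5` mirrors `model_key`, everything else is illustrative:

```python
import pandas as pd

imputed = pd.DataFrame(
    [[15.9, 16.8], [16.0, 16.9]],
    index=pd.Index(["Sample_000", "Sample_001"], name="Sample ID"),
    columns=pd.Index(["P1", "P2"], name="protein groups"),
)
val_y = pd.Series(
    [16.5],
    index=pd.MultiIndex.from_tuples(
        [("Sample_001", "P2")], names=["Sample ID", "protein groups"]
    ),
    name="observed",
)

# long-format predictions for every (sample, protein group) pair
pred_long = imputed.stack()

# keep only the entries that were held out for validation
val_pred = val_y.to_frame()
val_pred["KNN5"] = pred_long.reindex(val_y.index)
print(val_pred)
```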
| Sample ID | protein groups | observed | KNN5 |
|---|---|---|---|
| Sample_000 | A0A075B6P5;P01615 | 17.016 | 17.207 |
| | A0A087X089;Q16627;Q16627-2 | 18.280 | 18.146 |
| | A0A0B4J2B5;S4R460 | 21.735 | 21.959 |
| | A0A140T971;O95865;Q5SRR8;Q5SSV3 | 14.603 | 15.143 |
| | A0A140TA33;A0A140TA41;A0A140TA52;P22105;P22105-3;P22105-4 | 16.143 | 16.743 |
| ... | ... | ... | ... |
| Sample_209 | Q96ID5 | 16.074 | 15.981 |
| | Q9H492;Q9H492-2 | 13.173 | 13.432 |
| | Q9HC57 | 14.207 | 14.131 |
| | Q9NPH3;Q9NPH3-2;Q9NPH3-5 | 14.962 | 15.123 |
| | Q9UGM5;Q9UGM5-2 | 16.871 | 16.378 |
12600 rows × 2 columns
save missing values predictions
Sample ID protein groups
Sample_000 A0A075B6J9 15.591
A0A075B6Q5 15.915
A0A075B6R2 16.857
A0A075B6S5 16.192
A0A087WSY4 16.490
...
Sample_209 Q9P1W8;Q9P1W8-2;Q9P1W8-4 15.979
Q9UI40;Q9UI40-2 16.704
Q9UIW2 17.246
Q9UMX0;Q9UMX0-2;Q9UMX0-4 12.989
Q9UP79 15.647
Name: intensity, Length: 46401, dtype: float64
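The predictions for values that were never measured can be picked out with a boolean mask of the missing entries. This sketch assumes `observed` holds all measured intensities (including those held out as fake NAs) in wide format; the names and values are illustrative.

```python
import numpy as np
import pandas as pd

idx = pd.Index(["Sample_000", "Sample_001"], name="Sample ID")
cols = pd.Index(["P1", "P2"], name="protein groups")
observed = pd.DataFrame([[15.9, np.nan], [np.nan, 16.9]], index=idx, columns=cols)
imputed = pd.DataFrame([[15.9, 16.7], [16.1, 16.9]], index=idx, columns=cols)

# True where no intensity was measured at all
mask_na = observed.isna().stack()

# keep only the imputed values for entries that were truly missing
pred_real_na = imputed.stack()[mask_na]
pred_real_na.name = "intensity"
print(pred_real_na)
```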
Plots#
validation data
Comparisons#
Validation data#
all measured (identified, observed) peptides in validation data
It does not make too much sense to compare collab and AEs, as the setup of training and validation data differs.
The fake NAs for the validation step are real test data (used neither for training nor for early stopping).
Selected as truth to compare to: observed
{'KNN5': {'MSE': 0.5165599266763434,
'MAE': 0.4669947218139098,
'N': 12600,
'prop': 1.0}}
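The dictionary above reports the mean squared error (MSE), the mean absolute error (MAE), the number of compared entries `N` and a proportion `prop`. The sketch below shows how such values can be computed from a table of observed and predicted intensities. The function `get_metrics` is hypothetical, and reading `prop` as the share of entries that have a prediction is an assumption.

```python
import pandas as pd


def get_metrics(df: pd.DataFrame, truth: str = "observed") -> dict:
    """Hypothetical helper: per-model error metrics against the truth column."""
    metrics = {}
    y_true = df[truth]
    for model in df.columns.drop(truth):
        y_pred = df[model].dropna()  # only entries with a prediction
        error = y_pred - y_true.loc[y_pred.index]
        metrics[model] = {
            "MSE": float((error**2).mean()),
            "MAE": float(error.abs().mean()),
            "N": int(len(y_pred)),
            "prop": len(y_pred) / len(df),
        }
    return metrics


# first three validation rows shown above
val_pred = pd.DataFrame(
    {"observed": [14.630, 15.755, 15.029], "KNN5": [15.449, 17.314, 14.355]}
)
metrics = get_metrics(val_pred)
print(metrics)
```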
Test Datasplit#
Fake NAs: artificially created NAs. Some data was sampled and explicitly set to missing before it was fed to the model for reconstruction.
Selected as truth to compare to: observed
{'KNN5': {'MSE': 0.5179350572570403,
'MAE': 0.46923050106832126,
'N': 12600,
'prop': 1.0}}
Save all metrics as json
{ 'test_fake_na': { 'KNN5': { 'MAE': 0.46923050106832126,
'MSE': 0.5179350572570403,
'N': 12600,
'prop': 1.0}},
'valid_fake_na': { 'KNN5': { 'MAE': 0.4669947218139098,
'MSE': 0.5165599266763434,
'N': 12600,
'prop': 1.0}}}
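A nested dictionary like the one above serializes directly to JSON. The sketch below writes it to a temporary folder and reads it back; the file name `metrics_KNN5.json` is hypothetical and the real output location is determined by `out_metrics`.

```python
import json
import tempfile
from pathlib import Path

metrics = {
    "valid_fake_na": {
        "KNN5": {"MSE": 0.5165599266763434, "MAE": 0.4669947218139098,
                 "N": 12600, "prop": 1.0}
    },
    "test_fake_na": {
        "KNN5": {"MSE": 0.5179350572570403, "MAE": 0.46923050106832126,
                 "N": 12600, "prop": 1.0}
    },
}

# hypothetical file name inside a temporary output folder
fname = Path(tempfile.mkdtemp()) / "metrics_KNN5.json"
with open(fname, "w") as f:
    json.dump(metrics, f, indent=4)

# reading the file back yields the same nested dictionary
reloaded = json.loads(fname.read_text())
print(reloaded["test_fake_na"]["KNN5"]["MAE"])
```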
| model | metric_name | valid_fake_na | test_fake_na |
|---|---|---|---|
| KNN5 | MSE | 0.517 | 0.518 |
| | MAE | 0.467 | 0.469 |
| | N | 12,600.000 | 12,600.000 |
| | prop | 1.000 | 1.000 |
Save predictions#
Config#
{}
{'M': 1421,
'batch_size': 64,
'cuda': True,
'data': Path('runs/alzheimer_study/data'),
'epochs_max': 50,
'file_format': 'csv',
'fn_rawfile_metadata': 'https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv',
'folder_data': '',
'folder_experiment': Path('runs/alzheimer_study'),
'force_train': True,
'meta_cat_col': None,
'meta_date_col': None,
'model': 'KNN',
'model_key': 'KNN5',
'n_params': 1,
'neighbors': 5,
'out_figures': Path('runs/alzheimer_study/figures'),
'out_folder': Path('runs/alzheimer_study'),
'out_metrics': Path('runs/alzheimer_study'),
'out_models': Path('runs/alzheimer_study'),
'out_preds': Path('runs/alzheimer_study/preds'),
'sample_idx_position': 0,
'save_pred_real_na': True}