Variational Autoencoder#
pimmslearn - INFO Experiment 03 - Analysis of latent spaces and performance comparisions
Papermill script parameters:
# files and folders
# Datasplit folder with data for experiment
folder_experiment: str = 'runs/example'
folder_data: str = '' # specify data directory if needed
file_format: str = 'csv' # file format of created splits, default pickle (pkl)
# Machine parsed metadata from rawfile workflow
fn_rawfile_metadata: str = 'data/dev_datasets/HeLa_6070/files_selected_metadata_N50.csv'
# training
epochs_max: int = 50 # Maximum number of epochs
batch_size: int = 64 # Batch size for training (and evaluation)
cuda: bool = True # Whether to use a GPU for training
# model
# Dimensionality of encoding dimension (latent space of model)
latent_dim: int = 25
# An underscore-separated string of layers, e.g. '256_128' for the encoder; the reverse will be used for the decoder
hidden_layers: str = '256_128'
# force_train:bool = True # Force training when saved model could be used. Per default re-train model
patience: int = 50 # Patience for early stopping
sample_idx_position: int = 0 # position of index which is sample ID
model: str = 'VAE' # model name
model_key: str = 'VAE' # potentially alternative key for model (grid search)
save_pred_real_na: bool = True # Save all predictions for missing values
# metadata -> defaults for metadata extracted from machine data
meta_date_col: str = None # date column in meta data
meta_cat_col: str = None # category column in meta data
# Parameters
model = "VAE"
latent_dim = 10
batch_size = 64
epochs_max = 300
hidden_layers = "64"
sample_idx_position = 0
cuda = False
save_pred_real_na = True
fn_rawfile_metadata = "https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv"
folder_experiment = "runs/alzheimer_study"
model_key = "VAE"
Some argument transformations
{'folder_experiment': 'runs/alzheimer_study',
'folder_data': '',
'file_format': 'csv',
'fn_rawfile_metadata': 'https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv',
'epochs_max': 300,
'batch_size': 64,
'cuda': False,
'latent_dim': 10,
'hidden_layers': '64',
'patience': 50,
'sample_idx_position': 0,
'model': 'VAE',
'model_key': 'VAE',
'save_pred_real_na': True,
'meta_date_col': None,
'meta_cat_col': None}
{'batch_size': 64,
'cuda': False,
'data': Path('runs/alzheimer_study/data'),
'epochs_max': 300,
'file_format': 'csv',
'fn_rawfile_metadata': 'https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv',
'folder_data': '',
'folder_experiment': Path('runs/alzheimer_study'),
'hidden_layers': [64],
'latent_dim': 10,
'meta_cat_col': None,
'meta_date_col': None,
'model': 'VAE',
'model_key': 'VAE',
'out_figures': Path('runs/alzheimer_study/figures'),
'out_folder': Path('runs/alzheimer_study'),
'out_metrics': Path('runs/alzheimer_study'),
'out_models': Path('runs/alzheimer_study'),
'out_preds': Path('runs/alzheimer_study/preds'),
'patience': 50,
'sample_idx_position': 0,
'save_pred_real_na': True}
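The second dictionary shows `hidden_layers` converted from the string `'64'` to the list `[64]`. A minimal sketch of such a transformation follows. The helper name `parse_hidden_layers` is hypothetical and not taken from pimmslearn; it only illustrates the underscore-separated convention described in the parameters.

```python
def parse_hidden_layers(spec: str) -> list[int]:
    """Turn an underscore-separated string such as '256_128' into [256, 128]."""
    return [int(width) for width in spec.split("_") if width]


encoder_layers = parse_hidden_layers("256_128")
# The decoder mirrors the encoder, so the layer widths are reversed.
decoder_layers = encoder_layers[::-1]
print(encoder_layers, decoder_layers)
```

For the run shown here, `parse_hidden_layers("64")` would give a single hidden layer of width 64.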
Some naming conventions
Load data in long format#
pimmslearn.io.datasplits - INFO Loaded 'train_X' from file: runs/alzheimer_study/data/train_X.csv
pimmslearn.io.datasplits - INFO Loaded 'val_y' from file: runs/alzheimer_study/data/val_y.csv
pimmslearn.io.datasplits - INFO Loaded 'test_y' from file: runs/alzheimer_study/data/test_y.csv
data is loaded in long format
Sample ID protein groups
Sample_008 P18669 17.218
Sample_065 I3L0N3;P46459 15.545
Sample_082 Q06828 17.710
Sample_103 Q02487;Q02487-2 17.373
Sample_157 Q9P0K9 16.817
Name: intensity, dtype: float64
Infer index names from long format
pimmslearn - INFO sample_id = 'Sample ID', single feature: index_column = 'protein groups'
load meta data for splits
| Sample ID | _collection site | _age at CSF collection | _gender | _t-tau [ng/L] | _p-tau [ng/L] | _Abeta-42 [ng/L] | _Abeta-40 [ng/L] | _Abeta-42/Abeta-40 ratio | _primary biochemical AD classification | _clinical AD diagnosis | _MMSE score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample_000 | Sweden | 71.000 | f | 703.000 | 85.000 | 562.000 | NaN | NaN | biochemical control | NaN | NaN |
| Sample_001 | Sweden | 77.000 | m | 518.000 | 91.000 | 334.000 | NaN | NaN | biochemical AD | NaN | NaN |
| Sample_002 | Sweden | 75.000 | m | 974.000 | 87.000 | 515.000 | NaN | NaN | biochemical AD | NaN | NaN |
| Sample_003 | Sweden | 72.000 | f | 950.000 | 109.000 | 394.000 | NaN | NaN | biochemical AD | NaN | NaN |
| Sample_004 | Sweden | 63.000 | f | 873.000 | 88.000 | 234.000 | NaN | NaN | biochemical AD | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sample_205 | Berlin | 69.000 | f | 1,945.000 | NaN | 699.000 | 12,140.000 | 0.058 | biochemical AD | AD | 17.000 |
| Sample_206 | Berlin | 73.000 | m | 299.000 | NaN | 1,420.000 | 16,571.000 | 0.086 | biochemical control | non-AD | 28.000 |
| Sample_207 | Berlin | 71.000 | f | 262.000 | NaN | 639.000 | 9,663.000 | 0.066 | biochemical control | non-AD | 28.000 |
| Sample_208 | Berlin | 83.000 | m | 289.000 | NaN | 1,436.000 | 11,285.000 | 0.127 | biochemical control | non-AD | 24.000 |
| Sample_209 | Berlin | 63.000 | f | 591.000 | NaN | 1,299.000 | 11,232.000 | 0.116 | biochemical control | non-AD | 29.000 |
210 rows × 11 columns
Initialize Comparison#
Replicates idea for truly missing values: define the truth by using n=3 replicates to impute each sample.
Real test data:
Not used for predictions or early stopping.
[x] Add some additional NAs based on the distribution of the data.
protein groups
A0A024QZX5;A0A087X1N8;P35237 197
A0A024R0T9;K7ER74;P02655 208
A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 185
A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 208
A0A075B6H7 97
Name: freq, dtype: int64
Produce some additional simulated samples#
The validation simulated NAs are used by all models to evaluate training performance.
| Sample ID | protein groups | observed |
|---|---|---|
| Sample_158 | Q9UN70;Q9UN70-2 | 14.630 |
| Sample_050 | Q9Y287 | 15.755 |
| Sample_107 | Q8N475;Q8N475-2 | 15.029 |
| Sample_199 | P06307 | 19.376 |
| Sample_067 | Q5VUB5 | 15.309 |
| ... | ... | ... |
| Sample_111 | F6SYF8;Q9UBP4 | 22.822 |
| Sample_002 | A0A0A0MT36 | 18.165 |
| Sample_049 | Q8WY21;Q8WY21-2;Q8WY21-3;Q8WY21-4 | 15.525 |
| Sample_182 | Q8NFT8 | 14.379 |
| Sample_123 | Q16853;Q16853-2 | 14.504 |
12600 rows × 1 columns
| | observed |
|---|---|
| count | 12,600.000 |
| mean | 16.339 |
| std | 2.741 |
| min | 7.209 |
| 25% | 14.412 |
| 50% | 15.935 |
| 75% | 17.910 |
| max | 30.140 |
Data in wide format#
Autoencoders need data in wide format.
| protein groups | A0A024QZX5;A0A087X1N8;P35237 | A0A024R0T9;K7ER74;P02655 | A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 | A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 | A0A075B6H7 | A0A075B6H9 | A0A075B6I0 | A0A075B6I1 | A0A075B6I6 | A0A075B6I9 | ... | Q9Y653;Q9Y653-2;Q9Y653-3 | Q9Y696 | Q9Y6C2 | Q9Y6N6 | Q9Y6N7;Q9Y6N7-2;Q9Y6N7-4 | Q9Y6R7 | Q9Y6X5 | Q9Y6Y8;Q9Y6Y8-2 | Q9Y6Y9 | S4R3U6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample ID | |||||||||||||||||||||
| Sample_000 | 15.912 | 16.852 | 15.570 | 16.481 | 17.301 | 20.246 | 16.764 | 17.584 | 16.988 | 20.054 | ... | 16.012 | 15.178 | NaN | 15.050 | 16.842 | NaN | NaN | 19.563 | NaN | 12.805 |
| Sample_001 | NaN | 16.874 | 15.519 | 16.387 | NaN | 19.941 | 18.786 | 17.144 | NaN | 19.067 | ... | 15.528 | 15.576 | NaN | 14.833 | 16.597 | 20.299 | 15.556 | 19.386 | 13.970 | 12.442 |
| Sample_002 | 16.111 | NaN | 15.935 | 16.416 | 18.175 | 19.251 | 16.832 | 15.671 | 17.012 | 18.569 | ... | 15.229 | 14.728 | 13.757 | 15.118 | 17.440 | 19.598 | 15.735 | 20.447 | 12.636 | 12.505 |
| Sample_003 | 16.107 | 17.032 | 15.802 | 16.979 | 15.963 | 19.628 | 17.852 | 18.877 | 14.182 | 18.985 | ... | 15.495 | 14.590 | 14.682 | 15.140 | 17.356 | 19.429 | NaN | 20.216 | NaN | 12.445 |
| Sample_004 | 15.603 | 15.331 | 15.375 | 16.679 | NaN | 20.450 | 18.682 | 17.081 | 14.140 | 19.686 | ... | 14.757 | NaN | NaN | 15.256 | 17.075 | 19.582 | 15.328 | NaN | 13.145 | NaN |
5 rows × 1421 columns
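The step from long to wide format can be sketched without any dependencies. This is an illustration under assumptions, not the notebook's own code: the function name `long_to_wide` is hypothetical, and the three values are taken from the long-format preview shown earlier.

```python
import math


def long_to_wide(long_data):
    """Pivot {(sample, feature): value} into {sample: [one value per feature]}.

    Combinations that were never measured are filled with NaN.
    """
    samples = sorted({sample for sample, _ in long_data})
    features = sorted({feature for _, feature in long_data})
    wide = {
        sample: [long_data.get((sample, feature), math.nan) for feature in features]
        for sample in samples
    }
    return features, wide


long_data = {
    ("Sample_008", "P18669"): 17.218,
    ("Sample_065", "I3L0N3;P46459"): 15.545,
    ("Sample_082", "Q06828"): 17.710,
}
features, wide = long_to_wide(long_data)
print(features)
print(wide["Sample_008"])
```

In pandas, the equivalent step on a Series with a two-level index is typically `Series.unstack()`.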
Add interpolation performance#
Fill Validation data with potentially missing features#
| protein groups | A0A024QZX5;A0A087X1N8;P35237 | A0A024R0T9;K7ER74;P02655 | A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 | A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 | A0A075B6H7 | A0A075B6H9 | A0A075B6I0 | A0A075B6I1 | A0A075B6I6 | A0A075B6I9 | ... | Q9Y653;Q9Y653-2;Q9Y653-3 | Q9Y696 | Q9Y6C2 | Q9Y6N6 | Q9Y6N7;Q9Y6N7-2;Q9Y6N7-4 | Q9Y6R7 | Q9Y6X5 | Q9Y6Y8;Q9Y6Y8-2 | Q9Y6Y9 | S4R3U6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample ID | |||||||||||||||||||||
| Sample_000 | 15.912 | 16.852 | 15.570 | 16.481 | 17.301 | 20.246 | 16.764 | 17.584 | 16.988 | 20.054 | ... | 16.012 | 15.178 | NaN | 15.050 | 16.842 | NaN | NaN | 19.563 | NaN | 12.805 |
| Sample_001 | NaN | 16.874 | 15.519 | 16.387 | NaN | 19.941 | 18.786 | 17.144 | NaN | 19.067 | ... | 15.528 | 15.576 | NaN | 14.833 | 16.597 | 20.299 | 15.556 | 19.386 | 13.970 | 12.442 |
| Sample_002 | 16.111 | NaN | 15.935 | 16.416 | 18.175 | 19.251 | 16.832 | 15.671 | 17.012 | 18.569 | ... | 15.229 | 14.728 | 13.757 | 15.118 | 17.440 | 19.598 | 15.735 | 20.447 | 12.636 | 12.505 |
| Sample_003 | 16.107 | 17.032 | 15.802 | 16.979 | 15.963 | 19.628 | 17.852 | 18.877 | 14.182 | 18.985 | ... | 15.495 | 14.590 | 14.682 | 15.140 | 17.356 | 19.429 | NaN | 20.216 | NaN | 12.445 |
| Sample_004 | 15.603 | 15.331 | 15.375 | 16.679 | NaN | 20.450 | 18.682 | 17.081 | 14.140 | 19.686 | ... | 14.757 | NaN | NaN | 15.256 | 17.075 | 19.582 | 15.328 | NaN | 13.145 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sample_205 | 15.682 | 16.886 | 14.910 | 16.482 | NaN | 17.705 | 17.039 | NaN | 16.413 | 19.102 | ... | NaN | 15.684 | 14.236 | 15.415 | 17.551 | 17.922 | 16.340 | 19.928 | 12.929 | NaN |
| Sample_206 | 15.798 | 17.554 | 15.600 | 15.938 | NaN | 18.154 | 18.152 | 16.503 | 16.860 | 18.538 | ... | 15.422 | 16.106 | NaN | 15.345 | 17.084 | 18.708 | NaN | 19.433 | NaN | NaN |
| Sample_207 | 15.739 | NaN | 15.469 | 16.898 | NaN | 18.636 | 17.950 | 16.321 | 16.401 | 18.849 | ... | 15.808 | 16.098 | 14.403 | 15.715 | NaN | 18.725 | 16.138 | 19.599 | 13.637 | 11.174 |
| Sample_208 | 15.477 | 16.779 | 14.995 | 16.132 | NaN | 14.908 | NaN | NaN | 16.119 | 18.368 | ... | 15.157 | 16.712 | NaN | 14.640 | 16.533 | 19.411 | 15.807 | 19.545 | NaN | NaN |
| Sample_209 | NaN | 17.261 | 15.175 | 16.235 | NaN | 17.893 | 17.744 | 16.371 | 15.780 | 18.806 | ... | 15.237 | 15.652 | 15.211 | 14.205 | 16.749 | 19.275 | 15.732 | 19.577 | 11.042 | 11.791 |
210 rows × 1421 columns
| protein groups | A0A024QZX5;A0A087X1N8;P35237 | A0A024R0T9;K7ER74;P02655 | A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 | A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 | A0A075B6H7 | A0A075B6H9 | A0A075B6I0 | A0A075B6I1 | A0A075B6I6 | A0A075B6I9 | ... | Q9Y653;Q9Y653-2;Q9Y653-3 | Q9Y696 | Q9Y6C2 | Q9Y6N6 | Q9Y6N7;Q9Y6N7-2;Q9Y6N7-4 | Q9Y6R7 | Q9Y6X5 | Q9Y6Y8;Q9Y6Y8-2 | Q9Y6Y9 | S4R3U6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample ID | |||||||||||||||||||||
| Sample_000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 19.863 | NaN | NaN | NaN | NaN |
| Sample_001 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_002 | NaN | 14.523 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_003 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_004 | NaN | NaN | NaN | NaN | 15.473 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 14.048 | NaN | NaN | NaN | NaN | 19.867 | NaN | 12.235 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sample_205 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 11.802 |
| Sample_206 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_207 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_208 | NaN | NaN | NaN | NaN | NaN | NaN | 17.530 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_209 | 15.727 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
210 rows × 1419 columns
| protein groups | A0A024QZX5;A0A087X1N8;P35237 | A0A024R0T9;K7ER74;P02655 | A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 | A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 | A0A075B6H7 | A0A075B6H9 | A0A075B6I0 | A0A075B6I1 | A0A075B6I6 | A0A075B6I9 | ... | Q9Y653;Q9Y653-2;Q9Y653-3 | Q9Y696 | Q9Y6C2 | Q9Y6N6 | Q9Y6N7;Q9Y6N7-2;Q9Y6N7-4 | Q9Y6R7 | Q9Y6X5 | Q9Y6Y8;Q9Y6Y8-2 | Q9Y6Y9 | S4R3U6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample ID | |||||||||||||||||||||
| Sample_000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 19.863 | NaN | NaN | NaN | NaN |
| Sample_001 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_002 | NaN | 14.523 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_003 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_004 | NaN | NaN | NaN | NaN | 15.473 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 14.048 | NaN | NaN | NaN | NaN | 19.867 | NaN | 12.235 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sample_205 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 11.802 |
| Sample_206 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_207 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_208 | NaN | NaN | NaN | NaN | NaN | NaN | 17.530 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_209 | 15.727 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
210 rows × 1421 columns
Variational Autoencoder#
Analysis: DataLoaders, Model, transform#
Analysis: DataLoaders, Model#
VAE(
(encoder): Sequential(
(0): Linear(in_features=1421, out_features=64, bias=True)
(1): Dropout(p=0.2, inplace=False)
(2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): LeakyReLU(negative_slope=0.1)
(4): Linear(in_features=64, out_features=20, bias=True)
)
(decoder): Sequential(
(0): Linear(in_features=10, out_features=64, bias=True)
(1): Dropout(p=0.2, inplace=False)
(2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): LeakyReLU(negative_slope=0.1)
(4): Linear(in_features=64, out_features=2842, bias=True)
)
)
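In the printout, the last encoder layer has 20 outputs (twice `latent_dim = 10`) and the last decoder layer has 2842 outputs (twice the 1421 features). A common reading in VAEs is that each pair holds a mean and a log-variance, but that interpretation is an assumption here. The number of trainable parameters can be checked by hand from the layer sizes. The helper names below are hypothetical.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features: int) -> int:
    """Learnable scale and shift of a BatchNorm1d layer with affine=True."""
    return 2 * n_features


n_features, hidden, latent_dim = 1421, 64, 10

encoder = (
    linear_params(n_features, hidden)
    + batchnorm_params(hidden)
    + linear_params(hidden, 2 * latent_dim)  # 20 outputs for 10 latent dimensions
)
decoder = (
    linear_params(latent_dim, hidden)
    + batchnorm_params(hidden)
    + linear_params(hidden, 2 * n_features)  # 2842 outputs for 1421 features
)
n_params = encoder + decoder
print(n_params)  # 277998
```

Dropout and LeakyReLU have no trainable parameters, so they do not appear in the sum. The result agrees with the `n_params` entry in the final config dump.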
Training#
Start Fit
- before_fit : [TrainEvalCallback, Recorder, ProgressCallback, EarlyStoppingCallback]
Start Epoch Loop
- before_epoch : [Recorder, ProgressCallback]
Start Train
- before_train : [TrainEvalCallback, Recorder, ProgressCallback]
Start Batch Loop
- before_batch : [ModelAdapterVAE, CastToTensor]
- after_pred : [ModelAdapterVAE]
- after_loss : [ModelAdapterVAE]
- before_backward: []
- before_step : []
- after_step : []
- after_cancel_batch: []
- after_batch : [TrainEvalCallback, Recorder, ProgressCallback]
End Batch Loop
End Train
- after_cancel_train: [Recorder]
- after_train : [Recorder, ProgressCallback]
Start Valid
- before_validate: [TrainEvalCallback, Recorder, ProgressCallback]
Start Batch Loop
- **CBs same as train batch**: []
End Batch Loop
End Valid
- after_cancel_validate: [Recorder]
- after_validate : [Recorder, ProgressCallback]
End Epoch Loop
- after_cancel_epoch: []
- after_epoch : [Recorder, EarlyStoppingCallback]
End Fit
- after_cancel_fit: []
- after_fit : [ProgressCallback, EarlyStoppingCallback]
Adding an EarlyStoppingCallback results in an error. A potential fix in PR3509 is not yet in the current version. Try again later.
SuggestedLRs(valley=0.004365158267319202)
dump model config
| epoch | train_loss | valid_loss | time |
|---|---|---|---|
| 0 | 1681.260010 | 92.933205 | 00:00 |
| 1 | 1677.450073 | 93.197266 | 00:00 |
| 2 | 1676.770386 | 93.968567 | 00:00 |
| 3 | 1670.359009 | 94.387527 | 00:00 |
| 4 | 1667.820068 | 94.632935 | 00:00 |
| 5 | 1665.179565 | 95.508972 | 00:00 |
| 6 | 1661.601318 | 95.151268 | 00:00 |
| 7 | 1658.818848 | 95.213013 | 00:00 |
| 8 | 1653.829956 | 95.275299 | 00:00 |
| 9 | 1651.753784 | 95.326698 | 00:00 |
| 10 | 1647.774902 | 95.325226 | 00:00 |
| 11 | 1642.565918 | 95.175858 | 00:00 |
| 12 | 1636.571167 | 94.771996 | 00:00 |
| 13 | 1630.033936 | 94.969894 | 00:00 |
| 14 | 1623.838257 | 94.011261 | 00:00 |
| 15 | 1617.940063 | 93.753952 | 00:00 |
| 16 | 1611.727295 | 93.625000 | 00:00 |
| 17 | 1605.685059 | 93.076286 | 00:00 |
| 18 | 1596.703369 | 92.917961 | 00:00 |
| 19 | 1588.318726 | 92.697433 | 00:00 |
| 20 | 1580.251587 | 92.167198 | 00:00 |
| 21 | 1572.025146 | 92.470650 | 00:00 |
| 22 | 1562.621338 | 92.372658 | 00:00 |
| 23 | 1552.168335 | 92.248528 | 00:00 |
| 24 | 1541.813965 | 92.515396 | 00:00 |
| 25 | 1530.719360 | 92.725372 | 00:00 |
| 26 | 1519.046997 | 92.691254 | 00:00 |
| 27 | 1507.603760 | 92.719414 | 00:00 |
| 28 | 1497.102783 | 92.570152 | 00:00 |
| 29 | 1486.316162 | 92.058990 | 00:00 |
| 30 | 1474.222412 | 92.939224 | 00:00 |
| 31 | 1462.927124 | 92.705879 | 00:00 |
| 32 | 1451.918091 | 93.054527 | 00:00 |
| 33 | 1439.776855 | 93.495575 | 00:00 |
| 34 | 1430.229126 | 94.568779 | 00:00 |
| 35 | 1419.189209 | 94.725548 | 00:00 |
| 36 | 1409.200684 | 94.876076 | 00:00 |
| 37 | 1399.114990 | 95.296013 | 00:00 |
| 38 | 1389.443115 | 94.942818 | 00:00 |
| 39 | 1379.382324 | 95.005264 | 00:00 |
| 40 | 1369.353149 | 94.872528 | 00:00 |
| 41 | 1359.266724 | 94.546623 | 00:00 |
| 42 | 1349.156860 | 93.559479 | 00:00 |
| 43 | 1339.177612 | 93.103928 | 00:00 |
| 44 | 1329.869629 | 93.074844 | 00:00 |
| 45 | 1321.256836 | 92.431213 | 00:00 |
| 46 | 1312.543579 | 91.778519 | 00:00 |
| 47 | 1305.022339 | 91.950966 | 00:00 |
| 48 | 1297.779053 | 92.367325 | 00:00 |
| 49 | 1290.531250 | 91.971901 | 00:00 |
| 50 | 1283.642578 | 92.359161 | 00:00 |
| 51 | 1274.620239 | 92.333214 | 00:00 |
| 52 | 1269.216431 | 91.942368 | 00:00 |
| 53 | 1263.001709 | 91.799011 | 00:00 |
| 54 | 1256.208618 | 92.077408 | 00:00 |
| 55 | 1249.145752 | 91.962097 | 00:00 |
| 56 | 1241.852783 | 91.425781 | 00:00 |
| 57 | 1236.193115 | 91.663345 | 00:00 |
| 58 | 1229.466431 | 91.787544 | 00:00 |
| 59 | 1222.492676 | 91.886932 | 00:00 |
| 60 | 1218.031860 | 91.858109 | 00:00 |
| 61 | 1211.416748 | 92.664055 | 00:00 |
| 62 | 1205.817993 | 92.675346 | 00:00 |
| 63 | 1200.116577 | 91.576729 | 00:00 |
| 64 | 1194.683838 | 92.051529 | 00:00 |
| 65 | 1189.678955 | 92.148064 | 00:00 |
| 66 | 1183.853394 | 91.904091 | 00:00 |
| 67 | 1180.686523 | 92.228996 | 00:00 |
| 68 | 1176.490967 | 92.599709 | 00:00 |
| 69 | 1173.830688 | 92.476151 | 00:00 |
| 70 | 1169.118774 | 92.980522 | 00:00 |
| 71 | 1166.697388 | 93.198235 | 00:00 |
| 72 | 1162.679932 | 92.535194 | 00:00 |
| 73 | 1159.002075 | 91.792870 | 00:00 |
| 74 | 1154.085449 | 91.593491 | 00:00 |
| 75 | 1151.417847 | 91.624886 | 00:00 |
| 76 | 1149.590942 | 91.542877 | 00:00 |
| 77 | 1145.024048 | 92.156807 | 00:00 |
| 78 | 1141.539062 | 92.733093 | 00:00 |
| 79 | 1138.490234 | 92.969772 | 00:00 |
| 80 | 1135.895508 | 92.002029 | 00:00 |
| 81 | 1131.641235 | 92.181732 | 00:00 |
| 82 | 1129.402344 | 92.788414 | 00:00 |
| 83 | 1126.167969 | 92.456902 | 00:00 |
| 84 | 1122.417114 | 91.888458 | 00:00 |
| 85 | 1118.484863 | 91.517410 | 00:00 |
| 86 | 1116.457397 | 91.753662 | 00:00 |
| 87 | 1113.608765 | 92.899895 | 00:00 |
| 88 | 1111.304443 | 92.329819 | 00:00 |
| 89 | 1108.042114 | 92.442894 | 00:00 |
| 90 | 1106.831543 | 92.127449 | 00:00 |
| 91 | 1104.108521 | 92.666824 | 00:00 |
| 92 | 1103.404541 | 92.222954 | 00:00 |
| 93 | 1101.389160 | 92.531891 | 00:00 |
| 94 | 1098.035156 | 92.592186 | 00:00 |
| 95 | 1095.347412 | 92.046753 | 00:00 |
| 96 | 1092.231445 | 91.951965 | 00:00 |
| 97 | 1089.109741 | 92.109200 | 00:00 |
| 98 | 1087.489258 | 92.385918 | 00:00 |
| 99 | 1085.430420 | 91.938026 | 00:00 |
| 100 | 1084.357056 | 92.421646 | 00:00 |
| 101 | 1082.756592 | 92.208168 | 00:00 |
| 102 | 1080.615234 | 91.822823 | 00:00 |
| 103 | 1081.621948 | 92.788315 | 00:00 |
| 104 | 1080.165771 | 92.481972 | 00:00 |
| 105 | 1078.049072 | 92.701241 | 00:00 |
| 106 | 1076.205933 | 92.097954 | 00:00 |
No improvement since epoch 56: early stopping
Save number of actually trained epochs
107
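The stopping point follows from `patience = 50`: the lowest validation loss occurs at epoch 56, and after 50 further epochs without improvement (epoch 106) training stops, giving 107 trained epochs (0 to 106). A minimal sketch of such a patience rule follows. The function name is hypothetical, and the sketch ignores refinements such as a minimum improvement threshold.

```python
def epochs_until_early_stop(valid_losses, patience):
    """Return (best_epoch, n_epochs_run) for a simple patience rule.

    Training stops once `patience` epochs have passed without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch + 1
    return best_epoch, len(valid_losses)


# Toy curve: improves until epoch 2, then stagnates.
losses = [1.0, 0.9, 0.8, 0.85, 0.82, 0.81, 0.83]
print(epochs_until_early_stop(losses, patience=3))  # (2, 6)
```

Applied to the table above, the same rule with a best epoch of 56 and a patience of 50 stops after epoch 106.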
Loss normalized by total number of measurements#
pimmslearn.plotting - INFO Saved Figures to runs/alzheimer_study/figures/vae_training
Predictions#
create predictions and select validation data predictions
Sample ID protein groups
Sample_000 A0A024QZX5;A0A087X1N8;P35237 15.896
A0A024R0T9;K7ER74;P02655 16.885
A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 15.779
A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 16.635
A0A075B6H7 16.771
...
Sample_209 Q9Y6R7 18.995
Q9Y6X5 15.681
Q9Y6Y8;Q9Y6Y8-2 19.424
Q9Y6Y9 11.946
S4R3U6 11.373
Length: 298410, dtype: float32
| Sample ID | protein groups | observed | VAE |
|---|---|---|---|
| Sample_158 | Q9UN70;Q9UN70-2 | 14.630 | 15.716 |
| Sample_050 | Q9Y287 | 15.755 | 16.778 |
| Sample_107 | Q8N475;Q8N475-2 | 15.029 | 14.275 |
| Sample_199 | P06307 | 19.376 | 19.127 |
| Sample_067 | Q5VUB5 | 15.309 | 14.968 |
| ... | ... | ... | ... |
| Sample_111 | F6SYF8;Q9UBP4 | 22.822 | 22.869 |
| Sample_002 | A0A0A0MT36 | 18.165 | 16.131 |
| Sample_049 | Q8WY21;Q8WY21-2;Q8WY21-3;Q8WY21-4 | 15.525 | 15.699 |
| Sample_182 | Q8NFT8 | 14.379 | 13.283 |
| Sample_123 | Q16853;Q16853-2 | 14.504 | 14.566 |
12600 rows × 2 columns
| Sample ID | protein groups | observed | VAE |
|---|---|---|---|
| Sample_000 | A0A075B6P5;P01615 | 17.016 | 17.433 |
| Sample_000 | A0A087X089;Q16627;Q16627-2 | 18.280 | 17.838 |
| Sample_000 | A0A0B4J2B5;S4R460 | 21.735 | 22.173 |
| Sample_000 | A0A140T971;O95865;Q5SRR8;Q5SSV3 | 14.603 | 15.235 |
| Sample_000 | A0A140TA33;A0A140TA41;A0A140TA52;P22105;P22105-3;P22105-4 | 16.143 | 16.620 |
| ... | ... | ... | ... |
| Sample_209 | Q96ID5 | 16.074 | 16.063 |
| Sample_209 | Q9H492;Q9H492-2 | 13.173 | 13.373 |
| Sample_209 | Q9HC57 | 14.207 | 14.448 |
| Sample_209 | Q9NPH3;Q9NPH3-2;Q9NPH3-5 | 14.962 | 15.174 |
| Sample_209 | Q9UGM5;Q9UGM5-2 | 16.871 | 16.583 |
12600 rows × 2 columns
save missing values predictions
Sample ID protein groups
Sample_000 A0A075B6J9 15.967
A0A075B6Q5 15.942
A0A075B6R2 16.766
A0A075B6S5 16.407
A0A087WSY4 16.510
...
Sample_209 Q9P1W8;Q9P1W8-2;Q9P1W8-4 16.138
Q9UI40;Q9UI40-2 16.397
Q9UIW2 16.973
Q9UMX0;Q9UMX0-2;Q9UMX0-4 13.600
Q9UP79 15.994
Name: intensity, Length: 46401, dtype: float32
Plots#
validation data
| Sample ID | latent dimension 1 | latent dimension 2 | latent dimension 3 | latent dimension 4 | latent dimension 5 | latent dimension 6 | latent dimension 7 | latent dimension 8 | latent dimension 9 | latent dimension 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Sample_000 | 0.945 | 0.319 | 2.111 | -1.097 | 0.952 | 0.570 | -0.192 | 0.714 | -0.092 | 2.170 |
| Sample_001 | -0.018 | -0.527 | 1.519 | -2.022 | 0.367 | -0.017 | -1.011 | -0.641 | 0.307 | 1.101 |
| Sample_002 | -0.541 | 0.112 | 0.435 | -0.732 | 2.837 | -1.790 | 0.034 | 0.758 | 0.677 | 1.809 |
| Sample_003 | 0.129 | -0.480 | 2.092 | -0.557 | 1.759 | -0.191 | 0.301 | 1.337 | 0.475 | 1.848 |
| Sample_004 | -0.124 | 0.422 | 2.023 | -1.432 | 0.810 | -0.590 | -0.300 | 0.389 | 0.296 | 1.267 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sample_205 | -1.536 | 0.567 | 0.713 | 0.780 | -0.896 | -1.529 | 0.909 | 1.067 | -0.728 | 1.864 |
| Sample_206 | 0.189 | 0.632 | -2.480 | -0.475 | -0.803 | 0.649 | -1.662 | 0.682 | 0.512 | 1.788 |
| Sample_207 | 1.316 | 1.313 | 0.560 | 0.870 | -1.264 | -2.113 | -1.476 | -0.019 | 0.937 | 2.552 |
| Sample_208 | 0.185 | 0.485 | -0.974 | -0.614 | -1.916 | -1.640 | -0.649 | 0.035 | 1.975 | 0.187 |
| Sample_209 | -0.382 | 0.076 | -1.376 | 0.494 | -0.018 | -2.021 | 0.106 | 0.859 | 2.447 | 0.230 |
210 rows × 10 columns
freq_val
1 12
2 18
3 50
4 82
5 108
Name: count, dtype: int64
| protein groups | VAE mean | VAE count |
|---|---|---|
| A0A024QZX5;A0A087X1N8;P35237 | 0.137 | 7 |
| A0A024R0T9;K7ER74;P02655 | 1.248 | 4 |
| A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 | 0.242 | 9 |
| A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 | 0.312 | 6 |
| A0A075B6H7 | 0.659 | 6 |
| ... | ... | ... |
| Q9Y6R7 | 0.421 | 10 |
| Q9Y6X5 | 0.232 | 7 |
| Q9Y6Y8;Q9Y6Y8-2 | 0.347 | 9 |
| Q9Y6Y9 | 0.446 | 15 |
| S4R3U6 | 0.483 | 24 |
1419 rows × 2 columns
| Sample ID | protein groups | VAE |
|---|---|---|
| Sample_158 | Q9UN70;Q9UN70-2 | 1.085 |
| Sample_050 | Q9Y287 | 1.023 |
| Sample_107 | Q8N475;Q8N475-2 | -0.754 |
| Sample_199 | P06307 | -0.249 |
| Sample_067 | Q5VUB5 | -0.341 |
| ... | ... | ... |
| Sample_111 | F6SYF8;Q9UBP4 | 0.047 |
| Sample_002 | A0A0A0MT36 | -2.034 |
| Sample_049 | Q8WY21;Q8WY21-2;Q8WY21-3;Q8WY21-4 | 0.173 |
| Sample_182 | Q8NFT8 | -1.096 |
| Sample_123 | Q16853;Q16853-2 | 0.062 |
12600 rows × 1 columns
Comparisons#
Simulated NAs: artificially created NAs. Some data was sampled and explicitly set to missing before it was fed to the model for reconstruction.
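The idea can be sketched as holding out a random subset of observed values. This is a simplified illustration, not the splitting code used for this run: the function name `simulate_missing` is hypothetical, and uniform sampling is an assumption (the notebook mentions NAs added based on the distribution of the data). The example values come from the long-format preview.

```python
import random


def simulate_missing(long_data, n_simulated, seed=0):
    """Hold out n_simulated observed values as simulated NAs.

    long_data maps (sample, feature) -> intensity. Sampling is uniform here,
    which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    held_out_keys = set(rng.sample(sorted(long_data), n_simulated))
    held_out = {k: v for k, v in long_data.items() if k in held_out_keys}
    kept = {k: v for k, v in long_data.items() if k not in held_out_keys}
    return kept, held_out


observed = {
    ("Sample_008", "P18669"): 17.218,
    ("Sample_065", "I3L0N3;P46459"): 15.545,
    ("Sample_082", "Q06828"): 17.710,
    ("Sample_103", "Q02487;Q02487-2"): 17.373,
    ("Sample_157", "Q9P0K9"): 16.817,
}
train, simulated_na = simulate_missing(observed, n_simulated=2)
print(len(train), len(simulated_na))  # 3 2
```

The held-out values keep their true intensities, so model predictions for those positions can be compared against them.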
Validation data#
All measured (identified, observed) peptides in the validation data.
The simulated NAs for the validation step are real test data (used neither for training nor for early stopping).
Selected as truth to compare to: observed
{'VAE': {'MSE': 0.4570371945045525,
'MAE': 0.42994737989145104,
'N': 12600,
'prop': 1.0}}
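MSE and MAE are the mean squared and mean absolute differences between predicted and observed intensities over the N simulated NAs. A minimal sketch, assuming a hypothetical helper `mse_mae` and using only the first five validation rows from the predictions table above (so the numbers differ from the full result over 12,600 values):

```python
def mse_mae(observed, predicted):
    """Mean squared error and mean absolute error of paired values."""
    errors = [pred - obs for obs, pred in zip(observed, predicted)]
    n = len(errors)
    return {
        "MSE": sum(e * e for e in errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "N": n,
    }


# First five validation rows (observed intensity, VAE prediction).
observed = [14.630, 15.755, 15.029, 19.376, 15.309]
predicted = [15.716, 16.778, 14.275, 19.127, 14.968]
metrics = mse_mae(observed, predicted)
print(metrics)
```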
Test Datasplit#
Selected as truth to compare to: observed
{'VAE': {'MSE': 0.4769721212935629,
'MAE': 0.4353523445903286,
'N': 12600,
'prop': 1.0}}
Save all metrics as json
{ 'test_simulated_na': { 'VAE': { 'MAE': 0.4353523445903286,
'MSE': 0.4769721212935629,
'N': 12600,
'prop': 1.0}},
'valid_simulated_na': { 'VAE': { 'MAE': 0.42994737989145104,
'MSE': 0.4570371945045525,
'N': 12600,
'prop': 1.0}}}
| model | metric_name | valid_simulated_na | test_simulated_na |
|---|---|---|---|
| VAE | MSE | 0.457 | 0.477 |
| VAE | MAE | 0.430 | 0.435 |
| VAE | N | 12,600.000 | 12,600.000 |
| VAE | prop | 1.000 | 1.000 |
Save predictions#
Config#
{}
{'M': 1421,
'batch_size': 64,
'cuda': False,
'data': Path('runs/alzheimer_study/data'),
'epoch_trained': 107,
'epochs_max': 300,
'file_format': 'csv',
'fn_rawfile_metadata': 'https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv',
'folder_data': '',
'folder_experiment': Path('runs/alzheimer_study'),
'hidden_layers': [64],
'latent_dim': 10,
'meta_cat_col': None,
'meta_date_col': None,
'model': 'VAE',
'model_key': 'VAE',
'n_params': 277998,
'out_figures': Path('runs/alzheimer_study/figures'),
'out_folder': Path('runs/alzheimer_study'),
'out_metrics': Path('runs/alzheimer_study'),
'out_models': Path('runs/alzheimer_study'),
'out_preds': Path('runs/alzheimer_study/preds'),
'patience': 50,
'sample_idx_position': 0,
'save_pred_real_na': True}