Denoising Autoencoder#
pimmslearn - INFO Experiment 03 - Analysis of latent spaces and performance comparisions
Papermill script parameters:
# files and folders
# Datasplit folder with data for experiment
folder_experiment: str = 'runs/example'
folder_data: str = '' # specify data directory if needed
file_format: str = 'csv' # file format of the created splits, e.g. csv or pickle (pkl)
# Machine parsed metadata from rawfile workflow
fn_rawfile_metadata: str = 'data/dev_datasets/HeLa_6070/files_selected_metadata_N50.csv'
# training
epochs_max: int = 50 # Maximum number of epochs
# early_stopping:bool = True # Whether to use early stopping or not
patience: int = 25 # Patience for early stopping
batch_size: int = 64 # Batch size for training (and evaluation)
cuda: bool = True # Whether to use a GPU for training
# model
# Dimensionality of encoding dimension (latent space of model)
latent_dim: int = 25
# An underscore-separated string of layers, e.g. '128_64' for the encoder; the reverse is used for the decoder
hidden_layers: str = '512'
sample_idx_position: int = 0 # position of index which is sample ID
model: str = 'DAE' # model name
model_key: str = 'DAE' # potentially alternative key for model (grid search)
save_pred_real_na: bool = True # Save all predictions for missing values
# metadata -> defaults for metadata extracted from machine data
meta_date_col: str = None # date column in meta data
meta_cat_col: str = None # category column in meta data
# Parameters
model = "DAE"
latent_dim = 10
batch_size = 64
epochs_max = 300
hidden_layers = "64"
sample_idx_position = 0
cuda = False
save_pred_real_na = True
fn_rawfile_metadata = "https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv"
folder_experiment = "runs/alzheimer_study"
model_key = "DAE"
Some argument transformations
{'folder_experiment': 'runs/alzheimer_study',
'folder_data': '',
'file_format': 'csv',
'fn_rawfile_metadata': 'https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv',
'epochs_max': 300,
'patience': 25,
'batch_size': 64,
'cuda': False,
'latent_dim': 10,
'hidden_layers': '64',
'sample_idx_position': 0,
'model': 'DAE',
'model_key': 'DAE',
'save_pred_real_na': True,
'meta_date_col': None,
'meta_cat_col': None}
{'batch_size': 64,
'cuda': False,
'data': Path('runs/alzheimer_study/data'),
'epochs_max': 300,
'file_format': 'csv',
'fn_rawfile_metadata': 'https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv',
'folder_data': '',
'folder_experiment': Path('runs/alzheimer_study'),
'hidden_layers': [64],
'latent_dim': 10,
'meta_cat_col': None,
'meta_date_col': None,
'model': 'DAE',
'model_key': 'DAE',
'out_figures': Path('runs/alzheimer_study/figures'),
'out_folder': Path('runs/alzheimer_study'),
'out_metrics': Path('runs/alzheimer_study'),
'out_models': Path('runs/alzheimer_study'),
'out_preds': Path('runs/alzheimer_study/preds'),
'patience': 25,
'sample_idx_position': 0,
'save_pred_real_na': True}
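The step from the first dictionary to the second one can be sketched as below. This is a minimal illustration under assumptions, not the pimmslearn implementation: the helper names `parse_hidden_layers` and `derive_paths` are hypothetical, and only the resulting keys and values are taken from the listing above.

```python
from pathlib import Path


def parse_hidden_layers(spec: str) -> list[int]:
    """Turn an underscore-separated string such as '128_64' into [128, 64]."""
    return [int(width) for width in spec.split("_")]


def derive_paths(folder_experiment: str) -> dict[str, Path]:
    """Derive sub-folders below the experiment folder (hypothetical helper)."""
    root = Path(folder_experiment)
    return {
        "folder_experiment": root,
        "data": root / "data",
        "out_figures": root / "figures",
        "out_preds": root / "preds",
    }


# Combine the parsed layer widths with the derived folders.
args = {
    "hidden_layers": parse_hidden_layers("64"),
    **derive_paths("runs/alzheimer_study"),
}
```

Parsing the layer string once up front keeps the model construction free of string handling, and `Path` objects make the later joins for figures and predictions explicit.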
Some naming conventions
Load data in long format#
pimmslearn.io.datasplits - INFO Loaded 'train_X' from file: runs/alzheimer_study/data/train_X.csv
pimmslearn.io.datasplits - INFO Loaded 'val_y' from file: runs/alzheimer_study/data/val_y.csv
pimmslearn.io.datasplits - INFO Loaded 'test_y' from file: runs/alzheimer_study/data/test_y.csv
data is loaded in long format
Sample ID protein groups
Sample_137 F8WE04;P04792 14.104
Sample_068 Q5SPY9;Q9NQX5 12.977
Sample_107 P01742 18.367
Sample_178 Q15113 19.910
Sample_069 A0A0U1RQC5 19.182
Name: intensity, dtype: float64
Infer index names from long format
pimmslearn - INFO sample_id = 'Sample ID', single feature: index_column = 'protein groups'
load meta data for splits
| Sample ID | _collection site | _age at CSF collection | _gender | _t-tau [ng/L] | _p-tau [ng/L] | _Abeta-42 [ng/L] | _Abeta-40 [ng/L] | _Abeta-42/Abeta-40 ratio | _primary biochemical AD classification | _clinical AD diagnosis | _MMSE score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample_000 | Sweden | 71.000 | f | 703.000 | 85.000 | 562.000 | NaN | NaN | biochemical control | NaN | NaN |
| Sample_001 | Sweden | 77.000 | m | 518.000 | 91.000 | 334.000 | NaN | NaN | biochemical AD | NaN | NaN |
| Sample_002 | Sweden | 75.000 | m | 974.000 | 87.000 | 515.000 | NaN | NaN | biochemical AD | NaN | NaN |
| Sample_003 | Sweden | 72.000 | f | 950.000 | 109.000 | 394.000 | NaN | NaN | biochemical AD | NaN | NaN |
| Sample_004 | Sweden | 63.000 | f | 873.000 | 88.000 | 234.000 | NaN | NaN | biochemical AD | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sample_205 | Berlin | 69.000 | f | 1,945.000 | NaN | 699.000 | 12,140.000 | 0.058 | biochemical AD | AD | 17.000 |
| Sample_206 | Berlin | 73.000 | m | 299.000 | NaN | 1,420.000 | 16,571.000 | 0.086 | biochemical control | non-AD | 28.000 |
| Sample_207 | Berlin | 71.000 | f | 262.000 | NaN | 639.000 | 9,663.000 | 0.066 | biochemical control | non-AD | 28.000 |
| Sample_208 | Berlin | 83.000 | m | 289.000 | NaN | 1,436.000 | 11,285.000 | 0.127 | biochemical control | non-AD | 24.000 |
| Sample_209 | Berlin | 63.000 | f | 591.000 | NaN | 1,299.000 | 11,232.000 | 0.116 | biochemical control | non-AD | 29.000 |
210 rows × 11 columns
Produce some additional simulated samples#
The simulated NAs of the validation split are used by all models to evaluate training performance.
| Sample ID | protein groups | observed |
|---|---|---|
| Sample_158 | Q9UN70;Q9UN70-2 | 14.630 |
| Sample_050 | Q9Y287 | 15.755 |
| Sample_107 | Q8N475;Q8N475-2 | 15.029 |
| Sample_199 | P06307 | 19.376 |
| Sample_067 | Q5VUB5 | 15.309 |
| ... | ... | ... |
| Sample_111 | F6SYF8;Q9UBP4 | 22.822 |
| Sample_002 | A0A0A0MT36 | 18.165 |
| Sample_049 | Q8WY21;Q8WY21-2;Q8WY21-3;Q8WY21-4 | 15.525 |
| Sample_182 | Q8NFT8 | 14.379 |
| Sample_123 | Q16853;Q16853-2 | 14.504 |
12600 rows × 1 columns
| | observed |
|---|---|
| count | 12,600.000 |
| mean | 16.339 |
| std | 2.741 |
| min | 7.209 |
| 25% | 14.412 |
| 50% | 15.935 |
| 75% | 17.910 |
| max | 30.140 |
Data in wide format#
Autoencoders need data in wide format.
| protein groups | A0A024QZX5;A0A087X1N8;P35237 | A0A024R0T9;K7ER74;P02655 | A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 | A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 | A0A075B6H7 | A0A075B6H9 | A0A075B6I0 | A0A075B6I1 | A0A075B6I6 | A0A075B6I9 | ... | Q9Y653;Q9Y653-2;Q9Y653-3 | Q9Y696 | Q9Y6C2 | Q9Y6N6 | Q9Y6N7;Q9Y6N7-2;Q9Y6N7-4 | Q9Y6R7 | Q9Y6X5 | Q9Y6Y8;Q9Y6Y8-2 | Q9Y6Y9 | S4R3U6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample ID | |||||||||||||||||||||
| Sample_000 | 15.912 | 16.852 | 15.570 | 16.481 | 17.301 | 20.246 | 16.764 | 17.584 | 16.988 | 20.054 | ... | 16.012 | 15.178 | NaN | 15.050 | 16.842 | NaN | NaN | 19.563 | NaN | 12.805 |
| Sample_001 | NaN | 16.874 | 15.519 | 16.387 | NaN | 19.941 | 18.786 | 17.144 | NaN | 19.067 | ... | 15.528 | 15.576 | NaN | 14.833 | 16.597 | 20.299 | 15.556 | 19.386 | 13.970 | 12.442 |
| Sample_002 | 16.111 | NaN | 15.935 | 16.416 | 18.175 | 19.251 | 16.832 | 15.671 | 17.012 | 18.569 | ... | 15.229 | 14.728 | 13.757 | 15.118 | 17.440 | 19.598 | 15.735 | 20.447 | 12.636 | 12.505 |
| Sample_003 | 16.107 | 17.032 | 15.802 | 16.979 | 15.963 | 19.628 | 17.852 | 18.877 | 14.182 | 18.985 | ... | 15.495 | 14.590 | 14.682 | 15.140 | 17.356 | 19.429 | NaN | 20.216 | NaN | 12.445 |
| Sample_004 | 15.603 | 15.331 | 15.375 | 16.679 | NaN | 20.450 | 18.682 | 17.081 | 14.140 | 19.686 | ... | 14.757 | NaN | NaN | 15.256 | 17.075 | 19.582 | 15.328 | NaN | 13.145 | NaN |
5 rows × 1421 columns
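The reshaping from long to wide format can be illustrated with a small standard-library sketch. It is an assumption-laden toy version: with pandas this is typically done by unstacking the feature level of the index, and the function name `to_wide` is hypothetical. The three values are taken from the tables above.

```python
import math

# A few (sample, protein group) -> intensity entries in long format.
long_data = {
    ("Sample_000", "Q9Y6N6"): 15.050,
    ("Sample_000", "S4R3U6"): 12.805,
    ("Sample_004", "Q9Y6N6"): 15.256,
}


def to_wide(long):
    """Pivot long-format entries into {sample: {feature: value}}.

    Combinations that were never observed are filled with NaN.
    """
    samples = sorted({sample for sample, _ in long})
    features = sorted({feature for _, feature in long})
    return {
        sample: {f: long.get((sample, f), math.nan) for f in features}
        for sample in samples
    }


wide = to_wide(long_data)
```

Each sample now has one entry per feature, so every row has the same length and can be passed to a model with a fixed input size.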
Fill Validation data with potentially missing features#
| protein groups | A0A024QZX5;A0A087X1N8;P35237 | A0A024R0T9;K7ER74;P02655 | A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 | A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 | A0A075B6H7 | A0A075B6H9 | A0A075B6I0 | A0A075B6I1 | A0A075B6I6 | A0A075B6I9 | ... | Q9Y653;Q9Y653-2;Q9Y653-3 | Q9Y696 | Q9Y6C2 | Q9Y6N6 | Q9Y6N7;Q9Y6N7-2;Q9Y6N7-4 | Q9Y6R7 | Q9Y6X5 | Q9Y6Y8;Q9Y6Y8-2 | Q9Y6Y9 | S4R3U6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample ID | |||||||||||||||||||||
| Sample_000 | 15.912 | 16.852 | 15.570 | 16.481 | 17.301 | 20.246 | 16.764 | 17.584 | 16.988 | 20.054 | ... | 16.012 | 15.178 | NaN | 15.050 | 16.842 | NaN | NaN | 19.563 | NaN | 12.805 |
| Sample_001 | NaN | 16.874 | 15.519 | 16.387 | NaN | 19.941 | 18.786 | 17.144 | NaN | 19.067 | ... | 15.528 | 15.576 | NaN | 14.833 | 16.597 | 20.299 | 15.556 | 19.386 | 13.970 | 12.442 |
| Sample_002 | 16.111 | NaN | 15.935 | 16.416 | 18.175 | 19.251 | 16.832 | 15.671 | 17.012 | 18.569 | ... | 15.229 | 14.728 | 13.757 | 15.118 | 17.440 | 19.598 | 15.735 | 20.447 | 12.636 | 12.505 |
| Sample_003 | 16.107 | 17.032 | 15.802 | 16.979 | 15.963 | 19.628 | 17.852 | 18.877 | 14.182 | 18.985 | ... | 15.495 | 14.590 | 14.682 | 15.140 | 17.356 | 19.429 | NaN | 20.216 | NaN | 12.445 |
| Sample_004 | 15.603 | 15.331 | 15.375 | 16.679 | NaN | 20.450 | 18.682 | 17.081 | 14.140 | 19.686 | ... | 14.757 | NaN | NaN | 15.256 | 17.075 | 19.582 | 15.328 | NaN | 13.145 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sample_205 | 15.682 | 16.886 | 14.910 | 16.482 | NaN | 17.705 | 17.039 | NaN | 16.413 | 19.102 | ... | NaN | 15.684 | 14.236 | 15.415 | 17.551 | 17.922 | 16.340 | 19.928 | 12.929 | NaN |
| Sample_206 | 15.798 | 17.554 | 15.600 | 15.938 | NaN | 18.154 | 18.152 | 16.503 | 16.860 | 18.538 | ... | 15.422 | 16.106 | NaN | 15.345 | 17.084 | 18.708 | NaN | 19.433 | NaN | NaN |
| Sample_207 | 15.739 | NaN | 15.469 | 16.898 | NaN | 18.636 | 17.950 | 16.321 | 16.401 | 18.849 | ... | 15.808 | 16.098 | 14.403 | 15.715 | NaN | 18.725 | 16.138 | 19.599 | 13.637 | 11.174 |
| Sample_208 | 15.477 | 16.779 | 14.995 | 16.132 | NaN | 14.908 | NaN | NaN | 16.119 | 18.368 | ... | 15.157 | 16.712 | NaN | 14.640 | 16.533 | 19.411 | 15.807 | 19.545 | NaN | NaN |
| Sample_209 | NaN | 17.261 | 15.175 | 16.235 | NaN | 17.893 | 17.744 | 16.371 | 15.780 | 18.806 | ... | 15.237 | 15.652 | 15.211 | 14.205 | 16.749 | 19.275 | 15.732 | 19.577 | 11.042 | 11.791 |
210 rows × 1421 columns
| protein groups | A0A024QZX5;A0A087X1N8;P35237 | A0A024R0T9;K7ER74;P02655 | A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 | A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 | A0A075B6H7 | A0A075B6H9 | A0A075B6I0 | A0A075B6I1 | A0A075B6I6 | A0A075B6I9 | ... | Q9Y653;Q9Y653-2;Q9Y653-3 | Q9Y696 | Q9Y6C2 | Q9Y6N6 | Q9Y6N7;Q9Y6N7-2;Q9Y6N7-4 | Q9Y6R7 | Q9Y6X5 | Q9Y6Y8;Q9Y6Y8-2 | Q9Y6Y9 | S4R3U6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample ID | |||||||||||||||||||||
| Sample_000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 19.863 | NaN | NaN | NaN | NaN |
| Sample_001 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_002 | NaN | 14.523 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_003 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_004 | NaN | NaN | NaN | NaN | 15.473 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 14.048 | NaN | NaN | NaN | NaN | 19.867 | NaN | 12.235 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sample_205 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 11.802 |
| Sample_206 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_207 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_208 | NaN | NaN | NaN | NaN | NaN | NaN | 17.530 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_209 | 15.727 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
210 rows × 1419 columns
| protein groups | A0A024QZX5;A0A087X1N8;P35237 | A0A024R0T9;K7ER74;P02655 | A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 | A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 | A0A075B6H7 | A0A075B6H9 | A0A075B6I0 | A0A075B6I1 | A0A075B6I6 | A0A075B6I9 | ... | Q9Y653;Q9Y653-2;Q9Y653-3 | Q9Y696 | Q9Y6C2 | Q9Y6N6 | Q9Y6N7;Q9Y6N7-2;Q9Y6N7-4 | Q9Y6R7 | Q9Y6X5 | Q9Y6Y8;Q9Y6Y8-2 | Q9Y6Y9 | S4R3U6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample ID | |||||||||||||||||||||
| Sample_000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | 19.863 | NaN | NaN | NaN | NaN |
| Sample_001 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_002 | NaN | 14.523 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_003 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_004 | NaN | NaN | NaN | NaN | 15.473 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 14.048 | NaN | NaN | NaN | NaN | 19.867 | NaN | 12.235 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sample_205 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 11.802 |
| Sample_206 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_207 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_208 | NaN | NaN | NaN | NaN | NaN | NaN | 17.530 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sample_209 | 15.727 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
210 rows × 1421 columns
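The validation table grows from 1419 to 1421 columns, which suggests that it is aligned to the feature set of the training data. A minimal sketch of such an alignment follows; the function name `align_features` and the feature names are hypothetical, and with pandas the usual tool would be `DataFrame.reindex(columns=...)`.

```python
import math


def align_features(rows, all_features):
    """Return rows with exactly one entry per feature in all_features.

    Features absent from a row are added as NaN, so the result has the same
    columns, in the same order, as the reference feature list.
    """
    return {
        sample: {f: values.get(f, math.nan) for f in all_features}
        for sample, values in rows.items()
    }


# Toy example: the validation rows lack 'feat_c', which exists in training.
train_features = ["feat_a", "feat_b", "feat_c"]
val_rows = {"Sample_000": {"feat_a": 15.9, "feat_b": 16.8}}
val_aligned = align_features(val_rows, train_features)
```

Aligning on the training columns keeps the input dimension of the model and the shape of the validation targets identical.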
Denoising Autoencoder#
Analysis: DataLoaders, Model, transform#
Autoencoder(
(encoder): Sequential(
(0): Linear(in_features=1421, out_features=64, bias=True)
(1): Dropout(p=0.2, inplace=False)
(2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): LeakyReLU(negative_slope=0.1)
(4): Linear(in_features=64, out_features=10, bias=True)
)
(decoder): Sequential(
(0): Linear(in_features=10, out_features=64, bias=True)
(1): Dropout(p=0.2, inplace=False)
(2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): LeakyReLU(negative_slope=0.1)
(4): Linear(in_features=64, out_features=1421, bias=True)
)
)
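The size of the printed model can be checked by hand. The sketch below counts trainable parameters for the layer layout shown above, assuming the standard counts for `Linear` (weights plus bias) and `BatchNorm1d` with `affine=True` (scale plus shift); `Dropout` and `LeakyReLU` have none. The function names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """Learnable scale and shift of an affine batch-norm layer."""
    return 2 * n_features


def autoencoder_params(n_features, hidden, latent_dim):
    """Count trainable parameters of the symmetric encoder/decoder layout."""
    total = 0
    # Encoder: input -> hidden layers (Linear + BatchNorm) -> latent
    dims = [n_features, *hidden]
    for n_in, n_out in zip(dims, dims[1:]):
        total += linear_params(n_in, n_out) + batchnorm_params(n_out)
    total += linear_params(hidden[-1], latent_dim)
    # Decoder: latent -> reversed hidden layers -> output
    dims = [latent_dim, *reversed(hidden)]
    for n_in, n_out in zip(dims, dims[1:]):
        total += linear_params(n_in, n_out) + batchnorm_params(n_out)
    total += linear_params(hidden[0], n_features)
    return total


n_params = autoencoder_params(n_features=1421, hidden=[64], latent_dim=10)
print(n_params)  # 184983
```

The result matches the `n_params` entry of 184983 in the final config dump.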
Training#
Start Fit
- before_fit : [TrainEvalCallback, Recorder, ProgressCallback, EarlyStoppingCallback]
Start Epoch Loop
- before_epoch : [Recorder, ProgressCallback]
Start Train
- before_train : [TrainEvalCallback, Recorder, ProgressCallback]
Start Batch Loop
- before_batch : [ModelAdapter, CastToTensor]
- after_pred : [ModelAdapter]
- after_loss : [ModelAdapter]
- before_backward: []
- before_step : []
- after_step : []
- after_cancel_batch: []
- after_batch : [TrainEvalCallback, Recorder, ProgressCallback]
End Batch Loop
End Train
- after_cancel_train: [Recorder]
- after_train : [Recorder, ProgressCallback]
Start Valid
- before_validate: [TrainEvalCallback, Recorder, ProgressCallback]
Start Batch Loop
- **CBs same as train batch**: []
End Batch Loop
End Valid
- after_cancel_validate: [Recorder]
- after_validate : [Recorder, ProgressCallback]
End Epoch Loop
- after_cancel_epoch: []
- after_epoch : [Recorder, EarlyStoppingCallback]
End Fit
- after_cancel_fit: []
- after_fit : [ProgressCallback, EarlyStoppingCallback]
Adding an EarlyStoppingCallback results in an error. A potential fix in PR3509 is not yet in the current version. Try again later.
SuggestedLRs(valley=0.019054606556892395)
dump model config
| epoch | train_loss | valid_loss | time |
|---|---|---|---|
| 0 | 65002.917969 | 4045.061279 | 00:00 |
| 1 | 63314.085938 | 4027.020264 | 00:00 |
| 2 | 61894.058594 | 3962.450195 | 00:00 |
| 3 | 60502.464844 | 3857.962158 | 00:00 |
| 4 | 59093.429688 | 3728.370117 | 00:00 |
| 5 | 57717.972656 | 3588.760742 | 00:00 |
| 6 | 56365.960938 | 3449.551514 | 00:00 |
| 7 | 55021.558594 | 3313.929199 | 00:00 |
| 8 | 53651.035156 | 3197.911377 | 00:00 |
| 9 | 52336.496094 | 3100.087402 | 00:00 |
| 10 | 51114.496094 | 3023.931152 | 00:00 |
| 11 | 49934.375000 | 2969.193115 | 00:00 |
| 12 | 48785.789062 | 2912.542969 | 00:00 |
| 13 | 47762.863281 | 2843.558594 | 00:00 |
| 14 | 46803.207031 | 2795.947021 | 00:00 |
| 15 | 45917.484375 | 2743.076904 | 00:00 |
| 16 | 45098.398438 | 2691.372559 | 00:00 |
| 17 | 44299.406250 | 2646.625488 | 00:00 |
| 18 | 43568.789062 | 2600.220459 | 00:00 |
| 19 | 42841.480469 | 2565.581787 | 00:00 |
| 20 | 42135.148438 | 2533.124756 | 00:00 |
| 21 | 41486.675781 | 2497.457764 | 00:00 |
| 22 | 40869.441406 | 2474.675781 | 00:00 |
| 23 | 40239.757812 | 2468.565674 | 00:00 |
| 24 | 39659.492188 | 2441.065918 | 00:00 |
| 25 | 39089.554688 | 2421.241943 | 00:00 |
| 26 | 38526.484375 | 2407.016602 | 00:00 |
| 27 | 38036.601562 | 2401.210938 | 00:00 |
| 28 | 37519.957031 | 2392.945312 | 00:00 |
| 29 | 37029.628906 | 2362.765137 | 00:00 |
| 30 | 36565.273438 | 2358.863770 | 00:00 |
| 31 | 36136.730469 | 2349.886230 | 00:00 |
| 32 | 35774.527344 | 2339.795898 | 00:00 |
| 33 | 35387.816406 | 2321.933594 | 00:00 |
| 34 | 35013.593750 | 2340.554932 | 00:00 |
| 35 | 34673.410156 | 2334.715576 | 00:00 |
| 36 | 34332.773438 | 2322.361572 | 00:00 |
| 37 | 33997.835938 | 2328.472168 | 00:00 |
| 38 | 33674.050781 | 2314.470703 | 00:00 |
| 39 | 33373.988281 | 2306.414062 | 00:00 |
| 40 | 33093.554688 | 2322.077881 | 00:00 |
| 41 | 32819.289062 | 2304.332520 | 00:00 |
| 42 | 32552.427734 | 2287.223145 | 00:00 |
| 43 | 32271.947266 | 2292.134033 | 00:00 |
| 44 | 32024.626953 | 2292.601318 | 00:00 |
| 45 | 31792.214844 | 2296.230713 | 00:00 |
| 46 | 31554.642578 | 2256.862793 | 00:00 |
| 47 | 31303.300781 | 2248.208008 | 00:00 |
| 48 | 31076.529297 | 2272.790527 | 00:00 |
| 49 | 30902.435547 | 2254.845703 | 00:00 |
| 50 | 30698.267578 | 2278.974609 | 00:00 |
| 51 | 30519.890625 | 2268.012207 | 00:00 |
| 52 | 30347.113281 | 2273.012451 | 00:00 |
| 53 | 30145.714844 | 2286.565674 | 00:00 |
| 54 | 30047.281250 | 2282.713867 | 00:00 |
| 55 | 29882.919922 | 2317.576904 | 00:00 |
| 56 | 29705.130859 | 2353.680664 | 00:00 |
| 57 | 29560.232422 | 2331.623047 | 00:00 |
| 58 | 29459.181641 | 2341.174316 | 00:00 |
| 59 | 29333.689453 | 2338.856689 | 00:00 |
| 60 | 29230.041016 | 2246.905273 | 00:00 |
| 61 | 29103.416016 | 2251.561523 | 00:00 |
| 62 | 28965.677734 | 2247.955566 | 00:00 |
| 63 | 28840.509766 | 2254.363037 | 00:00 |
| 64 | 28742.685547 | 2224.609863 | 00:00 |
| 65 | 28636.369141 | 2257.565186 | 00:00 |
| 66 | 28519.111328 | 2204.981934 | 00:00 |
| 67 | 28413.441406 | 2231.166748 | 00:00 |
| 68 | 28337.906250 | 2231.011719 | 00:00 |
| 69 | 28249.269531 | 2211.728516 | 00:00 |
| 70 | 28113.943359 | 2234.889404 | 00:00 |
| 71 | 28014.195312 | 2209.684082 | 00:00 |
| 72 | 27897.753906 | 2194.407715 | 00:00 |
| 73 | 27802.175781 | 2211.524658 | 00:00 |
| 74 | 27724.882812 | 2246.486816 | 00:00 |
| 75 | 27610.228516 | 2231.846680 | 00:00 |
| 76 | 27554.283203 | 2239.367432 | 00:00 |
| 77 | 27511.433594 | 2214.609375 | 00:00 |
| 78 | 27426.509766 | 2227.199219 | 00:00 |
| 79 | 27318.203125 | 2190.445557 | 00:00 |
| 80 | 27241.490234 | 2215.956055 | 00:00 |
| 81 | 27150.675781 | 2202.796631 | 00:00 |
| 82 | 27109.103516 | 2193.553955 | 00:00 |
| 83 | 27043.857422 | 2225.579346 | 00:00 |
| 84 | 27003.794922 | 2212.295898 | 00:00 |
| 85 | 26934.875000 | 2216.331787 | 00:00 |
| 86 | 26856.279297 | 2193.947998 | 00:00 |
| 87 | 26791.341797 | 2250.476807 | 00:00 |
| 88 | 26731.785156 | 2184.603027 | 00:00 |
| 89 | 26667.652344 | 2228.449219 | 00:00 |
| 90 | 26614.562500 | 2204.157227 | 00:00 |
| 91 | 26588.505859 | 2217.975830 | 00:00 |
| 92 | 26552.238281 | 2175.820312 | 00:00 |
| 93 | 26521.072266 | 2205.719971 | 00:00 |
| 94 | 26491.107422 | 2218.263428 | 00:00 |
| 95 | 26449.472656 | 2201.537109 | 00:00 |
| 96 | 26388.552734 | 2244.622070 | 00:00 |
| 97 | 26336.363281 | 2202.004883 | 00:00 |
| 98 | 26244.095703 | 2209.449707 | 00:00 |
| 99 | 26221.265625 | 2193.099854 | 00:00 |
| 100 | 26195.919922 | 2207.344971 | 00:00 |
| 101 | 26143.548828 | 2212.516602 | 00:00 |
| 102 | 26131.076172 | 2201.515137 | 00:00 |
| 103 | 26140.496094 | 2203.062500 | 00:00 |
| 104 | 26148.599609 | 2199.807373 | 00:00 |
| 105 | 26101.966797 | 2199.721436 | 00:00 |
| 106 | 26085.728516 | 2233.840820 | 00:00 |
| 107 | 26074.435547 | 2185.741455 | 00:00 |
| 108 | 26019.355469 | 2177.695801 | 00:00 |
| 109 | 25994.490234 | 2186.942383 | 00:00 |
| 110 | 25946.757812 | 2209.610352 | 00:00 |
| 111 | 25874.150391 | 2195.255859 | 00:00 |
| 112 | 25878.562500 | 2193.068359 | 00:00 |
| 113 | 25869.705078 | 2208.444092 | 00:00 |
| 114 | 25855.814453 | 2214.193848 | 00:00 |
| 115 | 25799.027344 | 2187.009277 | 00:00 |
| 116 | 25811.468750 | 2203.609131 | 00:00 |
| 117 | 25775.851562 | 2186.638916 | 00:00 |
No improvement since epoch 92: early stopping
Save number of actually trained epochs
118
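The stopping point follows from the patience setting: the best validation loss was reached at epoch 92, and with a patience of 25 training ends after epoch 117, giving 118 trained epochs. The patience rule can be sketched as follows. This is a simplified illustration and not the fastai callback itself; the function name is hypothetical.

```python
def epochs_until_early_stop(valid_losses, patience):
    """Return (best_epoch, n_epochs_trained) under a simple patience rule.

    Training stops once `patience` consecutive epochs have passed without
    a new minimum of the validation loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch + 1
    return best_epoch, len(valid_losses)


# Toy loss curve: improves until epoch 92, then stays flat.
toy_losses = [100.0 - i for i in range(93)] + [50.0] * 30
best_epoch, n_trained = epochs_until_early_stop(toy_losses, patience=25)
```

With this toy curve the function returns epoch 92 as the best epoch and 118 trained epochs, the same pattern as in the log above.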
Loss normalized by total number of measurements#
pimmslearn.plotting - INFO Saved Figures to runs/alzheimer_study/figures/dae_training
Why is the validation loss better than the training loss?
During training, input data is masked and needs to be reconstructed.
When evaluating the model, all input data is provided and only the artificially masked data is used for evaluation.
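Restricting the loss to a subset of entries can be sketched with a masked squared error. The example assumes that missing targets are encoded as NaN and that the loss is summed before being normalized by the number of compared measurements; both are assumptions for illustration, and the function name is hypothetical.

```python
import math


def masked_squared_error(pred, target):
    """Sum squared errors over entries whose target is observed (not NaN).

    Returns the summed error and the number of entries that contributed,
    so the caller can normalize by the number of measurements.
    """
    total, n = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if not math.isnan(t):
                total += (p - t) ** 2
                n += 1
    return total, n


pred = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.5, math.nan], [math.nan, 3.0]]  # only two entries are evaluated
total, n = masked_squared_error(pred, target)
loss_per_measurement = total / n
```

Because the training and validation losses are computed over different numbers of entries, they are only comparable after dividing by `n`.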
Predictions#
The data used to create predictions for the training and validation datasets is the same as the training data.
Predictions include missing values (which are not compared further).
[ ] double check ModelAdapter
Create predictions and select those for the validation data
Sample ID protein groups
Sample_000 A0A024QZX5;A0A087X1N8;P35237 15.919
A0A024R0T9;K7ER74;P02655 16.720
A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8 15.824
A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503 16.792
A0A075B6H7 16.800
...
Sample_209 Q9Y6R7 19.188
Q9Y6X5 15.792
Q9Y6Y8;Q9Y6Y8-2 19.422
Q9Y6Y9 10.999
S4R3U6 11.322
Length: 298410, dtype: float32
| Sample ID | protein groups | observed | DAE |
|---|---|---|---|
| Sample_158 | Q9UN70;Q9UN70-2 | 14.630 | 15.599 |
| Sample_050 | Q9Y287 | 15.755 | 16.924 |
| Sample_107 | Q8N475;Q8N475-2 | 15.029 | 14.335 |
| Sample_199 | P06307 | 19.376 | 19.050 |
| Sample_067 | Q5VUB5 | 15.309 | 14.976 |
| ... | ... | ... | ... |
| Sample_111 | F6SYF8;Q9UBP4 | 22.822 | 22.964 |
| Sample_002 | A0A0A0MT36 | 18.165 | 15.849 |
| Sample_049 | Q8WY21;Q8WY21-2;Q8WY21-3;Q8WY21-4 | 15.525 | 15.764 |
| Sample_182 | Q8NFT8 | 14.379 | 13.944 |
| Sample_123 | Q16853;Q16853-2 | 14.504 | 14.422 |
12600 rows × 2 columns
| Sample ID | protein groups | observed | DAE |
|---|---|---|---|
| Sample_000 | A0A075B6P5;P01615 | 17.016 | 17.028 |
| | A0A087X089;Q16627;Q16627-2 | 18.280 | 17.984 |
| | A0A0B4J2B5;S4R460 | 21.735 | 22.286 |
| | A0A140T971;O95865;Q5SRR8;Q5SSV3 | 14.603 | 15.247 |
| | A0A140TA33;A0A140TA41;A0A140TA52;P22105;P22105-3;P22105-4 | 16.143 | 16.779 |
| ... | ... | ... | ... |
| Sample_209 | Q96ID5 | 16.074 | 16.021 |
| | Q9H492;Q9H492-2 | 13.173 | 13.360 |
| | Q9HC57 | 14.207 | 13.733 |
| | Q9NPH3;Q9NPH3-2;Q9NPH3-5 | 14.962 | 15.218 |
| | Q9UGM5;Q9UGM5-2 | 16.871 | 16.372 |
12600 rows × 2 columns
save missing values predictions
Sample ID protein groups
Sample_000 A0A075B6J9 15.429
A0A075B6Q5 15.763
A0A075B6R2 16.642
A0A075B6S5 15.938
A0A087WSY4 16.387
...
Sample_209 Q9P1W8;Q9P1W8-2;Q9P1W8-4 15.959
Q9UI40;Q9UI40-2 15.769
Q9UIW2 16.566
Q9UMX0;Q9UMX0-2;Q9UMX0-4 13.352
Q9UP79 15.988
Name: intensity, Length: 46401, dtype: float32
Plots#
validation data
| Sample ID | latent dimension 1 | latent dimension 2 | latent dimension 3 | latent dimension 4 | latent dimension 5 | latent dimension 6 | latent dimension 7 | latent dimension 8 | latent dimension 9 | latent dimension 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Sample_000 | -0.905 | 4.043 | 2.623 | 3.535 | -0.234 | -1.163 | -3.534 | 3.812 | -0.209 | -0.926 |
| Sample_001 | 1.529 | 3.252 | 3.897 | 3.636 | -1.195 | -0.321 | -2.300 | 0.298 | 0.488 | -1.193 |
| Sample_002 | 4.432 | 3.462 | 2.947 | 3.272 | 2.463 | 1.479 | 1.972 | 5.474 | 0.124 | -1.511 |
| Sample_003 | -0.497 | 3.897 | 3.246 | 3.344 | 1.239 | 0.871 | -3.199 | 4.417 | -2.113 | -1.333 |
| Sample_004 | -1.036 | 5.066 | 2.683 | 4.227 | 1.390 | -1.184 | -1.148 | 1.814 | -0.879 | 0.181 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sample_205 | 2.678 | 4.119 | -1.004 | -2.038 | 3.736 | 1.291 | -0.517 | 1.485 | 2.105 | 1.853 |
| Sample_206 | 6.814 | 2.888 | -1.416 | -1.369 | -0.297 | -3.705 | -2.870 | -0.342 | -0.050 | -2.648 |
| Sample_207 | -0.156 | 6.658 | 0.263 | -2.422 | 3.522 | -6.708 | -0.792 | -0.219 | 2.916 | -0.972 |
| Sample_208 | 2.361 | 5.283 | -0.481 | -1.349 | 0.385 | -0.181 | -0.830 | -3.661 | 1.226 | -2.874 |
| Sample_209 | 3.217 | 4.365 | 2.953 | -2.907 | 3.043 | 0.881 | 2.342 | -0.953 | -3.196 | -6.760 |
210 rows × 10 columns
Comparisons#
Simulated NAs: artificially created NAs. Some data was sampled and explicitly set to missing before it was fed to the model for reconstruction.
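Creating simulated NAs amounts to holding out a random subset of observed values. The sketch below shows the idea on long-format data. The sampling fraction, the seed, and the function name `simulate_missing` are hypothetical choices for illustration and are not taken from the workflow.

```python
import random


def simulate_missing(observed, frac, seed=0):
    """Split observed entries into (kept, held_out).

    `held_out` plays the role of the simulated NAs: these values are hidden
    from the model and later used as ground truth for evaluation.
    """
    rng = random.Random(seed)
    keys = sorted(observed)  # fixed order makes the sampling reproducible
    n_mask = round(len(keys) * frac)
    masked = set(rng.sample(keys, n_mask))
    kept = {k: v for k, v in observed.items() if k not in masked}
    held_out = {k: observed[k] for k in masked}
    return kept, held_out


# Toy data: 10 samples x 10 features, all observed.
observed = {
    (f"Sample_{i:03d}", f"feat_{j}"): float(i + j)
    for i in range(10)
    for j in range(10)
}
kept, held_out = simulate_missing(observed, frac=0.1)
```

Since the held-out values were actually measured, the model's reconstruction can be compared against them directly.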
Validation data#
all measured (identified, observed) peptides in validation data
The simulated NAs for the validation step are real test data (used neither for training nor for early stopping)
Selected as truth to compare to: observed
{'DAE': {'MSE': 0.4628965604918746,
'MAE': 0.43483768924660204,
'N': 12600,
'prop': 1.0}}
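The reported metrics can be reproduced from aligned observed and predicted values. In the sketch below, `prop` is assumed to be the share of observed values for which a prediction exists; that reading and the function name `evaluate` are assumptions, not taken from the pimmslearn source.

```python
def evaluate(observed, predicted):
    """Compute MSE, MAE, N and prop for predictions matched by key."""
    shared = [key for key in observed if key in predicted]
    errors = [predicted[key] - observed[key] for key in shared]
    n = len(errors)
    return {
        "MSE": sum(e * e for e in errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "N": n,
        "prop": n / len(observed),  # assumed meaning: share with a prediction
    }


# Toy values; keys stand for (sample, protein group) pairs.
observed = {"a": 1.0, "b": 2.0, "c": 4.0, "d": 3.0}
predicted = {"a": 2.0, "b": 2.0, "c": 3.0, "d": 5.0}
metrics = evaluate(observed, predicted)
```

A `prop` of 1.0, as in the output above, would then mean that every simulated NA received a prediction.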
Test Datasplit#
Selected as truth to compare to: observed
{'DAE': {'MSE': 0.4756158306377139,
'MAE': 0.43664295761063754,
'N': 12600,
'prop': 1.0}}
Save all metrics as json
{ 'test_simulated_na': { 'DAE': { 'MAE': 0.43664295761063754,
'MSE': 0.4756158306377139,
'N': 12600,
'prop': 1.0}},
'valid_simulated_na': { 'DAE': { 'MAE': 0.43483768924660204,
'MSE': 0.4628965604918746,
'N': 12600,
'prop': 1.0}}}
| model | metric_name | valid_simulated_na | test_simulated_na |
|---|---|---|---|
| DAE | MSE | 0.463 | 0.476 |
| | MAE | 0.435 | 0.437 |
| | N | 12,600.000 | 12,600.000 |
| | prop | 1.000 | 1.000 |
Save predictions#
Config#
{}
{'M': 1421,
'batch_size': 64,
'cuda': False,
'data': Path('runs/alzheimer_study/data'),
'epoch_trained': 118,
'epochs_max': 300,
'file_format': 'csv',
'fn_rawfile_metadata': 'https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/meta.csv',
'folder_data': '',
'folder_experiment': Path('runs/alzheimer_study'),
'hidden_layers': [64],
'latent_dim': 10,
'meta_cat_col': None,
'meta_date_col': None,
'model': 'DAE',
'model_key': 'DAE',
'n_params': 184983,
'out_figures': Path('runs/alzheimer_study/figures'),
'out_folder': Path('runs/alzheimer_study'),
'out_metrics': Path('runs/alzheimer_study'),
'out_models': Path('runs/alzheimer_study'),
'out_preds': Path('runs/alzheimer_study/preds'),
'patience': 25,
'sample_idx_position': 0,
'save_pred_real_na': True}