Compare predictions between model and RSN#

Inspect how imputation differs for diverging cases, i.e. features for which the decision varies depending on the method.
The top 5 are dumped.

Parameters#

folder_experiment = 'runs/appl_ald_data/plasma/proteinGroups'
fn_clinical_data = "data/ALD_study/processed/ald_metadata_cli.csv"
make_plots = True  # create histograms and swarmplots of diverging results
model_key = 'VAE'
sample_id_col = 'Sample ID'
target = 'kleiner'
cutoff_target: int = 2  # => for binarization target >= cutoff_target
out_folder = 'diff_analysis'
file_format = 'csv'
baseline = 'RSN'  # default is RSN, but could be any other trained model
template_pred = 'pred_real_na_{}.csv'  # fixed, do not change
ref_method_score = None  # filepath to reference method score

# Parameters
cutoff_target = 0.5
make_plots = False
ref_method_score = None
folder_experiment = "runs/alzheimer_study"
target = "AD"
baseline = "PI"
out_folder = "diff_analysis"
fn_clinical_data = "runs/alzheimer_study/data/clinical_data.csv"
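The `cutoff_target` parameter is described above as binarizing the target via `target >= cutoff_target`. A minimal sketch of that step, assuming the target is a numeric pandas Series; the sample labels and values are invented for illustration and are not taken from the study data:

```python
import pandas as pd

cutoff_target = 0.5  # injected parameter for this run

# Hypothetical target column: 0 = control, 1 = AD (values invented for illustration)
target = pd.Series([0, 1, 1, 0], index=["s1", "s2", "s3", "s4"], name="AD")

# Binarize: True where the target reaches the cutoff, False otherwise
target_binary = target >= cutoff_target
print(target_binary.value_counts())
```

With a 0/1 target, a cutoff of 0.5 simply maps 1 to `True` and 0 to `False`.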

root - INFO     Removed from global namespace: folder_experiment
root - INFO     Removed from global namespace: fn_clinical_data
root - INFO     Removed from global namespace: make_plots
root - INFO     Removed from global namespace: model_key
root - INFO     Removed from global namespace: sample_id_col
root - INFO     Removed from global namespace: target
root - INFO     Removed from global namespace: cutoff_target
root - INFO     Removed from global namespace: out_folder
root - INFO     Removed from global namespace: file_format
root - INFO     Removed from global namespace: baseline
root - INFO     Removed from global namespace: template_pred
root - INFO     Removed from global namespace: ref_method_score
root - INFO     Already set attribute: folder_experiment has value runs/alzheimer_study
root - INFO     Already set attribute: out_folder has value diff_analysis

{'baseline': 'PI',
 'cutoff_target': 0.5,
 'data': PosixPath('runs/alzheimer_study/data'),
 'file_format': 'csv',
 'fn_clinical_data': 'runs/alzheimer_study/data/clinical_data.csv',
 'folder_experiment': PosixPath('runs/alzheimer_study'),
 'folder_scores': PosixPath('runs/alzheimer_study/diff_analysis/AD/scores'),
 'make_plots': False,
 'model_key': 'VAE',
 'out_figures': PosixPath('runs/alzheimer_study/figures'),
 'out_folder': PosixPath('runs/alzheimer_study/diff_analysis/AD'),
 'out_metrics': PosixPath('runs/alzheimer_study'),
 'out_models': PosixPath('runs/alzheimer_study'),
 'out_preds': PosixPath('runs/alzheimer_study/preds'),
 'ref_method_score': None,
 'sample_id_col': 'Sample ID',
 'target': 'AD',
 'template_pred': 'pred_real_na_{}.csv'}
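The derived paths in the configuration above appear to follow from the parameters. The following sketch reproduces three of them with `pathlib`; how the notebook builds them internally is an assumption, as is the idea that the `template_pred` placeholder is filled with a model key:

```python
from pathlib import Path

folder_experiment = Path("runs/alzheimer_study")
out_folder_name = "diff_analysis"  # the `out_folder` parameter
target = "AD"
model_key = "VAE"
template_pred = "pred_real_na_{}.csv"

# Output folder is nested by target, scores live in a subfolder
out_folder = folder_experiment / out_folder_name / target
folder_scores = out_folder / "scores"

# Assumption: the placeholder in the template is replaced by the model key
fname_pred = folder_experiment / "preds" / template_pred.format(model_key)

print(out_folder.as_posix())
print(folder_scores.as_posix())
print(fname_pred.as_posix())
```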

Write outputs to Excel

root - INFO     Writing to excel file: runs/alzheimer_study/diff_analysis/AD/diff_analysis_compare_DA.xlsx

Load scores#

List dump of scores:

[PosixPath('runs/alzheimer_study/diff_analysis/AD/scores/diff_analysis_scores_VAE.pkl'),
 PosixPath('runs/alzheimer_study/diff_analysis/AD/scores/diff_analysis_scores_RF.pkl'),
 PosixPath('runs/alzheimer_study/diff_analysis/AD/scores/diff_analysis_scores_DAE.pkl'),
 PosixPath('runs/alzheimer_study/diff_analysis/AD/scores/diff_analysis_scores_QRILC.pkl'),
 PosixPath('runs/alzheimer_study/diff_analysis/AD/scores/diff_analysis_scores_TRKNN.pkl'),
 PosixPath('runs/alzheimer_study/diff_analysis/AD/scores/diff_analysis_scores_PI.pkl'),
 PosixPath('runs/alzheimer_study/diff_analysis/AD/scores/diff_analysis_scores_Median.pkl'),
 PosixPath('runs/alzheimer_study/diff_analysis/AD/scores/diff_analysis_scores_CF.pkl'),
 PosixPath('runs/alzheimer_study/diff_analysis/AD/scores/diff_analysis_scores_None.pkl')]
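Each dump name ends in a method key. As a sketch, the keys can be recovered from the file names as shown here; the `prefix` variable is a name introduced for illustration, and treating `None` as the key for the analysis without imputation is an assumption:

```python
from pathlib import PurePosixPath

folder_scores = PurePosixPath("runs/alzheimer_study/diff_analysis/AD/scores")
prefix = "diff_analysis_scores_"  # hypothetical helper name

# A few of the dumps listed above
files = [folder_scores / f"{prefix}{key}.pkl" for key in ("VAE", "RF", "PI", "None")]

# Strip the common prefix from the file stem to get the method key
model_keys = [f.stem.removeprefix(prefix) for f in files]
print(model_keys)
```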

Load scores from dumps:

	model	VAE								RF		...	CF		None
	var	SS	DF	F	p-unc	np2	-Log10 pvalue	qvalue	rejected	SS	DF	...	qvalue	rejected	SS	DF	F	p-unc	np2	-Log10 pvalue	qvalue	rejected
protein groups	Source
A0A024QZX5;A0A087X1N8;P35237	AD	1.049	1	7.630	0.006	0.038	2.201	0.018	True	0.950	1	...	0.023	True	0.834	1.000	6.088	0.015	0.033	1.837	0.043	True
	age	0.007	1	0.052	0.821	0.000	0.086	0.881	False	0.001	1	...	0.851	False	0.002	1.000	0.015	0.903	0.000	0.044	0.943	False
	Kiel	0.271	1	1.972	0.162	0.010	0.791	0.266	False	0.182	1	...	0.263	False	0.145	1.000	1.061	0.304	0.006	0.517	0.461	False
	Magdeburg	0.462	1	3.362	0.068	0.017	1.166	0.133	False	0.363	1	...	0.142	False	0.273	1.000	1.996	0.159	0.011	0.797	0.286	False
	Sweden	1.666	1	12.119	0.001	0.060	3.209	0.002	True	1.466	1	...	0.002	True	1.209	1.000	8.827	0.003	0.047	2.472	0.013	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
S4R3U6	AD	1.869	1	3.758	0.054	0.019	1.267	0.109	False	1.247	1	...	0.098	False	0.095	1.000	0.151	0.698	0.001	0.156	0.803	False
	age	0.578	1	1.163	0.282	0.006	0.549	0.411	False	0.661	1	...	0.393	False	1.370	1.000	2.171	0.143	0.018	0.844	0.265	False
	Kiel	2.571	1	5.167	0.024	0.026	1.617	0.056	False	1.975	1	...	0.040	True	1.396	1.000	2.213	0.139	0.018	0.856	0.259	False
	Magdeburg	2.381	1	4.786	0.030	0.024	1.524	0.067	False	1.571	1	...	0.025	True	0.556	1.000	0.882	0.350	0.007	0.456	0.507	False
	Sweden	17.034	1	34.241	0.000	0.152	7.681	0.000	True	12.311	1	...	0.000	True	8.519	1.000	13.502	0.000	0.101	3.447	0.002	True

7105 rows × 72 columns

If a reference dump is provided, add it to the scores

Load frequencies of observed features#

	data
	frequency
protein groups
A0A024QZX5;A0A087X1N8;P35237	186
A0A024R0T9;K7ER74;P02655	195
A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8	174
A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503	196
A0A075B6H7	91
...	...
Q9Y6R7	197
Q9Y6X5	173
Q9Y6Y8;Q9Y6Y8-2	197
Q9Y6Y9	119
S4R3U6	126

1421 rows × 1 columns

Assemble qvalues#

			VAE	RF	DAE	QRILC	TRKNN	PI	Median	CF	None
			qvalue	qvalue	qvalue	qvalue	qvalue	qvalue	qvalue	qvalue	qvalue
protein groups	Source	frequency
A0A024QZX5;A0A087X1N8;P35237	AD	186	0.018	0.022	0.017	0.071	0.023	0.728	0.039	0.023	0.043
A0A024R0T9;K7ER74;P02655	AD	195	0.070	0.074	0.076	0.076	0.071	0.139	0.087	0.080	0.092
A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8	AD	174	0.501	0.618	0.319	0.321	0.394	0.230	0.832	0.874	0.586
A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503	AD	196	0.382	0.394	0.372	0.466	0.396	0.672	0.418	0.391	0.404
A0A075B6H7	AD	91	0.019	0.009	0.043	0.214	0.048	0.223	0.124	0.001	0.027
...	...	...	...	...	...	...	...	...	...	...	...
Q9Y6R7	AD	197	0.284	0.291	0.282	0.302	0.289	0.315	0.315	0.285	0.307
Q9Y6X5	AD	173	0.328	0.297	0.369	0.159	0.205	0.121	0.455	0.214	0.501
Q9Y6Y8;Q9Y6Y8-2	AD	197	0.156	0.162	0.156	0.171	0.160	0.181	0.178	0.159	0.174
Q9Y6Y9	AD	119	0.624	0.538	0.855	0.635	0.472	0.464	0.667	0.901	0.651
S4R3U6	AD	126	0.109	0.182	0.159	0.637	0.080	0.680	0.829	0.098	0.803

1421 rows × 9 columns
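The scores table has two column levels (model, variable), and the q-value table keeps one `qvalue` column per model. One way to make such a selection is `DataFrame.xs` on the second column level. This is a sketch, not necessarily the call used in the notebook; the values are copied from the first two `AD` rows shown above:

```python
import pandas as pd

index = pd.Index(
    ["A0A024QZX5;A0A087X1N8;P35237", "A0A024R0T9;K7ER74;P02655"],
    name="protein groups",
)
# Tuple keys create two-level columns: (model, variable)
scores = pd.DataFrame(
    {
        ("VAE", "p-unc"): [0.006, 0.031],
        ("VAE", "qvalue"): [0.018, 0.070],
        ("PI", "p-unc"): [0.590, 0.059],
        ("PI", "qvalue"): [0.728, 0.139],
    },
    index=index,
)
scores.columns.names = ["model", "var"]

# Keep both column levels so that (model, 'qvalue') lookups still work
qvalues = scores.xs("qvalue", axis=1, level="var", drop_level=False)
print(qvalues)
```

Keeping both levels (`drop_level=False`) matches lookups of the form `qvalues.loc[idx, (method, 'qvalue')]`.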

Assemble pvalues#

			VAE	RF	DAE	QRILC	TRKNN	PI	Median	CF	None
			p-unc	p-unc	p-unc	p-unc	p-unc	p-unc	p-unc	p-unc	p-unc
protein groups	Source	frequency
A0A024QZX5;A0A087X1N8;P35237	AD	186	0.006	0.008	0.006	0.028	0.008	0.590	0.012	0.008	0.015
A0A024R0T9;K7ER74;P02655	AD	195	0.031	0.032	0.035	0.030	0.031	0.059	0.033	0.036	0.037
A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8	AD	174	0.370	0.485	0.204	0.190	0.264	0.114	0.736	0.809	0.432
A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503	AD	196	0.258	0.260	0.250	0.314	0.266	0.524	0.259	0.263	0.254
A0A075B6H7	AD	91	0.007	0.003	0.017	0.112	0.020	0.109	0.053	0.000	0.008
...	...	...	...	...	...	...	...	...	...	...	...
Q9Y6R7	AD	197	0.175	0.175	0.175	0.175	0.175	0.175	0.175	0.175	0.175
Q9Y6X5	AD	173	0.211	0.180	0.247	0.076	0.113	0.050	0.291	0.121	0.344
Q9Y6Y8;Q9Y6Y8-2	AD	197	0.083	0.083	0.083	0.083	0.083	0.083	0.083	0.083	0.083
Q9Y6Y9	AD	119	0.503	0.401	0.783	0.495	0.334	0.302	0.520	0.847	0.505
S4R3U6	AD	126	0.054	0.096	0.086	0.498	0.036	0.533	0.730	0.046	0.698

1421 rows × 9 columns

Assemble rejected features#

	VAE	RF	DAE	QRILC	TRKNN	PI	Median	CF	None
False	935	966	926	996	936	1,027	1,069	950	1,054
True	486	455	495	425	485	394	352	471	367

Tabulate rejected decisions by method:#

	VAE	RF	DAE	QRILC	TRKNN	PI	Median	CF	None
False	935	966	926	996	936	1,027	1,069	950	1,054
True	486	455	495	425	485	394	352	471	367
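A table of this shape can be built by counting `True` and `False` per method column. A minimal sketch, assuming a boolean DataFrame of decisions; the five rows are the ones displayed in the `rejected` table of this notebook, restricted to three methods:

```python
import pandas as pd

# Decisions for the five leading protein groups shown in the notebook (three methods only)
rejected = pd.DataFrame(
    {
        "VAE": [True, False, False, False, True],
        "PI": [False, False, False, False, False],
        "Median": [True, False, False, False, False],
    }
)

n_true = rejected.sum()      # number of rejected null hypotheses per method
n_false = (~rejected).sum()  # number of non-rejected ones per method
counts = pd.DataFrame([n_false, n_true], index=[False, True])
print(counts)
```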

Tabulate rejected decisions by method for newly included features (if available)#

	VAE	RF	DAE	QRILC	TRKNN	PI	Median	CF	None

Tabulate rejected decisions by method for all features#

root - INFO     Written to sheet 'equality_rejected_all' in excel file.

			VAE	RF	DAE	QRILC	TRKNN	PI	Median	CF	None
			rejected	rejected	rejected	rejected	rejected	rejected	rejected	rejected	rejected
protein groups	Source	frequency
A0A024QZX5;A0A087X1N8;P35237	AD	186	True	True	True	False	True	False	True	True	True
A0A024R0T9;K7ER74;P02655	AD	195	False	False	False	False	False	False	False	False	False
A0A024R3W6;A0A024R412;O60462;O60462-2;O60462-3;O60462-4;O60462-5;Q7LBX6;X5D2Q8	AD	174	False	False	False	False	False	False	False	False	False
A0A024R644;A0A0A0MRU5;A0A1B0GWI2;O75503	AD	196	False	False	False	False	False	False	False	False	False
A0A075B6H7	AD	91	True	True	True	False	True	False	False	True	True
...	...	...	...	...	...	...	...	...	...	...	...
Q9Y6R7	AD	197	False	False	False	False	False	False	False	False	False
Q9Y6X5	AD	173	False	False	False	False	False	False	False	False	False
Q9Y6Y8;Q9Y6Y8-2	AD	197	False	False	False	False	False	False	False	False	False
Q9Y6Y9	AD	119	False	False	False	False	False	False	False	False	False
S4R3U6	AD	126	False	False	False	False	False	False	False	False	False

1421 rows × 9 columns

Tabulate the number of features with the same decision across all methods (True) versus those whose decision varies depending on the method (False)

True    1,091
False     330
Name: count, dtype: int64
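Whether all methods agree on a feature can be checked row-wise. The sketch below uses `nunique` across the method columns, which is one possible way to do it and not necessarily the notebook's own; the three rows are taken from the `rejected` table above, restricted to three methods:

```python
import pandas as pd

rejected = pd.DataFrame(
    {
        "VAE": [True, False, True],
        "QRILC": [False, False, False],
        "Median": [True, False, False],
    },
    index=pd.Index(
        ["A0A024QZX5;A0A087X1N8;P35237", "A0A024R0T9;K7ER74;P02655", "A0A075B6H7"],
        name="protein groups",
    ),
)

# True if every method reached the same decision for that protein group
all_equal = rejected.nunique(axis=1) == 1
print(all_equal.value_counts())

# Keep only the protein groups with diverging decisions
diverging = rejected.loc[~all_equal]
print(diverging.index.to_list())
```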

List frequency of features with varying decisions

		frequency
protein groups	Source
A0A024QZX5;A0A087X1N8;P35237	AD	186
A0A075B6H7	AD	91
A0A075B6H9	AD	189
A0A075B6J9	AD	156
A0A075B6Q5	AD	104
...	...	...
Q9UP79	AD	135
Q9UPU3	AD	163
Q9UQ52	AD	188
Q9Y281;Q9Y281-3	AD	51
Q9Y6C2	AD	119

330 rows × 1 columns

Take only those with different decisions

No new features or no new ones (with diverging decisions.)

Plots for inspecting imputations (for diverging decisions)#

root - WARNING  Not plots requested.
/home/runner/work/pimms/pimms/project/.snakemake/conda/924ec7e362d761ecf0807b9074d79999_/lib/python3.12/site-packages/IPython/core/interactiveshell.py:3707: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)

An exception has occurred, use %tb to see the full traceback.

SystemExit: 0

Load target#

Measurements#

Plot all new protein groups (pgs) that are significant at least once and have not been dumped already.

RSN predictions are based on the mean and standard deviation of all samples (N=455), as in the original study.
The VAE was also trained on all samples (self-supervised). One could restrict the selected data to the samples with a valid target marker, but this was not done in the original study, which considered several different target markers.

RSN: shifted per sample, not per feature!
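To illustrate what a per-sample shift means, here is a minimal sketch of imputation from a down-shifted normal distribution, computed row-wise (one row per sample). The function name, the `shift` and `scale` values, and the toy data are assumptions for illustration and are not taken from the study code:

```python
import numpy as np
import pandas as pd


def impute_shifted_normal(df, shift=1.8, scale=0.3, seed=42):
    """Fill NaNs per sample (row) with draws from N(mean - shift*std, (scale*std)^2).

    `shift` and `scale` are assumed values, not confirmed by the notebook.
    """
    rng = np.random.default_rng(seed)
    imputed = df.copy()
    for sample, row in df.iterrows():
        missing = row.isna().to_numpy()
        if not missing.any():
            continue
        # statistics of the observed intensities of this one sample
        mean, std = row.mean(), row.std()
        draws = rng.normal(loc=mean - shift * std, scale=scale * std,
                           size=int(missing.sum()))
        imputed.loc[sample, row.index[missing]] = draws
    return imputed


# Toy intensities: samples in rows, protein groups in columns
data = pd.DataFrame(
    [[25.0, 27.0, np.nan, 29.0],
     [24.0, np.nan, 26.0, 28.0],
     [26.0, 28.0, 30.0, 27.0]],
    index=pd.Index(["s1", "s2", "s3"], name="Sample ID"),
    columns=["PG_A", "PG_B", "PG_C", "PG_D"],
)
imputed = impute_shifted_normal(data)
print(imputed.round(2))
```

A per-feature variant would iterate over columns instead, so each protein group would get its own mean and standard deviation.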

Load all prediction files and reshape

Once imputed, reduce to target samples only (samples with a target score)
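Restricting predictions to the samples that have a target value can be sketched as follows, assuming predictions are indexed by (`Sample ID`, `protein groups`) as the plotting code further down suggests; the sample labels, protein group names and values are made up:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["s1", "s2", "s3"], ["PG_A", "PG_B"]],
    names=["Sample ID", "protein groups"],
)
# Hypothetical imputed values for one method
pred = pd.DataFrame({"VAE": [20.1, 21.3, 19.8, 22.0, 20.5, 21.1]}, index=index)

# Hypothetical target: only s1 and s3 have an annotation
target = pd.Series([True, False],
                   index=pd.Index(["s1", "s3"], name="Sample ID"), name="AD")

# Keep only predictions of samples that have a target value
in_target = pred.index.get_level_values("Sample ID").isin(target.index)
pred_target = pred.loc[in_target]
print(pred_target)
```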

Compare with target annotation#

# labels somehow?
# target.replace({True: f' >={args.cutoff_target}', False: f'<{args.cutoff_target}'})

for i, idx in enumerate(feat_sel):
    print(f"Swarmplot {i:<3}: {idx}:")
    fig, ax = plt.subplots()

    # dummy plots, just to get the Path objects
    tmp_dot = ax.scatter([1, 2], [3, 4], marker='X')
    new_mk, = tmp_dot.get_paths()
    tmp_dot.remove()

    feat_observed = data[idx].dropna()

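    def get_centered_label(method, n, q):
        """Two-line label: method name above its stats, padded to equal width."""
        model_str = f'{method}'
        stats_str = f'(N={n:,d}, q={q:.3f})'
        # center the shorter string within the width of the longer one
        if len(model_str) > len(stats_str):
            stats_str = f"{stats_str:^{len(model_str)}}"
        else:
            model_str = f"{model_str:^{len(stats_str)}}"
        return f'{model_str}\n{stats_str}'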

    key = get_centered_label(method='observed',
                             n=len(feat_observed),
                             q=float(qvalues.loc[idx, ('None', 'qvalue')])
                             )
    to_plot = {key: feat_observed}
    for method in model_keys:
        try:
            pred = pred_real_na.loc[pd.IndexSlice[:,
                                                  idx], method].dropna().droplevel(-1)
            if len(pred) == 0:
                # no value was imputed -> the q-value is the one based on measured values only
                key = get_centered_label(method=method,
                                         n=len(pred),
                                         q=float(qvalues.loc[idx, ('None', 'qvalue')]
                                                 ))
            elif qvalues.loc[idx, (method, 'qvalue')].notna().all():
                key = get_centered_label(method=method,
                                         n=len(pred),
                                         q=float(qvalues.loc[idx, (method, 'qvalue')]
                                                 ))
            elif qvalues.loc[idx, (method, 'qvalue')].isna().all():
                logger.info(f"NA qvalues for {idx}: {method}")
                continue
            else:
                raise ValueError("Unknown case.")
            to_plot[key] = pred
        except KeyError:
            print(f"No missing values for {idx}: {method}")
            continue

    to_plot = pd.DataFrame.from_dict(to_plot)
    to_plot.columns.name = 'group'
    groups_order = to_plot.columns.to_list()
    to_plot = to_plot.stack().to_frame('intensity').reset_index(-1)
    to_plot = to_plot.join(target.astype('category'), how='inner')
    to_plot = to_plot.astype({'group': 'category'})

    ax = seaborn.swarmplot(data=to_plot,
                           x='group',
                           y='intensity',
                           order=groups_order,
                           dodge=True,
                           hue=args.target,
                           size=2,
                           ax=ax)
    first_pg = idx.split(";")[0]
    ax.set_title(
        f'Imputation for protein group {first_pg} with target {target_name} (N= {len(data):,d} samples)')

    _ = ax.set_ylim(min_y_int, max_y_int)
    _ = ax.locator_params(axis='y', integer=True)
    _ = ax.set_xlabel('')
    _xticks = ax.get_xticks()
    ax.xaxis.set_major_locator(
        matplotlib.ticker.FixedLocator(_xticks)
    )
    _ = ax.set_xticklabels(ax.get_xticklabels(), rotation=45,
                           horizontalalignment='right')

    N_hues = len(pd.unique(to_plot[args.target]))

    _ = ax.collections[0].set_paths([new_mk])
    _ = ax.collections[1].set_paths([new_mk])

    label_target_0, label_target_1 = ax.collections[-2].get_label(), ax.collections[-1].get_label()
    _ = ax.collections[-2].set_label(f'imputed, {label_target_0}')
    _ = ax.collections[-1].set_label(f'imputed, {label_target_1}')
    _obs_label0 = ax.scatter([], [], color='C0', marker='X', label=f'observed, {label_target_0}')
    _obs_label1 = ax.scatter([], [], color='C1', marker='X', label=f'observed, {label_target_1}')
    _ = ax.legend(
        handles=[_obs_label0, _obs_label1, *ax.collections[-4:-2]],
        fontsize=5, title_fontsize=5, markerscale=0.4,)
    fname = (folder /
             f'{first_pg}_swarmplot.pdf')
    files_out[fname.name] = fname.as_posix()
    pimmslearn.savefig(
        fig,
        name=fname)
    plt.close()

Saved files:
