Public interfaces

This module provides functions to read, explore and preprocess an input dataframe containing a set of features. Specifically, it allows you to:

- read data from a file and build a dataframe;
- explore data info and make some plots;
- add/remove features from the dataframe;
- split data into case and control groups;
- normalize and harmonize data.

It may also be used as a standalone program to explore the dataset.

preprocess.add_WhiteVol_feature(dataframe)[source]

Adds a column with the brain’s total white matter volume.

Parameters: dataframe (pandas DataFrame) – Dataframe to be passed to the function.

preprocess.data_info(dataframe)[source]

Shows some useful information about the dataset’s features.

Parameters: dataframe (pandas DataFrame) – Dataframe containing the features of each subject.

preprocess.df_split(dataframe)[source]

Splits the dataframe’s subjects into two groups based on their clinical classification: ASD (Autism Spectrum Disorder) and Controls.

Parameters:

dataframe (pandas DataFrame) – The dataframe of data to be split.

Returns:

df_AS (pandas DataFrame) – Dataframe containing ASD cases.
df_TD (pandas DataFrame) – Dataframe containing controls.
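A minimal sketch of such a split is given here for orientation. It assumes the diagnosis is stored in the “DX_GROUP” column (listed among the covariates of this module) and that the value 1 marks ASD subjects; that coding, the function name `split_cases_controls` and the toy data are assumptions of this example, not taken from the module's source.

```python
import pandas as pd


def split_cases_controls(dataframe, asd_label=1):
    """Return (df_AS, df_TD) by filtering on the DX_GROUP column.

    asd_label=1 is an assumed coding for ASD subjects; adjust as needed.
    """
    is_asd = dataframe["DX_GROUP"] == asd_label
    df_AS = dataframe[is_asd]
    df_TD = dataframe[~is_asd]
    return df_AS, df_TD


# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {"DX_GROUP": [1, 2, 2, 1], "AGE_AT_SCAN": [10.5, 12.0, 30.2, 8.9]},
    index=["s1", "s2", "s3", "s4"],
)
df_AS, df_TD = split_cases_controls(toy)
```

Boolean masking keeps the original index, so subjects can still be traced back to the full dataframe after the split.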

preprocess.drop_covars(dataframe)[source]

Drops the following covariate and confounding-variable columns from the dataframe: “SITE”, “AGE_AT_SCAN”, “DX_GROUP”, “SEX”, “FIQ”.

Parameters:

dataframe (pandas DataFrame) – Dataframe from which the indicated columns will be dropped.

Returns:

dataframe (pandas DataFrame) – Dataframe without the aforementioned columns.
covar_list (list) – List of strings containing the name of the dropped columns.
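The behaviour described above can be sketched as follows. The column names come from the description; the function name `drop_covariates`, the feature column `feature_a` and the toy values are hypothetical.

```python
import pandas as pd

# Column names taken from the description above.
COVAR_LIST = ["SITE", "AGE_AT_SCAN", "DX_GROUP", "SEX", "FIQ"]


def drop_covariates(dataframe, covar_list=COVAR_LIST):
    """Return a copy of the dataframe without the covariate columns,
    together with the list of dropped column names."""
    reduced = dataframe.drop(columns=list(covar_list))
    return reduced, list(covar_list)


# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "SITE": ["A", "B"],
        "AGE_AT_SCAN": [10.5, 12.0],
        "DX_GROUP": [1, 2],
        "SEX": [1, 2],
        "FIQ": [100.0, 110.0],
        "feature_a": [0.3, 0.7],
    }
)
reduced, dropped = drop_covariates(toy)
```

`DataFrame.drop` returns a new object here, so the input dataframe keeps its covariate columns.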

preprocess.neuroharmonize(dataframe, covariate=['SITE', 'AGE_AT_SCAN'])[source]

Harmonizes the dataset using neuroHarmonize, a harmonization tool for multi-site neuroimaging analysis. Workflow:

1. Load your data and all numeric covariates.
2. Run harmonization and store the adjusted data.

Parameters:

dataframe (pandas DataFrame) – Input dataframe containing all covariates to control for during harmonization. Must have a single column named ‘SITE’ with labels that identify the sites.
covariate (list, default=['SITE', 'AGE_AT_SCAN']) – List of strings naming the covariates to be preserved during harmonization. All covariates must be encoded numerically (no categorical variables), and the list must contain a single “SITE” entry for the column with the site labels.

Returns:

df_neuro_harmonized – Dataframe containing harmonized data.

Return type:

pandas DataFrame
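The two workflow steps can be sketched as below, under stated assumptions: every column that is not a covariate is treated as a numeric feature, and the helper names `prepare_harmonization_inputs` and `harmonize` are hypothetical. The only neuroHarmonize call used is `harmonizationLearn(data, covars)`, which returns the fitted model and the adjusted data array. It is imported inside the function so the input preparation can be tried without the package installed.

```python
import pandas as pd


def prepare_harmonization_inputs(dataframe, covariate=("SITE", "AGE_AT_SCAN")):
    """Split a dataframe into the feature table and the covariate table."""
    covars = dataframe[list(covariate)]
    features = dataframe.drop(columns=list(covariate))
    return features, covars


def harmonize(dataframe, covariate=("SITE", "AGE_AT_SCAN")):
    """Run neuroHarmonize and return the adjusted features as a dataframe."""
    # Imported here so the helper above works without neuroHarmonize installed.
    from neuroHarmonize import harmonizationLearn

    features, covars = prepare_harmonization_inputs(dataframe, covariate)
    _model, adjusted = harmonizationLearn(features.to_numpy(dtype=float), covars)
    return pd.DataFrame(adjusted, index=features.index, columns=features.columns)


# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "SITE": ["A", "A", "B", "B"],
        "AGE_AT_SCAN": [10.5, 12.0, 30.2, 8.9],
        "feat_1": [1.0, 2.0, 3.0, 4.0],
        "feat_2": [0.1, 0.2, 0.3, 0.4],
    }
)
features, covars = prepare_harmonization_inputs(toy)
```

Calling `harmonize(toy)` would additionally require neuroHarmonize and a dataset large enough for the model to be estimated.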

preprocess.normalization(dataframe)[source]

Normalizes the data by scaling each feature to the (0, 1) range.

Parameters: dataframe (pandas DataFrame) – Dataframe to be passed to the function.
Returns: norm_df – Normalized dataframe.
Return type: pandas DataFrame
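As an illustration of the scaling, here is a pandas-only min-max sketch. It is not necessarily how the module implements it, and the function name `min_max_normalize` is hypothetical.

```python
import pandas as pd


def min_max_normalize(dataframe):
    """Scale every column to the 0-1 range with (x - min) / (max - min)."""
    col_min = dataframe.min()
    col_range = dataframe.max() - col_min
    return (dataframe - col_min) / col_range


# Toy data, invented for illustration only.
toy = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [2.0, 4.0, 6.0]})
norm_df = min_max_normalize(toy)
```

A constant column has a zero range and would produce NaN values with this sketch, so such columns need separate handling.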

preprocess.plot_box(dataframe, feat_x, feat_y)[source]

Draws a box plot to show the distribution of one feature with respect to another. Plots will be saved in the data_plots folder.

Parameters:

dataframe (pandas DataFrame) – Dataframe from which the specified features are taken.
feat_x (string) – Feature shown on the x-axis of the box plot.
feat_y (string) – Feature shown on the y-axis of the box plot.

preprocess.plot_histogram(dataframe, feature)[source]

Plots a histogram of a given feature of the dataframe. Plots will be saved in the data_plots folder.

Parameters:

dataframe (pandas DataFrame) – Dataframe on which the hist method is applied.
feature (string) – Feature to plot the histogram of.
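A rough sketch of plotting and saving such a histogram with matplotlib follows. The function name `save_histogram`, the output file name pattern and the bin count are assumptions of this example. The usage line writes to a temporary directory rather than to data_plots so the example leaves no files behind.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # Non-interactive backend: write files without a display.
import matplotlib.pyplot as plt
import pandas as pd


def save_histogram(dataframe, feature, out_dir="data_plots"):
    """Plot a histogram of one column and save it as a PNG file."""
    os.makedirs(out_dir, exist_ok=True)
    fig, ax = plt.subplots()
    dataframe[feature].plot.hist(ax=ax, bins=20)
    ax.set_xlabel(feature)
    ax.set_title(f"Histogram of {feature}")
    out_path = os.path.join(out_dir, f"{feature}_hist.png")
    fig.savefig(out_path)
    plt.close(fig)  # Free the figure once it has been written.
    return out_path


# Toy data, invented for illustration only.
toy = pd.DataFrame({"AGE_AT_SCAN": [8.0, 10.5, 12.0, 30.2, 45.1]})
out_path = save_histogram(toy, "AGE_AT_SCAN", out_dir=tempfile.mkdtemp())
```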

preprocess.read_df(dataset_path)[source]

Reads a .csv data file and returns it as a pandas dataframe. The ID and acquisition site of each subject are extracted from the “FILE_ID” column. The site is stored in its own dataframe column, while the ID is used as the dataframe index.

Parameters: dataset_path (str) – Path to the dataset file.
Returns: df – Dataframe containing the features of each subject by rows.
Return type: pandas DataFrame
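The extraction from “FILE_ID” might look like the sketch below. It assumes that each value has the layout `<site>_<numeric id>`, where the site name may itself contain underscores. That layout, the function name `read_with_site`, the index name and the sample rows are assumptions of this example.

```python
import io

import pandas as pd


def read_with_site(csv_source):
    """Read a CSV and derive the index and a SITE column from FILE_ID."""
    df = pd.read_csv(csv_source)
    # Split on the last underscore only: "<site>_<numeric id>" (assumed layout).
    parts = df["FILE_ID"].str.rsplit("_", n=1, expand=True)
    df["SITE"] = parts[0]
    df.index = parts[1].astype(int)
    df.index.name = "ID"
    return df


# Toy CSV content, invented for illustration only.
toy_csv = io.StringIO(
    "FILE_ID,AGE_AT_SCAN\n"
    "SiteA_0000001,10.5\n"
    "SiteB_1_0000002,12.0\n"
)
df = read_with_site(toy_csv)
```

Splitting from the right keeps multi-part site names such as `SiteB_1` intact.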

Main module in which different models are compared on the ABIDE dataset using GridSearchCV. The user must specify whether harmonization by provenance site should be performed, using the proper command from the terminal (see helper). If nothing is specified, harmonization is not performed. The best estimators found are saved locally and used to predict age on the test set.

Workflow:

1. Read the ABIDE dataframe and apply some preprocessing.
2. Split the dataframe into cases and controls, and split the latter (CTR) into train and test sets.
3. Run cross-validation on the training set. The best model settings are saved in the “best estimator” folder.
4. Use the best models to predict age on the CTR train and test sets and, finally, on the ASD dataset, to compare predictions between healthy subjects and cases.

For each dataset, all plots are saved in the “images” folder.

brain_age_pred.make_predict(dataframe, model_name, harm_flag=False)[source]

Loads a pre-trained model to make predictions on “unseen” test data.

Stores the score metrics of the prediction.

Parameters:

dataframe (pandas dataframe) – Input dataframe to make predictions on.
model_name (string) – Name of the chosen model.
harm_flag (boolean, DEFAULT=False) – Flag indicating if the dataframe has been previously harmonized.

Returns:

predicted_age (array-like) – Array containing the predicted age of each subject.
y_test (pandas dataframe) – Pandas dataframe column containing the ground truth age.
score_metrics (dictionary) – Dictionary containing names of metrics as keys and result metrics for a specific model as values.
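To make the shape of `score_metrics` concrete, here is a NumPy-only sketch of such a dictionary. The choice of metrics (mean absolute error, mean squared error, Pearson correlation), the key names and the function name `score_prediction` are assumptions of this example. The metrics the module actually stores are not listed here.

```python
import numpy as np


def score_prediction(y_true, y_pred):
    """Return a dict mapping metric names to values for one prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    return {
        "MAE": float(np.mean(np.abs(errors))),
        "MSE": float(np.mean(errors ** 2)),
        "PR": float(np.corrcoef(y_true, y_pred)[0, 1]),
    }


# Toy ages, invented for illustration only.
y_test = [10.0, 20.0, 30.0]
predicted_age = [12.0, 18.0, 33.0]
score_metrics = score_prediction(y_test, predicted_age)
```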

Helper module containing useful functions for plotting the final scores.

predict_helper.plot_scores(y_test, age_predicted, metrics, model_name='Regressor model', dataframe_name='Dataframe')[source]

Plots the results of the predictions vs ground truth with related metrics scores.

Parameters:

y_test (pandas dataframe) – Pandas dataframe column containing the ground truth age.
age_predicted (array-like) – Array containing the predicted age of each subject.
metrics (dictionary) – Dictionary containing names of metrics as keys and result metrics for a specific model as values.
model_name (string) – Model’s name, DEFAULT=”Regressor model”.
dataframe_name (string) – Dataframe’s name, DEFAULT=”Dataframe”.

predict_helper.residual_plot(true_age1, pred_age1, true_age2, pred_age2, model_name, harm_flag)[source]

Computes the difference (delta) between the age predicted by a specific model and the true age on the control test and ASD dataframes.

Parameters:

true_age1 (array-like) – Test feature from the first dataframe.
pred_age1 (array-like) – Predicted feature from the first dataframe.
true_age2 (array-like) – Test feature from the second dataframe.
pred_age2 (array-like) – Predicted feature from the second dataframe.
model_name (string-like) – Name of the model used for prediction.
harm_flag (boolean) – Flag indicating if the dataframe on which prediction was performed has been previously harmonized.
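The delta itself is a simple element-wise difference. The sketch below assumes the sign convention predicted minus true, so a positive value means the model overestimates age. The function name `age_delta` and the sample ages are hypothetical.

```python
import numpy as np


def age_delta(true_age, pred_age):
    """Return predicted minus true age for each subject."""
    return np.asarray(pred_age, dtype=float) - np.asarray(true_age, dtype=float)


# Toy ages, invented for illustration only.
delta_ctr = age_delta([10.0, 20.0, 30.0], [11.0, 19.0, 30.0])
delta_asd = age_delta([10.0, 20.0], [13.0, 22.0])
```

Comparing the two delta distributions (for example their means) is one way to contrast controls and cases.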

Module containing functions to perform GridSearchCV cross validation for hyperparameters and parameters optimization.

grid_CV.model_tuner_cv(dataframe, model, model_name, harm_flag)[source]

Creates a pipeline and performs (K-fold) cross-validation for hyperparameter optimization.

It uses the hyperparameter grid dictionary, which specifies, for each chosen model, the values over which the grid search (GSCV) is performed.

Parameters:

dataframe (pandas dataframe) – Input dataframe containing data.
model (function) – Regression model function.
model_name (string) – Name of the used model.

Returns:

best_estimator – Model fitted with grid search cross validation for hyperparameters.

Return type:

object-like
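A minimal scikit-learn sketch of a pipeline tuned with K-fold grid search is shown below. The scaler step, the scoring choice, the Ridge model, the grid values and the function name `tune_model` are all assumptions of this example rather than the module's actual configuration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def tune_model(X, y, model, param_grid, n_splits=5):
    """Grid-search the pipeline's hyperparameters with K-fold CV."""
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    search = GridSearchCV(
        pipe,
        param_grid=param_grid,
        cv=KFold(n_splits=n_splits, shuffle=True, random_state=0),
        scoring="neg_mean_absolute_error",
    )
    search.fit(X, y)
    # With the default refit=True this is refitted on the whole training set.
    return search.best_estimator_


# Synthetic data, generated for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=60)
best_estimator = tune_model(X, y, Ridge(), {"model__alpha": [0.1, 1.0, 10.0]})
```

Grid keys use the `<step name>__<parameter>` form so that GridSearchCV can route each value to the right pipeline step.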

Module for Deep Dense Network implementation.

class DDNregressor.AgeRegressor(learning_rate=0.001, batch_size=32, dropout_rate=0.2, epochs=100, verbose=False)[source]

Class describing a Deep Dense Network used as a linear regressor.

The class inherits from BaseEstimator for easy integration into scikit-learn’s pipeline and grid search. Linear regression is performed using ‘mean absolute error’ as the loss function to minimize.

Parameters:

dropout_rate (float) – Dropout rate value to be passed to dropout layer.
epochs (int) – Number of iterations on entire dataset.
verbose (bool, DEFAULT=False) – If True, prints the model’s summary.

model

Compiled model.

Type: object

Examples

Layer (type)            Output Shape     Param #
input_1 (InputLayer)    [(None, 64)]     0
dense (Dense)           (None, 128)      8320
dense_1 (Dense)         (None, 64)       8256
dense_2 (Dense)         (None, 32)       2080
dropout (Dropout)       (None, 32)       0
dense_3 (Dense)         (None, 16)       528
dense_4 (Dense)         (None, 1)        17

Non-trainable params: 0
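The “Param #” values in the summary above follow from the usual count for a fully connected layer: inputs times units for the weights, plus one bias per unit. The short sketch below reproduces them; the function name `dense_param_counts` is hypothetical. Dropout adds no parameters and keeps the width unchanged, so it is skipped.

```python
def dense_param_counts(input_dim, layer_units):
    """Return the parameter count of each Dense layer in a stack."""
    counts = []
    n_in = input_dim
    for n_out in layer_units:
        counts.append(n_in * n_out + n_out)  # weights + biases
        n_in = n_out
    return counts


# 64 input features followed by Dense layers of 128, 64, 32, 16 and 1 units.
counts = dense_param_counts(64, [128, 64, 32, 16, 1])
total_params = sum(counts)
```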

fit(X, y, call_backs=None, val_data=None)[source]

Fit method. Builds the NN and fits using MAE.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input data on which the fit will be performed.
y (array of shape (n_samples,)) – Array of labels used in train/validation.

predict(X)[source]

Predict method. Makes prediction on test data.

Parameters: X (array-like of shape (n_samples, n_features)) – Input data on which prediction will be performed.
Returns: self.model.predict – Model prediction.
Return type: array of shape (n_samples,)

Module testing the reproducibility of the results on each site’s dataframe. If no argument is given on the command line, the program runs without data harmonization. The analysis is performed on control subjects.

Main module in which different models are compared on the ABIDE dataset. Training and prediction are performed using data from one specific site as the test set and data from the other sites as the training set. The user must specify whether harmonization by provenance site should be performed, using the proper command from the terminal (see helper). If nothing is specified, harmonization is not performed.

Workflow:

1. Read the ABIDE dataframe and apply some preprocessing.
2. Split the dataframe into cases and controls, then split the CTR set into train/test.
3. Run cross-validation on the training set.
4. Predict on the site test set.

For each split, all plots are saved in the “images_SITE” folder. Metrics obtained from each cross-validation are stored in the “/metrics/site” folder.
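The site-wise splitting can be sketched with plain pandas as below: each site in turn is held out as the test set while the remaining sites form the training set. The generator name `leave_one_site_out` and the toy data are hypothetical, and this is only one possible way to write the loop.

```python
import pandas as pd


def leave_one_site_out(dataframe, site_column="SITE"):
    """Yield (site, train_df, test_df) with each site held out in turn."""
    for site in sorted(dataframe[site_column].unique()):
        is_test = dataframe[site_column] == site
        yield site, dataframe[~is_test], dataframe[is_test]


# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {"SITE": ["A", "B", "A", "C"], "AGE_AT_SCAN": [10.5, 12.0, 30.2, 8.9]}
)
splits = list(leave_one_site_out(toy))
```

Each element of `splits` can then be passed to cross-validation (training part) and prediction (held-out site).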

brain_age_site.predict_on_site(x_pred, y_pred, model, site_name, model_name, harm_opt)[source]

Plots the results of the predictions vs ground truth with related metrics scores.

Parameters:

x_pred (array-like of shape (n_samples, n_features)) – Array of data on which to perform prediction.
y_pred (array-like) – Array containing labels.
site_name (string) – Site’s name.
model_name (string) – Model’s name.
harm_opt (string) – String indicating if the dataframe has been previously harmonized.

Returns:

age_predicted (array-like) – Array containing the predicted age of each subject.
score_metrics (dictionary) – Dictionary containing names of metrics as keys and result metrics for a specific model as values.