Utils

Util functions

scaler_mapper

scaler_mapper(cont_cols, target_col, identifier, scaler_mapper_def=None)

Maps scaler functions to the appropriate columns. By default, StandardScaler is assigned to the continuous feature columns, while the target and identifier columns pass through unchanged. This behavior can be changed with scaler_mapper_def. Only columns defined in the mapper object are present in the transformed dataset.

Parameters:

Name	Type	Description	Default
`cont_cols`	`list`	list of continuous feature columns in the dataset	required
`target_col`	`str`	target column	required
`identifier`	`str`	identifier column	required
`scaler_mapper_def`	`dict`	optional dictionary that contains keys ['cont_cols', 'target_col', 'identifier_col'] with their corresponding scaler functions from sklearn library	`None`

Returns:

Name	Type	Description
`scaler_mapper`	`DataFrameMapper`	scaler object mapping sklearn scalers to columns in pandas dataframe

Source code in src/utils.py

def scaler_mapper(
    cont_cols: List[str],
    target_col: str,
    identifier: str,
    scaler_mapper_def: Union[dict, None] = None,
):
    """Function that maps scaler functions to appropriate columns. By default assigns scaler to continuous feature columns
    unless scaler_mapper_def overrides this behavior.
    Only columns defined in mapper object will be present in the transformed dataset.

    Args:
        cont_cols (list): list of continuous feature columns in the dataset
        target_col (str): target column
        identifier (str): identifier column
        scaler_mapper_def (dict): optional dictionary that contains keys ['cont_cols', 'target_col',
            'identifier_col'] with their corresponding scaler functions from sklearn library

    Returns:
        scaler_mapper (DataFrameMapper): scaler object mapping sklearn scalers to columns in pandas dataframe
    """
    if scaler_mapper_def is None:
        cont_cols_def = gen_features(columns=list(map(lambda x: [x], cont_cols)), classes=[StandardScaler])

        target_col_def = [([target_col], None, {})]
        identifier_def = [([identifier], None, {})]

    else:
        cont_cols_def = gen_features(
            columns=list(map(lambda x: [x], cont_cols)),
            classes=[scaler_mapper_def["cont_cols"]],
        )

        target_col_def = [([target_col], scaler_mapper_def["target_col"], {})]
        identifier_def = [([identifier], scaler_mapper_def["identifier_col"], {})]

    scaler_mapper = DataFrameMapper(cont_cols_def + target_col_def + identifier_def, df_out=True)
    return scaler_mapper
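
The mapping idea above can be illustrated without sklearn_pandas. The following is a minimal sketch, not the library's implementation: `standardize`, `build_mapping` and `apply_mapping` are hypothetical stand-ins, and a plain dict of lists stands in for a DataFrame. It shows the two behaviours the description mentions: continuous columns are scaled while target and identifier pass through, and unmapped columns are dropped.

```python
from statistics import mean, pstdev


def standardize(values):
    """Stand-in for StandardScaler: zero mean, unit population variance."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # constant column: avoid dividing by zero
    return [(v - mu) / sigma for v in values]


def build_mapping(cont_cols, target_col, identifier, scaler=standardize):
    """Return (column, transform) pairs; None means 'pass through unchanged'."""
    return [(c, scaler) for c in cont_cols] + [(target_col, None), (identifier, None)]


def apply_mapping(table, mapping):
    """Apply each transform; columns absent from the mapping are dropped."""
    return {col: (fn(table[col]) if fn else list(table[col])) for col, fn in mapping}


table = {
    "id": ["a", "b", "c"],
    "age": [20.0, 30.0, 40.0],
    "label": [0, 1, 0],
    "note": ["x", "y", "z"],  # not mapped, so it will not appear in the output
}
mapping = build_mapping(cont_cols=["age"], target_col="label", identifier="id")
scaled = apply_mapping(table, mapping)
print(sorted(scaled))  # ['age', 'id', 'label']
```

Because unmapped columns disappear, any column needed downstream has to be listed in cont_cols or passed as the target or identifier.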

optimize_df

optimize_df(df, identifier, verbose=True)

Simple function to assign appropriate column data types in a pandas DataFrame

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	dataset	required
`identifier`	`str`	identifier column	required
`verbose`	`boolean`	option to show reduced memory usage	`True`

Returns:

Name	Type	Description
`data`	`DataFrame`	optimized dataset

Source code in src/utils.py

def optimize_df(df: DataFrame, identifier: str, verbose: bool = True):
    """Simple function to assign appropriate column data types in a pandas DataFrame

    Args:
        df (DataFrame): dataset
        identifier (str): identifier column
        verbose (boolean): option to show reduced memory usage

    Returns:
        data (DataFrame): optimized dataset
    """
    data = df.convert_dtypes()
    data[identifier] = data[identifier].astype(str)
    if verbose:
        reduction = (1 - (data.memory_usage(deep=True).sum() / df.memory_usage(deep=True).sum())) * 100
        print(f"Memory usage reduced by {reduction:0.2f}%")
    return data
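
The percentage printed in verbose mode is a plain ratio of total bytes after and before conversion. A small sketch of that arithmetic follows; `memory_reduction_pct` is a hypothetical helper name and the byte counts are made-up inputs, not measurements.

```python
def memory_reduction_pct(bytes_before: int, bytes_after: int) -> float:
    """Percentage by which memory shrank; negative if it grew."""
    if bytes_before <= 0:
        raise ValueError("bytes_before must be positive")
    return (1 - bytes_after / bytes_before) * 100


print(f"Memory usage reduced by {memory_reduction_pct(2000, 1500):0.2f}%")
# Memory usage reduced by 25.00%
```

The value can be negative: nullable dtypes and the string cast of the identifier column may use more memory than the original columns, in which case the message reports a negative reduction.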

LGBM_custom_score

LGBM_custom_score(n_class)

Class defining evaluation scores for the case where a custom objective (fobj), e.g. focal loss, is used in LightGBM model training. From the documentation: 'The predicted values. If fobj is specified, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.'

Source code in src/utils.py

def __init__(self, n_class: int):
    self.n_class = n_class

lgbm_accuracy

lgbm_accuracy(preds_raw, lgbDataset)

Implementation of the accuracy score to be used as an evaluation score for LightGBM. The adaptation is required because, when using custom losses, the raw prediction needs to be passed through a sigmoid to represent a probability.

Parameters:

Name	Type	Description	Default
`preds_raw`	`ndarray`	raw (untransformed) predictions	required
`lgbDataset`	`lightgbm.Dataset`	dataset, containing labels, used for prediction	required

Returns:

Name	Type	Description
`result`	`tuple`	tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)

Source code in src/utils.py

def lgbm_accuracy(self, preds_raw: ndarray, lgbDataset: Dataset):
    """Implementation of the accuracy score to be used as evaluation
    score for lightgbm. The adaptation is required since when using custom losses
    the raw prediction needs to be passed through a sigmoid to represent a
    probability.

    Args:
        preds_raw (ndarray): raw (untransformed) predictions
        lgbDataset (lightgbm.Dataset): dataset, containing labels, used for prediction

    Returns:
        result (tuple): tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)
    """
    y_true, preds = self._prediction(preds_raw=preds_raw, lgbDataset=lgbDataset)
    result = ("accuracy", accuracy_score(y_true, preds), True)
    return result
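
The `_prediction` helper is not shown on this page, so the following is only a sketch of the conversion the description refers to for the binary case: raw margins go through a sigmoid, are thresholded at 0.5 (an assumed cutoff), and the result is packed into the (name, value, is_higher_better) tuple. All function names here are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Map a raw margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_labels(raw_margins, threshold=0.5):
    """Turn raw margins into hard 0/1 labels via the sigmoid."""
    return [int(sigmoid(z) >= threshold) for z in raw_margins]


def accuracy_eval(raw_margins, y_true):
    """Return a (name, value, is_higher_better) tuple for an accuracy metric."""
    y_pred = binary_labels(raw_margins)
    value = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return ("accuracy", value, True)


raw = [-2.0, 0.3, 1.5, -0.1]
labels = [0, 1, 1, 1]
print(accuracy_eval(raw, labels))  # ('accuracy', 0.75, True)
```

Comparing the raw margins to the labels directly would be wrong here, since a margin of 0.3 is not a probability of 0.3.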

lgbm_f1

lgbm_f1(preds_raw, lgbDataset)

Implementation of the F1 score to be used as an evaluation score for LightGBM; see the feval documentation. The adaptation is required because, when using custom losses, the raw prediction needs to be passed through a sigmoid to represent a probability.

Parameters:

Name	Type	Description	Default
`preds_raw`	`ndarray`	raw (untransformed) predictions	required
`lgbDataset`	`lightgbm.Dataset`	dataset, containing labels, used for prediction	required

Returns:

Name	Type	Description
`result`	`tuple`	tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)

Source code in src/utils.py

def lgbm_f1(self, preds_raw: ndarray, lgbDataset: Dataset):
    """Implementation of the f1 score to be used as evaluation score for lightgbm
    see feval [documentation](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html).
    The adaptation is required since when using custom losses
    the raw prediction needs to be passed through a sigmoid to represent a
    probability.

    Args:
        preds_raw (ndarray): raw (untransformed) predictions
        lgbDataset (lightgbm.Dataset): dataset, containing labels, used for prediction

    Returns:
        result (tuple): tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)
    """
    y_true, preds = self._prediction(preds_raw=preds_raw, lgbDataset=lgbDataset)
    result = ("f1", f1_score(y_true, preds, average="weighted"), True)
    return result
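
The source uses average="weighted", which averages per-class F1 scores with weights proportional to each class's number of true samples. The pure-Python sketch below reproduces that averaging on toy labels; `weighted_f1` is a hypothetical name and is not the scikit-learn implementation.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = 0.0
    for cls, support in Counter(y_true).items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = support - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += support * f1
    return total / len(y_true)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.6556
```

Here the per-class scores are 0.5, 0.8 and 2/3, each with support 2, which gives 59/90.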

lgbm_focal_loss

lgbm_focal_loss(preds_raw, lgbDataset, alpha, gamma)

Adaptation of the Focal Loss for LightGBM to be used as training loss. See the original paper: https://arxiv.org/pdf/1708.02002.pdf and the custom training loss documentation: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html
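
For reference, the binary focal loss as defined in the cited paper is shown below, with p obtained from the raw margin z through a sigmoid. The `_focal_loss` helper is not shown on this page, so whether it uses exactly this alpha-balanced form is an assumption.

```latex
\[
\mathrm{FL}(p_t) = -\alpha_t \,(1 - p_t)^{\gamma}\, \log(p_t),
\qquad
p_t =
\begin{cases}
p & \text{if } y = 1 \\
1 - p & \text{otherwise}
\end{cases},
\qquad
p = \sigma(z) = \frac{1}{1 + e^{-z}}
\]
```

The weight \(\alpha_t\) is defined analogously to \(p_t\) (\(\alpha\) for the positive class, \(1 - \alpha\) otherwise). With \(\gamma = 0\) the expression reduces to \(\alpha\)-weighted cross-entropy.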

Parameters:

Name	Type	Description	Default
`preds_raw`	`ndarray`	array with the raw predictions	required
`lgbDataset`	`Dataset`	training dataset	required
`alpha`	`float`	loss function variable	required
`gamma`	`float`	loss function variable	required

Returns:

Name	Type	Description
`grad`	`float`	The value of the first order derivative (gradient) of the loss with respect to the elements of preds for each sample point.
`hess`	`float`	The value of the second order derivative (Hessian) of the loss with respect to the elements of preds for each sample point.

Source code in src/utils.py

def lgbm_focal_loss(self, preds_raw: ndarray, lgbDataset: Dataset, alpha: float, gamma: float):
    """Adaptation of the Focal Loss for lightgbm to be used as training loss.
    See original paper:
    * https://arxiv.org/pdf/1708.02002.pdf
    and custom training loss documentation:
    * https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html

    Args:
        preds_raw (ndarray): array with the raw predictions
        lgbDataset (Dataset): training dataset
        alpha (float): loss function variable
        gamma (float): loss function variable

    Returns:
        grad (float): The value of the first order derivative (gradient) of the loss with
            respect to the elements of preds for each sample point.
        hess (float): The value of the second order derivative (Hessian) of the loss with
            respect to the elements of preds for each sample point.
    """
    y_true = lgbDataset.label
    # N observations x num_class arrays
    if self.n_class > 2:
        y_true = np.eye(self.n_class)[y_true.astype("int")]
        y_pred = preds_raw.reshape(-1, self.n_class, order="F")
    else:
        y_pred = preds_raw  # keep raw margins as floats; an int cast would truncate them

    partial_fl = lambda x: self._focal_loss(x, y_true, alpha, gamma)
    grad = derivative(partial_fl, y_pred, n=1, dx=1e-6)
    hess = derivative(partial_fl, y_pred, n=2, dx=1e-6)
    if self.n_class > 2:
        return grad.flatten("F"), hess.flatten("F")
    else:
        return grad, hess
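
The gradient and Hessian above are obtained numerically rather than from a closed form. The sketch below shows the same idea with explicit central differences on a binary focal loss. It is an illustration under assumptions: `focal_loss` follows the paper's alpha-balanced form (the `_focal_loss` helper is not shown here), and the step `dx=1e-4` is a choice made for this example, since a very small step makes the second difference sensitive to floating-point rounding.

```python
import math


def focal_loss(z, y, alpha=0.25, gamma=2.0):
    """Binary focal loss of one raw margin z against label y in {0, 1}."""
    p = 1.0 / (1.0 + math.exp(-z))
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def central_grad(f, x, dx):
    """First derivative by central difference."""
    return (f(x + dx) - f(x - dx)) / (2.0 * dx)


def central_hess(f, x, dx):
    """Second derivative by central difference."""
    return (f(x + dx) - 2.0 * f(x) + f(x - dx)) / (dx * dx)


def grad_hess(raw_margins, labels, alpha=0.25, gamma=2.0, dx=1e-4):
    """Per-sample gradient and Hessian of the loss w.r.t. the raw margin."""
    grads, hessians = [], []
    for z, y in zip(raw_margins, labels):
        f = lambda x, y=y: focal_loss(x, y, alpha, gamma)
        grads.append(central_grad(f, z, dx))
        hessians.append(central_hess(f, z, dx))
    return grads, hessians


# Sanity check: gamma=0, alpha=0.5 is half the log loss, whose derivatives
# w.r.t. z are 0.5 * (p - y) and 0.5 * p * (1 - p).
g, h = grad_hess([0.0], [1], alpha=0.5, gamma=0.0)
print(round(g[0], 6), round(h[0], 6))  # -0.25 0.125
```

Checking the numerical result against a case with a known closed form, as done here, is a cheap way to validate a custom objective before training with it.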

lgbm_focal_loss_eval

lgbm_focal_loss_eval(preds_raw, lgbDataset, alpha, gamma)

Adaptation of the Focal Loss for LightGBM to be used as evaluation loss. See the original paper: https://arxiv.org/pdf/1708.02002.pdf

Parameters:

Name	Type	Description	Default
`preds_raw`	`ndarray`	array with the raw predictions	required
`lgbDataset`	`Dataset`	dataset, containing labels, used for evaluation	required
`alpha`	`float`	loss function variable	required
`gamma`	`float`	loss function variable	required

Source code in src/utils.py

def lgbm_focal_loss_eval(self, preds_raw: ndarray, lgbDataset: Dataset, alpha: float, gamma: float):
    """Adaptation of the Focal Loss for lightgbm to be used as evaluation loss.
    See original paper https://arxiv.org/pdf/1708.02002.pdf

    Args:
        preds_raw (ndarray): array with the raw predictions
        lgbDataset (Dataset): dataset, containing labels, used for evaluation
        alpha (float): loss function variable
        gamma (float): loss function variable

    Returns:
        result (tuple): tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)
    """
    y_true = lgbDataset.label
    # N observations x num_class arrays
    if self.n_class > 2:
        y_true = np.eye(self.n_class)[y_true.astype("int")]
        y_pred = preds_raw.reshape(-1, self.n_class, order="F")
    else:
        y_pred = preds_raw

    loss = self._focal_loss(y_pred, y_true, alpha, gamma)
    result = ("focal_loss", np.mean(loss), False)
    return result
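
In the multiclass branch, the flat prediction vector is reshaped with order="F" and the labels are one-hot encoded with np.eye. The pure-Python sketch below mimics both steps to make the index arithmetic explicit. It assumes the flat vector is grouped by class (all rows for class 0, then all rows for class 1, and so on), which is the layout that reshape(-1, n_class, order="F") undoes; the helper names are hypothetical.

```python
def reshape_class_major(flat, n_class):
    """Rebuild an (n_rows x n_class) table from a class-major flat vector.

    Element flat[k * n_rows + i] is the score of row i for class k.
    """
    if len(flat) % n_class:
        raise ValueError("length must be a multiple of n_class")
    n_rows = len(flat) // n_class
    return [[flat[k * n_rows + i] for k in range(n_class)] for i in range(n_rows)]


def one_hot(labels, n_class):
    """One row per label with a 1.0 at the label's index."""
    return [[1.0 if k == int(y) else 0.0 for k in range(n_class)] for y in labels]


flat = [0.1, 0.2, 1.1, 1.2, 2.1, 2.2]  # two rows, three classes
print(reshape_class_major(flat, 3))  # [[0.1, 1.1, 2.1], [0.2, 1.2, 2.2]]
print(one_hot([2, 0], 3))            # [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```

After both steps, predictions and labels share the same rows-by-classes shape, so the loss can be computed elementwise.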

lgbm_precision

lgbm_precision(preds_raw, lgbDataset)

Implementation of the precision score to be used as an evaluation score for LightGBM. The adaptation is required because, when using custom losses, the raw prediction needs to be passed through a sigmoid to represent a probability.

Parameters:

Name	Type	Description	Default
`preds_raw`	`ndarray`	raw (untransformed) predictions	required
`lgbDataset`	`lightgbm.Dataset`	dataset, containing labels, used for prediction	required

Returns:

Name	Type	Description
`result`	`tuple`	tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)

Source code in src/utils.py

def lgbm_precision(self, preds_raw: ndarray, lgbDataset: Dataset):
    """Implementation of the precision score to be used as evaluation
    score for lightgbm. The adaptation is required since when using custom losses
    the raw prediction needs to be passed through a sigmoid to represent a
    probability.

    Args:
        preds_raw (ndarray): raw (untransformed) predictions
        lgbDataset (lightgbm.Dataset): dataset, containing labels, used for prediction

    Returns:
        result (tuple): tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)
    """
    y_true, preds = self._prediction(preds_raw=preds_raw, lgbDataset=lgbDataset)
    result = ("precision", precision_score(y_true, preds, average="weighted"), True)
    return result

lgbm_recall

lgbm_recall(preds_raw, lgbDataset)

Implementation of the recall score to be used as an evaluation score for LightGBM. The adaptation is required because, when using custom losses, the raw prediction needs to be passed through a sigmoid to represent a probability.

Parameters:

Name	Type	Description	Default
`preds_raw`	`ndarray`	raw (untransformed) predictions	required
`lgbDataset`	`lightgbm.Dataset`	dataset, containing labels, used for prediction	required

Returns:

Name	Type	Description
`result`	`tuple`	tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)

Source code in src/utils.py

def lgbm_recall(self, preds_raw: ndarray, lgbDataset: Dataset):
    """Implementation of the recall score to be used as evaluation
    score for lightgbm. The adaptation is required since when using custom losses
    the raw prediction needs to be passed through a sigmoid to represent a
    probability.

    Args:
        preds_raw (ndarray): raw (untransformed) predictions
        lgbDataset (lightgbm.Dataset): dataset, containing labels, used for prediction

    Returns:
        result (tuple): tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)
    """
    y_true, preds = self._prediction(preds_raw=preds_raw, lgbDataset=lgbDataset)
    result = ("recall", recall_score(y_true, preds, average="weighted"), True)
    return result
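
Precision and recall share the same true-positive count but divide it by different totals (predicted members of a class versus actual members), so the label in the result tuple has to match the scoring function that produced the value. A pure-Python sketch of the two weighted scores on toy labels follows; the function names are hypothetical and this is not the scikit-learn implementation.

```python
from collections import Counter


def _counts(y_true, y_pred, cls):
    """True positives, number predicted as cls, number actually cls."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    predicted = sum(p == cls for p in y_pred)
    actual = sum(t == cls for t in y_true)
    return tp, predicted, actual


def weighted_precision(y_true, y_pred):
    """Support-weighted mean of tp / predicted over the classes in y_true."""
    total = 0.0
    for cls, support in Counter(y_true).items():
        tp, predicted, _ = _counts(y_true, y_pred, cls)
        total += support * (tp / predicted if predicted else 0.0)
    return total / len(y_true)


def weighted_recall(y_true, y_pred):
    """Support-weighted mean of tp / actual over the classes in y_true."""
    total = 0.0
    for cls, support in Counter(y_true).items():
        tp, _, actual = _counts(y_true, y_pred, cls)
        total += support * (tp / actual)
    return total / len(y_true)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(weighted_precision(y_true, y_pred), 4))  # 0.7222
print(round(weighted_recall(y_true, y_pred), 4))     # 0.6667
```

With support weights, recall collapses to plain accuracy (4 of 6 correct here), while precision does not, which makes a swapped label easy to spot on a small example.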

dl_design

dl_design(input_layer, n_hidden_layers, output_layer, design='funnel')

Class with predefined deep learning hidden layer architectures. Especially useful during hyperparameter tuning with Weights & Biases and Ray Tune to track the effect of architecture design on metrics. Predefined architecture designs are: ["funnel", "pipe", "anti_autoencoder", "trapezoid", "anti_trapezoid", "adj_funnel", "apollo"].

Parameters:

Name	Type	Description	Default
`input_layer`	`int`	size of input layer	required
`n_hidden_layers`	`int`	number of hidden layers	required
`output_layer`	`int`	size of output layer	required
`design`	`str`	type of design	`'funnel'`

Returns:

Name	Type	Description
`hidden_layers`	`list`	list of hidden layers

Source code in src/utils.py

def __init__(
    self,
    input_layer: int,
    n_hidden_layers: int,
    output_layer: int,
    design: Literal[
        "funnel",
        "pipe",
        "anti_autoencoder",
        "trapezoid",
        "anti_trapezoid",
        "adj_funnel",
        "apollo",
    ] = "funnel",
):
    self.design = design
    self.input_layer = input_layer
    self.n_hidden_layers = n_hidden_layers
    self.output_layer = output_layer
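
The methods that turn a design name into layer sizes are not shown on this page. Purely as an illustration of what two of the names might mean, the sketch below assumes "funnel" shrinks linearly from the input size toward the output size and "pipe" keeps a constant width. Both rules and both function names are hypothetical and may differ from the class's actual behaviour.

```python
def funnel(input_layer: int, n_hidden_layers: int, output_layer: int) -> list:
    """Hypothetical funnel: hidden sizes step evenly from input toward output."""
    step = (input_layer - output_layer) / (n_hidden_layers + 1)
    return [round(input_layer - step * (i + 1)) for i in range(n_hidden_layers)]


def pipe(input_layer: int, n_hidden_layers: int) -> list:
    """Hypothetical pipe: every hidden layer has the input width."""
    return [input_layer] * n_hidden_layers


print(funnel(100, 3, 20))  # [80, 60, 40]
print(pipe(100, 3))        # [100, 100, 100]
```

Expressing the architecture as one categorical design name plus a layer count keeps the hyperparameter search space small compared with tuning each layer width separately.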

dl_train_prep

dl_train_prep(data_train, data_valid, identifier, cont_cols, target_col)

Aggregator method to prepare the data for deep models trained with the pytorch-widedeep library. Disclaimer: this method uses the latest additions to the pytorch_widedeep library, which are not yet merged.

Parameters:

Name	Type	Description	Default
`identifier`	`str`	identifier column	required
`data_train`	`DataFrame`	training dataset	required
`data_valid`	`DataFrame`	validation dataset	required
`cont_cols`	`list`	list of continuous feature columns in the dataset	required
`target_col`	`str`	column with predicted value	required

Returns:

Name	Type	Description
`X_train`	`dict`	training dataset dictionary
`X_valid`	`dict`	validation dataset dictionary
`tab_preprocessor`	`TabPreprocessor`	deep tabular dataset preprocessor

Source code in src/utils.py

def dl_train_prep(
    data_train: DataFrame,
    data_valid: DataFrame,
    identifier: str,
    cont_cols: list,
    target_col: str,
):
    """Aggregator method to prepare the data for deep models trained in pytorch-widedeep library.
    Disclaimer: this method uses the latest additions to the pytorch_widedeep library,
    which are not yet merged.

    Args:
        identifier (str): identifier column
        data_train (DataFrame): training dataset
        data_valid (DataFrame): validation dataset
        cont_cols (list): list of continuous feature columns in the dataset
        target_col (str): column with predicted value

    Returns:
        X_train (dict): training dataset dictionary
        X_valid (dict): validation dataset dictionary
        tab_preprocessor (TabPreprocessor): deep tabular dataset preprocessor
    """
    tab_preprocessor = TabPreprocessor(
        embedding_rule=embedding_rule,
        embed_cols=cat_cols,
        continuous_cols=cont_cols,
        shared_embed=False,
        scale=False,
    )

    X_tab_train = tab_preprocessor.fit_transform(data_train.drop(columns=[identifier]))
    X_tab_valid = tab_preprocessor.transform(data_valid.drop(columns=[identifier]))

    Y_train = data_train[target_col].values
    Y_valid = data_valid[target_col].values

    X_train = {"X_tab": X_tab_train, "target": Y_train}
    X_valid = {"X_tab": X_tab_valid, "target": Y_valid}

    return X_train, X_valid, tab_preprocessor

dl_metrics

dl_metrics(n_classes=None)

Auxiliary method to define the metrics tracked during training of deep learning models.

Parameters:

Name	Type	Description	Default
`n_classes`	`int`	number of classes in case of tasks ['binary', 'multiclass']	`None`

Returns:

Name	Type	Description
`metrics_list`	`list`	list of metrics tracked during training of deep learning model

Source code in src/utils.py

def dl_metrics(
    n_classes: Union[int, None] = None,
):
    """Auxiliary method to define the metrics tracked during training of deep learning models.

    Args:
        n_classes (int): number of classes in case of tasks ['binary', 'multiclass']

    Returns:
        metrics_list (list): list of metrics tracked during training of deep learning model
    """
    accuracy = Accuracy(average=None, num_classes=n_classes)
    precision = Precision(average="micro", num_classes=n_classes)
    f1 = F1Score(average=None, num_classes=n_classes)
    recall = Recall(average=None, num_classes=n_classes)

    metrics_list = [accuracy, precision, f1, recall]
    return metrics_list

dl_predict

dl_predict(data, model, tab_preprocessor, wide_preprocessor=None)

Aggregator method to predict the target value from a pandas DataFrame using a pretrained deep learning model.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	dataset to run predictions on	required
`model`	`WideDeep`	pretrained model	required
`tab_preprocessor`	`TabPreprocessor`	deep tabular dataset preprocessor	required
`wide_preprocessor`	`WidePreprocessor`	wide tabular dataset preprocessor	`None`

Returns:

Name	Type	Description
`preds`	`ndarray`	predictions

Source code in src/utils.py

def dl_predict(
    data: DataFrame,
    model: WideDeep,
    tab_preprocessor: TabPreprocessor,
    wide_preprocessor: Union[WidePreprocessor, None] = None,
):
    """Aggregator method to predict target value from pandas Dataframe using pretrained deep learning model.

    Args:
        model (WideDeep): pretrained model
        tab_preprocessor (TabPreprocessor): deep tabular dataset preprocessor
        wide_preprocessor (WidePreprocessor): wide tabular dataset preprocessor

    Returns:
        preds (ndarray): predictions
    """
    if wide_preprocessor:
        X_wide = wide_preprocessor.transform(data)
    else:
        X_wide = None
    X_tab = tab_preprocessor.transform(data)
    preds = model.predict(X_wide=X_wide, X_tab=X_tab)
    return preds