Utils

Utility functions

scaler_mapper

scaler_mapper(cont_cols, target_col, identifier, scaler_mapper_def=None)

Maps scaler functions to the appropriate columns. By default, StandardScaler is assigned to each continuous feature column, and the target and identifier columns pass through unscaled. This behavior can be changed with scaler_mapper_def. Only columns defined in the mapper object will be present in the transformed dataset.

Parameters:

Name Type Description Default
cont_cols list

list of continuous feature columns in the dataset

required
target_col str

target column

required
identifier str

identifier column

required
scaler_mapper_def dict

optional dictionary that contains keys ['cont_cols', 'target_col', 'identifier_col'] with their corresponding scaler functions from sklearn library

None

Returns:

Name Type Description
scaler_mapper DataFrameMapper

scaler object mapping sklearn scalers to columns in pandas dataframe

Source code in src/utils.py
def scaler_mapper(
    cont_cols: List[str],
    target_col: str,
    identifier: str,
    scaler_mapper_def: Union[dict, None] = None,
):
    """Map scaler functions to the appropriate columns. By default, StandardScaler is assigned to each
    continuous feature column, and the target and identifier columns pass through unscaled.
    This behavior can be changed with scaler_mapper_def.
    Only columns defined in the mapper object will be present in the transformed dataset.

    Args:
        cont_cols (list): list of continuous feature columns in the dataset
        target_col (str): target column
        identifier (str): identifier column
        scaler_mapper_def (dict): optional dictionary that contains keys ['cont_cols', 'target_col',
            'identifier_col'] with their corresponding scaler functions from sklearn library

    Returns:
        scaler_mapper (DataFrameMapper): scaler object mapping sklearn scalers to columns in pandas dataframe
    """
    if scaler_mapper_def is None:
        cont_cols_def = gen_features(columns=list(map(lambda x: [x], cont_cols)), classes=[StandardScaler])

        target_col_def = [([target_col], None, {})]
        identifier_def = [([identifier], None, {})]

    else:
        cont_cols_def = gen_features(
            columns=list(map(lambda x: [x], cont_cols)),
            classes=[scaler_mapper_def["cont_cols"]],
        )

        target_col_def = [([target_col], scaler_mapper_def["target_col"], {})]
        identifier_def = [([identifier], scaler_mapper_def["identifier_col"], {})]

    scaler_mapper = DataFrameMapper(cont_cols_def + target_col_def + identifier_def, df_out=True)
    return scaler_mapper
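
The shape of the definition list that scaler_mapper hands to DataFrameMapper can be mimicked without sklearn_pandas. The sketch below is not the library's code: `build_definitions` and `IdentityScaler` are hypothetical names, and it assumes that gen_features instantiates the given class once per column. Under that assumption, the 'cont_cols' entry of scaler_mapper_def should be a scaler class (for example MinMaxScaler rather than MinMaxScaler()), while the 'target_col' and 'identifier_col' entries are used as-is and may be None.

```python
from typing import Any, List, Optional, Tuple

# One entry per column group: (columns, transformer-or-None, options).
Definition = Tuple[List[str], Optional[Any], dict]


class IdentityScaler:
    """Hypothetical stand-in for an sklearn scaler class."""


def build_definitions(
    cont_cols: List[str],
    target_col: str,
    identifier: str,
    scaler_mapper_def: Optional[dict] = None,
    default_scaler: type = IdentityScaler,
) -> List[Definition]:
    """Mimic the list of definitions built inside scaler_mapper (assumed shape)."""
    if scaler_mapper_def is None:
        cont_cls, target_tf, id_tf = default_scaler, None, None
    else:
        cont_cls = scaler_mapper_def["cont_cols"]      # a class, instantiated below
        target_tf = scaler_mapper_def["target_col"]    # used as-is (instance or None)
        id_tf = scaler_mapper_def["identifier_col"]    # used as-is (instance or None)
    # A fresh scaler instance per continuous column, so each column is fitted on its own.
    cont_defs = [([col], [cont_cls()], {}) for col in cont_cols]
    return cont_defs + [([target_col], target_tf, {}), ([identifier], id_tf, {})]


defs = build_definitions(["age", "income"], "label", "customer_id")
print([cols for cols, _, _ in defs])
```

Columns that do not appear in any definition are dropped from the transformed output, which is why the target and identifier columns are listed explicitly with a None transformer.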

optimize_df

optimize_df(df, identifier, verbose=True)

Simple function to assign appropriate column data types in a pandas DataFrame

Parameters:

Name Type Description Default
df DataFrame

dataset

required
identifier str

identifier column

required
verbose bool

option to show reduced memory usage

True

Returns:

Name Type Description
data DataFrame

optimized dataset

Source code in src/utils.py
def optimize_df(df: DataFrame, identifier: str, verbose: bool = True):
    """Simple function to assign appropriate column data types in a pandas DataFrame

    Args:
        df (DataFrame): dataset
        identifier (str): identifier column
        verbose (bool): option to show reduced memory usage

    Returns:
        data (DataFrame): optimized dataset
    """
    data = df.convert_dtypes()
    data[identifier] = data[identifier].astype(str)
    if verbose:
        reduction = (1 - (data.memory_usage(deep=True).sum() / df.memory_usage(deep=True).sum())) * 100
        print(f"Memory usage reduced by {reduction:0.2f}%")
    return data
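
The same steps can be tried on a toy frame. This is a minimal sketch with hypothetical column names, not part of src/utils.py. Note that the reported reduction can be negative: nullable dtypes carry an extra mask, and string identifiers usually take more space than integers.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame(
    {
        "customer_id": [101, 102, 103],
        "age": [34.0, 51.0, 29.0],
        "segment": ["a", "b", "a"],
    }
)

data = df.convert_dtypes()                               # infer the best available dtypes
data["customer_id"] = data["customer_id"].astype(str)    # identifiers are kept as strings

before = df.memory_usage(deep=True).sum()
after = data.memory_usage(deep=True).sum()
reduction = (1 - after / before) * 100                   # negative when memory usage grows

print(data.dtypes)
print(f"Memory usage reduced by {reduction:0.2f}%")
```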

LGBM_custom_score

LGBM_custom_score(n_class)

Class defining evaluation scores for the case where a custom objective (fobj), e.g. focal loss, is used in LightGBM model training. From the LightGBM documentation: 'The predicted values. If fobj is specified, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.'

Source code in src/utils.py
def __init__(self, n_class: int):
    self.n_class = n_class

lgbm_accuracy

lgbm_accuracy(preds_raw, lgbDataset)

Implementation of the accuracy score to be used as an evaluation score for LightGBM. The adaptation is required because, when using custom losses, the raw prediction needs to be passed through a sigmoid to represent a probability.

Parameters:

Name Type Description Default
preds_raw ndarray

raw predictions

required
lgbDataset lightgbm.Dataset

dataset, containing labels, used for prediction

required

Returns:

Name Type Description
result tuple

tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)

Source code in src/utils.py
def lgbm_accuracy(self, preds_raw: ndarray, lgbDataset: Dataset):
    """Implementation of the accuracy score to be used as an evaluation
    score for LightGBM. The adaptation is required because, when using custom losses,
    the raw prediction needs to be passed through a sigmoid to represent a
    probability.

    Args:
        preds_raw (ndarray): raw predictions
        lgbDataset (lightgbm.Dataset): dataset, containing labels, used for prediction

    Returns:
        result (tuple): tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)
    """
    y_true, preds = self._prediction(preds_raw=preds_raw, lgbDataset=lgbDataset)
    result = ("accuracy", accuracy_score(y_true, preds), True)
    return result
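
The `_prediction` helper is not shown on this page. As a rough illustration of what such an evaluation function has to do in the binary case, the sketch below maps raw margins to probabilities with a sigmoid, thresholds them at 0.5, and returns the (name, value, is_higher_better) tuple. `accuracy_from_raw` and the 0.5 threshold are assumptions, not the class's actual implementation.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Map raw margins to probabilities of the positive class."""
    return 1.0 / (1.0 + np.exp(-x))


def accuracy_from_raw(preds_raw: np.ndarray, y_true: np.ndarray, threshold: float = 0.5):
    """Hypothetical binary-only stand-in for lgbm_accuracy."""
    proba = sigmoid(preds_raw)                  # raw margin -> probability
    preds = (proba > threshold).astype(int)     # probability -> class label
    score = float(np.mean(preds == y_true))     # share of correct labels
    return ("accuracy", score, True)            # (name, value, is_higher_better)


raw = np.array([-2.0, 0.5, 3.0, -0.1])
labels = np.array([0, 1, 1, 1])
print(accuracy_from_raw(raw, labels))  # ('accuracy', 0.75, True)
```

Only the last sample is misclassified: its margin of -0.1 maps to a probability just below 0.5 while its label is 1.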

lgbm_f1

lgbm_f1(preds_raw, lgbDataset)

Implementation of the F1 score to be used as an evaluation score for LightGBM; see the feval documentation (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html). The adaptation is required because, when using custom losses, the raw prediction needs to be passed through a sigmoid to represent a probability.

Parameters:

Name Type Description Default
preds_raw ndarray

raw predictions

required
lgbDataset lightgbm.Dataset

dataset, containing labels, used for prediction

required

Returns:

Name Type Description
result tuple

tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)

Source code in src/utils.py
def lgbm_f1(self, preds_raw: ndarray, lgbDataset: Dataset):
    """Implementation of the F1 score to be used as an evaluation score for LightGBM;
    see the feval [documentation](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html).
    The adaptation is required because, when using custom losses,
    the raw prediction needs to be passed through a sigmoid to represent a
    probability.

    Args:
        preds_raw (ndarray): raw predictions
        lgbDataset (lightgbm.Dataset): dataset, containing labels, used for prediction

    Returns:
        result (tuple): tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)
    """
    y_true, preds = self._prediction(preds_raw=preds_raw, lgbDataset=lgbDataset)
    result = ("f1", f1_score(y_true, preds, average="weighted"), True)
    return result

lgbm_focal_loss

lgbm_focal_loss(preds_raw, lgbDataset, alpha, gamma)

Adaptation of the Focal Loss for LightGBM to be used as a training loss. See the original paper (https://arxiv.org/pdf/1708.02002.pdf) and the custom training loss documentation (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html).

Parameters:

Name Type Description Default
preds_raw ndarray

array with the raw predictions

required
lgbDataset Dataset

training dataset

required
alpha float

class-balancing weight of the focal loss

required
gamma float

focusing parameter of the focal loss

required

Returns:

Name Type Description
grad ndarray

The value of the first order derivative (gradient) of the loss with respect to the elements of preds_raw for each sample point.

hess ndarray

The value of the second order derivative (Hessian) of the loss with respect to the elements of preds_raw for each sample point.

Source code in src/utils.py
def lgbm_focal_loss(self, preds_raw: ndarray, lgbDataset: Dataset, alpha: float, gamma: float):
    """Adaptation of the Focal Loss for LightGBM to be used as a training loss.
    See the original paper:
    * https://arxiv.org/pdf/1708.02002.pdf
    and the custom training loss documentation:
    * https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html

    Args:
        preds_raw (ndarray): array with the raw predictions
        lgbDataset (Dataset): training dataset
        alpha (float): class-balancing weight of the focal loss
        gamma (float): focusing parameter of the focal loss

    Returns:
        grad (ndarray): The value of the first order derivative (gradient) of the loss with
            respect to the elements of preds_raw for each sample point.
        hess (ndarray): The value of the second order derivative (Hessian) of the loss with
            respect to the elements of preds_raw for each sample point.
    """
    y_true = lgbDataset.label
    # N observations x num_class arrays
    if self.n_class > 2:
        y_true = np.eye(self.n_class)[y_true.astype("int")]
        y_pred = preds_raw.reshape(-1, self.n_class, order="F")
    else:
        y_pred = preds_raw  # keep the raw margins as floats

    partial_fl = lambda x: self._focal_loss(x, y_true, alpha, gamma)
    grad = derivative(partial_fl, y_pred, n=1, dx=1e-6)
    hess = derivative(partial_fl, y_pred, n=2, dx=1e-6)
    if self.n_class > 2:
        return grad.flatten("F"), hess.flatten("F")
    else:
        return grad, hess
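
The gradient and Hessian above are obtained numerically rather than from a closed form. The sketch below shows the same idea for the binary case with plain central differences instead of scipy.misc.derivative (which, to the best of current knowledge, was deprecated in SciPy 1.10 and removed in 1.12). The `_focal_loss` helper is not shown on this page, so `binary_focal_loss` is an assumed alpha-balanced form taken from the paper, and the step size `h` is likewise an illustrative choice.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def binary_focal_loss(x, y, alpha, gamma):
    """Element-wise binary focal loss on raw margins x (assumed form of _focal_loss)."""
    p = sigmoid(x)
    return -(alpha * y * (1 - p) ** gamma * np.log(p)
             + (1 - alpha) * (1 - y) * p ** gamma * np.log(1 - p))


def grad_hess(x, y, alpha, gamma, h=1e-3):
    """Central finite differences for the first and second derivative w.r.t. x."""
    f_plus = binary_focal_loss(x + h, y, alpha, gamma)
    f_zero = binary_focal_loss(x, y, alpha, gamma)
    f_minus = binary_focal_loss(x - h, y, alpha, gamma)
    grad = (f_plus - f_minus) / (2 * h)
    hess = (f_plus - 2 * f_zero + f_minus) / h**2
    return grad, hess


x = np.array([-1.5, 0.0, 2.0])   # raw margins
y = np.array([0.0, 1.0, 1.0])    # binary labels
grad, hess = grad_hess(x, y, alpha=0.5, gamma=0.0)

# With gamma = 0 and alpha = 0.5 the loss is half the binary cross-entropy,
# so the analytic derivatives are 0.5 * (p - y) and 0.5 * p * (1 - p).
p = sigmoid(x)
print(np.allclose(grad, 0.5 * (p - y), atol=1e-5))
print(np.allclose(hess, 0.5 * p * (1 - p), atol=1e-5))
```

A very small step such as 1e-6 makes the second difference sensitive to floating-point rounding, because the numerator is divided by the squared step; a moderately larger step is used here for that reason.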

lgbm_focal_loss_eval

lgbm_focal_loss_eval(preds_raw, lgbDataset, alpha, gamma)

Adaptation of the Focal Loss for LightGBM to be used as an evaluation loss. See the original paper: https://arxiv.org/pdf/1708.02002.pdf

Parameters:

Name Type Description Default
preds_raw ndarray

array with the raw predictions

required
lgbDataset Dataset

dataset containing the labels

required
alpha float

class-balancing weight of the focal loss

required
gamma float

focusing parameter of the focal loss

required

Returns:

Name Type Description
result tuple

tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)

Source code in src/utils.py
def lgbm_focal_loss_eval(self, preds_raw: ndarray, lgbDataset: Dataset, alpha: float, gamma: float):
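    """Adaptation of the Focal Loss for LightGBM to be used as an evaluation loss.
    See the original paper: https://arxiv.org/pdf/1708.02002.pdf

    Args:
        preds_raw (ndarray): array with the raw predictions
        lgbDataset (Dataset): dataset containing the labels
        alpha (float): class-balancing weight of the focal loss
        gamma (float): focusing parameter of the focal loss

    Returns:
        result (tuple): tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)
    """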
    y_true = lgbDataset.label
    # N observations x num_class arrays
    if self.n_class > 2:
        y_true = np.eye(self.n_class)[y_true.astype("int")]
        y_pred = preds_raw.reshape(-1, self.n_class, order="F")
    else:
        y_pred = preds_raw

    loss = self._focal_loss(y_pred, y_true, alpha, gamma)
    result = ("focal_loss", np.mean(loss), False)
    return result
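
The multiclass branch relies on a particular memory layout. The sketch below assumes, as the order="F" reshape in the code implies, that the flat raw predictions are grouped by class (all samples for class 0, then class 1, and so on). Newer LightGBM versions may already pass a 2-D array, in which case the reshape would be unnecessary. The numbers are made up for illustration.

```python
import numpy as np

n_class = 3
# Flat scores for 2 samples, grouped by class: [s0c0, s1c0, s0c1, s1c1, s0c2, s1c2]
preds_raw = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7])
labels = np.array([2, 0])

y_pred = preds_raw.reshape(-1, n_class, order="F")   # rows = samples, columns = classes
y_true = np.eye(n_class)[labels.astype("int")]       # one-hot encoded labels

print(y_pred)   # [[0.1 0.2 0.3], [0.9 0.8 0.7]]
print(y_true)   # [[0. 0. 1.], [1. 0. 0.]]
print(np.array_equal(y_pred.flatten("F"), preds_raw))  # True
```

Flattening with order "F" restores the original class-grouped layout, which is why the training loss flattens grad and hess the same way before returning them.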

lgbm_precision

lgbm_precision(preds_raw, lgbDataset)

Implementation of the precision score to be used as an evaluation score for LightGBM. The adaptation is required because, when using custom losses, the raw prediction needs to be passed through a sigmoid to represent a probability.

Parameters:

Name Type Description Default
preds_raw ndarray

raw predictions

required
lgbDataset lightgbm.Dataset

dataset, containing labels, used for prediction

required

Returns:

Name Type Description
result tuple

tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)

Source code in src/utils.py
def lgbm_precision(self, preds_raw: ndarray, lgbDataset: Dataset):
    """Implementation of the precision score to be used as an evaluation
    score for LightGBM. The adaptation is required because, when using custom losses,
    the raw prediction needs to be passed through a sigmoid to represent a
    probability.

    Args:
        preds_raw (ndarray): raw predictions
        lgbDataset (lightgbm.Dataset): dataset, containing labels, used for prediction

    Returns:
        result (tuple): tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)
    """
    y_true, preds = self._prediction(preds_raw=preds_raw, lgbDataset=lgbDataset)
    result = ("precision", precision_score(y_true, preds, average="weighted"), True)
    return result

lgbm_recall

lgbm_recall(preds_raw, lgbDataset)

Implementation of the recall score to be used as an evaluation score for LightGBM. The adaptation is required because, when using custom losses, the raw prediction needs to be passed through a sigmoid to represent a probability.

Parameters:

Name Type Description Default
preds_raw ndarray

raw predictions

required
lgbDataset lightgbm.Dataset

dataset, containing labels, used for prediction

required

Returns:

Name Type Description
result tuple

tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)

Source code in src/utils.py
def lgbm_recall(self, preds_raw: ndarray, lgbDataset: Dataset):
    """Implementation of the recall score to be used as an evaluation
    score for LightGBM. The adaptation is required because, when using custom losses,
    the raw prediction needs to be passed through a sigmoid to represent a
    probability.

    Args:
        preds_raw (ndarray): raw predictions
        lgbDataset (lightgbm.Dataset): dataset, containing labels, used for prediction

    Returns:
        result (tuple): tuple containing name of the score, its value and bool value for LightGBM (is_higher_better)
    """
    y_true, preds = self._prediction(preds_raw=preds_raw, lgbDataset=lgbDataset)
    result = ("recall", recall_score(y_true, preds, average="weighted"), True)
    return result

dl_design

dl_design(input_layer, n_hidden_layers, output_layer, design='funnel')

Class with predefined deep learning hidden layer architectures. Especially useful during hyperparameter tuning with Weights & Biases and Ray Tune to track the effect of the architecture design on metrics. Predefined architecture designs are: ["funnel", "pipe", "anti_autoencoder", "trapezoid", "anti_trapezoid", "adj_funnel", "apollo"].

Parameters:

Name Type Description Default
input_layer int

size of input layer

required
n_hidden_layers int

number of hidden layers

required
output_layer int

size of output layer

required
design str

type of design

'funnel'

Returns:

Name Type Description
hidden_layers list

list of hidden layers

Source code in src/utils.py
def __init__(
    self,
    input_layer: int,
    n_hidden_layers: int,
    output_layer: int,
    design: Literal[
        "funnel",
        "pipe",
        "anti_autoencoder",
        "trapezoid",
        "anti_trapezoid",
        "adj_funnel",
        "apollo",
    ] = "funnel",
):
    self.design = design
    self.input_layer = input_layer
    self.n_hidden_layers = n_hidden_layers
    self.output_layer = output_layer

dl_train_prep

dl_train_prep(data_train, data_valid, identifier, cont_cols, target_col)

Aggregator method to prepare the data for deep models trained with the pytorch-widedeep library. Disclaimer: this method uses the latest, not yet merged, additions to the pytorch_widedeep library.

Parameters:

Name Type Description Default
identifier str

identifier column

required
data_train DataFrame

training dataset

required
data_valid DataFrame

validation dataset

required
cont_cols list

list of continuous feature columns in the dataset

required
target_col str

column with predicted value

required

Returns:

Name Type Description
X_train dict

training dataset dictionary

X_valid dict

validation dataset dictionary

tab_preprocessor TabPreprocessor

deep tabular dataset preprocessor

Source code in src/utils.py
def dl_train_prep(
    data_train: DataFrame,
    data_valid: DataFrame,
    identifier: str,
    cont_cols: list,
    target_col: str,
):
    """Aggregator method to prepare the data for deep models trained with the pytorch-widedeep library.
    Disclaimer: this method uses the latest, not yet merged, additions to the pytorch_widedeep library.

    Args:
        identifier (str): identifier column
        data_train (DataFrame): training dataset
        data_valid (DataFrame): validation dataset
        cont_cols (list): list of continuous feature columns in the dataset
        target_col (str): column with predicted value

    Returns:
        X_train (dict): training dataset dictionary
        X_valid (dict): validation dataset dictionary
        tab_preprocessor (TabPreprocessor): deep tabular dataset preprocessor
    """
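    # NOTE: embedding_rule and cat_cols are not parameters of this function;
    # they must already be defined in the enclosing module scope.
    tab_preprocessor = TabPreprocessor(
        embedding_rule=embedding_rule,
        embed_cols=cat_cols,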
        continuous_cols=cont_cols,
        shared_embed=False,
        scale=False,
    )

    X_tab_train = tab_preprocessor.fit_transform(data_train.drop(columns=[identifier]))
    X_tab_valid = tab_preprocessor.transform(data_valid.drop(columns=[identifier]))

    Y_train = data_train[target_col].values
    Y_valid = data_valid[target_col].values

    X_train = {"X_tab": X_tab_train, "target": Y_train}
    X_valid = {"X_tab": X_tab_valid, "target": Y_valid}

    return X_train, X_valid, tab_preprocessor

dl_metrics

dl_metrics(n_classes=None)

Auxiliary method to define the metrics tracked during training of deep learning models.

Parameters:

Name Type Description Default
n_classes int

number of classes in case of tasks ['binary', 'multiclass']

None

Returns:

Name Type Description
metrics_list list

list of metrics tracked during training of deep learning model

Source code in src/utils.py
def dl_metrics(
    n_classes: Union[int, None] = None,
):
    """Auxiliary method to define the metrics tracked during training of deep learning models.

    Args:
        n_classes (int): number of classes in case of tasks ['binary', 'multiclass']

    Returns:
        metrics_list (list): list of metrics tracked during training of deep learning model
    """
    accuracy = Accuracy(average=None, num_classes=n_classes)
    precision = Precision(average="micro", num_classes=n_classes)
    f1 = F1Score(average=None, num_classes=n_classes)
    recall = Recall(average=None, num_classes=n_classes)

    metrics_list = [accuracy, precision, f1, recall]
    return metrics_list

dl_predict

dl_predict(data, model, tab_preprocessor, wide_preprocessor=None)

Aggregator method to predict the target value from a pandas DataFrame using a pretrained deep learning model.

Parameters:

Name Type Description Default
data DataFrame

dataset to predict on

required
model WideDeep

pretrained model

required
tab_preprocessor TabPreprocessor

deep tabular dataset preprocessor

required
wide_preprocessor WidePreprocessor

wide tabular dataset preprocessor

None

Returns:

Name Type Description
preds ndarray

predictions

Source code in src/utils.py
def dl_predict(
    data: DataFrame,
    model: WideDeep,
    tab_preprocessor: TabPreprocessor,
    wide_preprocessor: Union[WidePreprocessor, None] = None,
):
    """Aggregator method to predict the target value from a pandas DataFrame using a pretrained deep learning model.

    Args:
        data (DataFrame): dataset to predict on
        model (WideDeep): pretrained model
        tab_preprocessor (TabPreprocessor): deep tabular dataset preprocessor
        wide_preprocessor (WidePreprocessor): wide tabular dataset preprocessor

    Returns:
        preds (ndarray): predictions
    """
    if wide_preprocessor:
        X_wide = wide_preprocessor.transform(data)
    else:
        X_wide = None
    X_tab = tab_preprocessor.transform(data)
    preds = model.predict(X_wide=X_wide, X_tab=X_tab)
    return preds