Utils
Utility functions
scaler_mapper
scaler_mapper(cont_cols, target_col, identifier, scaler_mapper_def=None)
Maps scaler functions to the appropriate columns. By default, a scaler is assigned to the continuous feature columns. This behavior can be changed with scaler_mapper_def. Only columns defined in the mapper object are present in the transformed dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cont_cols | list | list of continuous feature columns in the dataset | required |
| target_col | str | target column | required |
| identifier | str | identifier column | required |
| scaler_mapper_def | dict | optional dictionary that contains keys ['cont_cols', 'target_col', 'identifier_col'] with their corresponding scaler functions from the sklearn library | None |
Returns:
| Name | Type | Description |
|---|---|---|
| scaler_mapper | DataFrameMapper | scaler object mapping sklearn scalers to columns in a pandas DataFrame |
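The default-and-override behavior described above can be sketched as follows. This is a minimal illustration, not the code in src/utils.py: `build_feature_defs`, its `default_scaler` argument, and the assumption that the target and identifier columns are passed through unscaled by default are all hypothetical. The returned list has the (columns, transformer) shape that sklearn_pandas.DataFrameMapper accepts, where `None` keeps a column unchanged. The string placeholders stand in for sklearn scaler objects such as `StandardScaler()`.

```python
def build_feature_defs(cont_cols, target_col, identifier, default_scaler,
                       scaler_mapper_def=None):
    """Return (columns, transformer) pairs; None means "keep the column as-is"."""
    # Assumed default: scale continuous features only, pass the rest through.
    mapping = {
        "cont_cols": default_scaler,
        "target_col": None,
        "identifier_col": None,
    }
    # Entries in scaler_mapper_def override the defaults key by key.
    if scaler_mapper_def is not None:
        mapping.update(scaler_mapper_def)
    return [
        (list(cont_cols), mapping["cont_cols"]),
        ([target_col], mapping["target_col"]),
        ([identifier], mapping["identifier_col"]),
    ]


# Placeholder strings are used instead of real scaler objects to keep the
# sketch dependency-free.
default_defs = build_feature_defs(["age", "income"], "label", "id", "standard")
custom_defs = build_feature_defs(
    ["age", "income"], "label", "id", "standard",
    scaler_mapper_def={"target_col": "minmax"},
)
```

Columns that do not appear in the returned list would be dropped by the mapper, which matches the note that only columns defined in the mapper object survive the transformation.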
Source code in src/utils.py, lines 17 to 53
optimize_df
optimize_df(df, identifier, verbose=True)
Simple function that assigns appropriate data types to the columns of a pandas DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | dataset | required |
| identifier | str | identifier column | required |
| cat_cols | list | list of categorical feature columns in the dataset | required |
| verbose | boolean | option to show reduced memory usage | True |
Returns:
| Name | Type | Description |
|---|---|---|
| data | DataFrame | optimized dataset |
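A common way to implement this kind of dtype optimization is to downcast numeric columns and convert text columns to categoricals. The sketch below assumes that approach. `downcast_frame` is a hypothetical name, and the rule that the identifier column is left untouched is an assumption, not something confirmed by src/utils.py.

```python
import pandas as pd
from pandas.api import types as ptypes


def downcast_frame(df: pd.DataFrame, identifier: str, verbose: bool = True) -> pd.DataFrame:
    """Return a copy of df with smaller numeric dtypes and categorical text columns."""
    out = df.copy()
    before = out.memory_usage(deep=True).sum()
    for col in out.columns:
        if col == identifier:
            continue  # assumption: the identifier column keeps its dtype
        if ptypes.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif ptypes.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif ptypes.is_object_dtype(out[col]) or ptypes.is_string_dtype(out[col]):
            out[col] = out[col].astype("category")
    if verbose:
        after = out.memory_usage(deep=True).sum()
        print(f"memory usage: {before} -> {after} bytes")
    return out


example = pd.DataFrame(
    {
        "id": ["a1", "a2", "a3"],
        "age": [31, 45, 27],
        "score": [0.5, 1.5, 2.25],
        "city": ["Oslo", "Rome", "Oslo"],
    }
)
optimized = downcast_frame(example, identifier="id", verbose=False)
```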
Source code in src/utils.py, lines 56 to 73
LGBM_custom_score
LGBM_custom_score(n_class)
Class defining evaluation scores for the case where fobj, i.e. focal loss, is defined in LightGBM model training. From the LightGBM documentation: 'The predicted values. If fobj is specified, predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task in this case.'
Source code in src/utils.py, lines 83 to 84
lgbm_accuracy
lgbm_accuracy(preds_raw, lgbDataset)
Implementation of the accuracy score to be used as an evaluation score for LightGBM. The adaptation is required because, when custom losses are used, the raw prediction needs to be passed through a sigmoid to represent a probability.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| preds_raw | ndarray | predictions | required |
| lgbDataset | lightgbm.Dataset | dataset, containing labels, used for prediction | required |
Returns:
| Name | Type | Description |
|---|---|---|
| result | tuple | tuple containing the name of the score, its value, and a bool value for LightGBM (is_higher_better) |
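The sigmoid step and the returned tuple can be illustrated with a small NumPy sketch. It assumes a binary task and a 0.5 decision threshold. `LabelHolder` is a hypothetical stand-in for lightgbm.Dataset that only exposes `get_label()`, and `accuracy_eval` is an illustrative name, not the method in src/utils.py.

```python
import numpy as np


class LabelHolder:
    """Hypothetical stand-in for lightgbm.Dataset; only get_label() is needed."""

    def __init__(self, labels):
        self._labels = np.asarray(labels)

    def get_label(self):
        return self._labels


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def accuracy_eval(preds_raw, dataset, threshold=0.5):
    """Binary accuracy on raw margins, returned as (name, value, is_higher_better)."""
    labels = dataset.get_label()
    # Raw margins become probabilities first, then hard class predictions.
    preds = (sigmoid(np.asarray(preds_raw, dtype=float)) > threshold).astype(int)
    return "accuracy", float(np.mean(preds == labels)), True


name, value, higher_better = accuracy_eval(
    [2.0, -1.0, 0.5, -3.0], LabelHolder([1, 0, 0, 0])
)
```

In this toy input the third sample has a positive margin but a negative label, so three of the four predictions are correct.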
Source code in src/utils.py, lines 235 to 250
lgbm_f1
lgbm_f1(preds_raw, lgbDataset)
Implementation of the F1 score to be used as an evaluation score for LightGBM (see the feval documentation). The adaptation is required because, when custom losses are used, the raw prediction needs to be passed through a sigmoid to represent a probability.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| preds_raw | ndarray | predictions | required |
| lgbDataset | lightgbm.Dataset | dataset, containing labels, used for prediction | required |
Returns:
| Name | Type | Description |
|---|---|---|
| result | tuple | tuple containing the name of the score, its value, and a bool value for LightGBM (is_higher_better) |
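As a rough illustration for the binary case, the F1 score can be computed from raw margins by applying the sigmoid, thresholding, and counting true and false positives. `f1_from_raw` is a hypothetical helper that takes the labels directly instead of a lightgbm.Dataset. The 0.5 threshold and the convention of returning 0.0 when a denominator is zero are assumptions of this sketch.

```python
import numpy as np


def f1_from_raw(preds_raw, labels, threshold=0.5):
    """F1 on raw margins, returned as (name, value, is_higher_better)."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(preds_raw, dtype=float)))
    preds = probs > threshold
    truth = np.asarray(labels).astype(bool)
    tp = int(np.sum(preds & truth))   # predicted positive, actually positive
    fp = int(np.sum(preds & ~truth))  # predicted positive, actually negative
    fn = int(np.sum(~preds & truth))  # predicted negative, actually positive
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0:
        return "f1", 0.0, True
    return "f1", 2 * precision * recall / (precision + recall), True


score_name, score, is_higher_better = f1_from_raw([2.0, -1.0, 0.5, -3.0], [1, 0, 0, 0])
```

The intermediate `precision` and `recall` values are the same quantities that the precision and recall scores report on their own.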
Source code in src/utils.py, lines 183 to 199
lgbm_focal_loss
lgbm_focal_loss(preds_raw, lgbDataset, alpha, gamma)
Adaptation of the Focal Loss for LightGBM to be used as a training loss. See the original paper, https://arxiv.org/pdf/1708.02002.pdf, and the custom training loss documentation, https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| preds_raw | ndarray | array with the predictions | required |
| lgbDataset | Dataset | training dataset | required |
| alpha | float | focal loss weighting factor (alpha in the paper) | required |
| gamma | float | focal loss focusing parameter (gamma in the paper) | required |
Returns:
| Name | Type | Description |
|---|---|---|
| grad | float | The value of the first order derivative (gradient) of the loss with respect to the elements of preds for each sample point. |
| hess | float | The value of the second order derivative (Hessian) of the loss with respect to the elements of preds for each sample point. |
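For a binary task, the focal loss from the paper is FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), where p_t is the predicted probability of the true class. The sketch below approximates the gradient and Hessian with respect to the raw predictions by central finite differences. This is one possible approach and not necessarily the one used in src/utils.py. `focal_loss` and `focal_loss_grad_hess` are hypothetical names, the step size `h` is an assumption, and the labels are passed in directly rather than read from a lightgbm.Dataset.

```python
import numpy as np


def focal_loss(x, y, alpha, gamma):
    """Element-wise binary focal loss on raw margins x with labels y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-x))
    p_t = y * p + (1 - y) * (1 - p)              # probability of the true class
    alpha_t = y * alpha + (1 - y) * (1 - alpha)  # class weighting factor
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)


def focal_loss_grad_hess(preds_raw, labels, alpha, gamma, h=1e-4):
    """Approximate per-sample first and second derivatives w.r.t. preds_raw."""
    x = np.asarray(preds_raw, dtype=float)
    y = np.asarray(labels, dtype=float)
    f_plus = focal_loss(x + h, y, alpha, gamma)
    f_zero = focal_loss(x, y, alpha, gamma)
    f_minus = focal_loss(x - h, y, alpha, gamma)
    grad = (f_plus - f_minus) / (2 * h)            # central first difference
    hess = (f_plus - 2 * f_zero + f_minus) / h**2  # central second difference
    return grad, hess


raw = np.array([2.0, -1.0, 0.5, -3.0])
target = np.array([1.0, 0.0, 0.0, 0.0])
grad, hess = focal_loss_grad_hess(raw, target, alpha=0.25, gamma=2.0)
```

With gamma = 0 and alpha = 0.5 the loss reduces to half the binary cross-entropy, whose gradient is 0.5 (p - y) and whose Hessian is 0.5 p (1 - p). That special case gives a convenient sanity check for any implementation.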
Source code in src/utils.py, lines 124 to 157
lgbm_focal_loss_eval
lgbm_focal_loss_eval(preds_raw, lgbDataset, alpha, gamma)
Adaptation of the Focal Loss for LightGBM to be used as an evaluation loss. See the original paper: https://arxiv.org/pdf/1708.02002.pdf
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| preds_raw | ndarray | array with the predictions | required |
| lgbDataset | Dataset | training dataset | required |
| alpha | float | focal loss weighting factor (alpha in the paper) | required |
| gamma | float | focal loss focusing parameter (gamma in the paper) | required |
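No return value is documented for this function. The sketch below assumes it follows the same (name, value, is_higher_better) convention as the other evaluation scores and reports the mean focal loss, where lower is better. `focal_loss_eval` and the score name "focal_loss" are hypothetical, and the labels are passed in directly rather than read from a lightgbm.Dataset.

```python
import numpy as np


def focal_loss_eval(preds_raw, labels, alpha, gamma):
    """Mean binary focal loss on raw margins, in the feval tuple format."""
    x = np.asarray(preds_raw, dtype=float)
    y = np.asarray(labels, dtype=float)
    p = 1.0 / (1.0 + np.exp(-x))                 # sigmoid of the raw margin
    p_t = y * p + (1 - y) * (1 - p)              # probability of the true class
    alpha_t = y * alpha + (1 - y) * (1 - alpha)  # class weighting factor
    loss = -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
    # A loss should be minimized, so is_higher_better is False.
    return "focal_loss", float(np.mean(loss)), False


eval_name, eval_value, eval_higher_better = focal_loss_eval(
    [2.0, -1.0, 0.5, -3.0], [1, 0, 0, 0], alpha=0.25, gamma=2.0
)
```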
Source code in src/utils.py, lines 159 to 181
lgbm_precision
lgbm_precision(preds_raw, lgbDataset)
Implementation of the precision score to be used as an evaluation score for LightGBM. The adaptation is required because, when custom losses are used, the raw prediction needs to be passed through a sigmoid to represent a probability.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| preds_raw | ndarray | predictions | required |
| lgbDataset | lightgbm.Dataset | dataset, containing labels, used for prediction | required |
Returns:
| Name | Type | Description |
|---|---|---|
| result | tuple | tuple containing the name of the score, its value, and a bool value for LightGBM (is_higher_better) |
Source code in src/utils.py, lines 201 to 216
lgbm_recall
lgbm_recall(preds_raw, lgbDataset)
Implementation of the recall score to be used as an evaluation score for LightGBM. The adaptation is required because, when custom losses are used, the raw prediction needs to be passed through a sigmoid to represent a probability.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| preds_raw | ndarray | predictions | required |
| lgbDataset | lightgbm.Dataset | dataset, containing labels, used for prediction | required |
Returns:
| Name | Type | Description |
|---|---|---|
| result | tuple | tuple containing the name of the score, its value, and a bool value for LightGBM (is_higher_better) |
Source code in src/utils.py, lines 218 to 233
dl_design
dl_design(input_layer, n_hidden_layers, output_layer, design='funnel')
Class with predefined deep learning hidden layer architectures. Especially useful during hyperparameter tuning with Weights & Biases and Ray Tune to track the effect of the architecture design on metrics. Predefined architecture designs are: ["funnel", "pipe", "anti_autoencoder", "trapezoid", "anti_trapezoid", "adj_funnel", "apollo"].
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_layer | int | size of input layer | required |
| n_hidden_layers | int | number of hidden layers | required |
| output_layer | int | size of output layer | required |
| design | str | type of design | 'funnel' |
Returns:
| Name | Type | Description |
|---|---|---|
| hidden_layers | list | list of hidden layers |
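The exact layer sizes behind each design name are not given here, so the following is only a guess at what two of them could look like. It assumes "funnel" means hidden sizes that shrink linearly from the input size toward the output size and "pipe" means a constant width equal to the input size. `hidden_layer_sizes` is a hypothetical function, and the remaining designs are deliberately left out.

```python
def hidden_layer_sizes(input_layer, n_hidden_layers, output_layer, design="funnel"):
    """Return a list with one layer width per hidden layer."""
    if design == "pipe":
        # Assumed meaning: every hidden layer is as wide as the input layer.
        return [input_layer] * n_hidden_layers
    if design == "funnel":
        # Assumed meaning: widths interpolate linearly between input and output.
        step = (output_layer - input_layer) / (n_hidden_layers + 1)
        return [round(input_layer + step * i) for i in range(1, n_hidden_layers + 1)]
    raise ValueError(f"design not covered by this sketch: {design!r}")


funnel = hidden_layer_sizes(100, 3, 20, design="funnel")
pipe = hidden_layer_sizes(100, 3, 20, design="pipe")
```

Returning a plain list keeps the result easy to log as a hyperparameter during tuning runs.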
Source code in src/utils.py, lines 269 to 287
dl_train_prep
dl_train_prep(data_train, data_valid, identifier, cont_cols, target_col)
Aggregator method to prepare the data for deep models trained with the pytorch-widedeep library. DISCLAIMER: this method uses the latest additions to the pytorch_widedeep library, which are not yet merged.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| identifier | str | identifier column | required |
| data_train | DataFrame | training dataset | required |
| data_valid | DataFrame | validation dataset | required |
| cont_cols | list | list of continuous feature columns in the dataset | required |
| target_col | str | column with predicted value | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X_train | dict | training dataset dictionary |
| X_valid | dict | validation dataset dictionary |
| tab_preprocessor | TabPreprocessor | deep tabular dataset preprocessor |
Source code in src/utils.py, lines 340 to 380
dl_metrics
dl_metrics(n_classes=None)
Auxiliary method to define the metrics tracked during training of deep learning models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_classes | int | number of classes in case of tasks ['binary', 'multiclass'] | None |
Returns:
| Name | Type | Description |
|---|---|---|
| metrics_list | list | list of metrics tracked during training of deep learning model |
Source code in src/utils.py, lines 383 to 400
dl_predict
dl_predict(data, model, tab_preprocessor, wide_preprocessor=None)
Aggregator method to predict the target value from a pandas DataFrame using a pretrained deep learning model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | WideDeep | pretrained model | required |
| tab_preprocessor | TabPreprocessor | deep tabular dataset preprocessor | required |
| wide_preprocessor | WidePreprocessor | wide tabular dataset preprocessor | None |
Returns:
| Name | Type | Description |
|---|---|---|
preds |
ndarray
|
predictions |
Source code in src/utils.py, lines 403 to 425