1 Introduction¶

1.1 Paintshop¶

1.1.1 Current status¶

types of reports/datasets available
- Weekly – every Sunday
- On demand – every time you need data
requested months
- less faulty month
  - August 2019 - there were only 5 faults for Av. General Pendular according to thetab Diario_Binarias_Sum(006.TTS_T4_Diario_Averías.xlsx)
- most faulty month
  - August 2018 - all pendulars had faults except Av. Pendular Motriz 4(no faults in the sheet) according to the tab Diario_Binarias_Sum(006.TTS_T4_Diario_Averías.xlsx)

1.1.2 Used data in this report¶

Fireman_OnDemand_v1.4_082019.zip sent on 15/12/2020 by Elma Sanz(elma.sanz@seat.es)

1.1.3 Opened issues¶

missing 4 features in 2018 ['CONTADOR CARROCERIAS SALIDA PENDULAR TTS','NUMERO PENDULAR EN FALLO', 'VALOR TENSORA LADO DERECHO', 'VALOR TENSORA LADO IZQUIERDO']
inclination measuremnts not available:
1. provide the text messageges from the system + values above threshold that are stored
2. lower the threshold for measurement storing
3. install new or different sensors
explanation of BiDrac global features

1.2 Spindle¶

1.2.1 Current status¶

contact person/company - Manuel Suárez, Tyris (through Raul Palacios Trujillo, Seat)
types of datasets with TimeStamp intervals
- Transientdata 0.5s
- Rawdata 10s
number of features in final datasets
- Transientdata - as Transientdata 39
- Rawdata - as Rawdata 344
description of the features(TagIDs) was provided by Manuel Suarez TagId_Description.xlsx

1.2.2 Used data in this report¶

most recent and smallest in size spindle dataset available on LUT sharepoint wetransfer-8f07cc.zip

1.2.3 Opened issues¶

Rawdata 344, Transientdata 39 is not available yet (waiting for Tyris)
[ONHOLD] no faults were yet detected
[TENTATIVELY CLOSED] merge of multiple sql tables storing the data - not possible due to type of integration should be handled in post-processing by Tyris in future according to Manuel Suárez
[TENTATIVELY CLOSED]adjust the sql tables so the data is stored in columns, ie. 1 column per feature, and assign NA in case the values are currently not available - not possible due to type of integration, but this should be handled in post-processing by Tyris in future according to Manuel Suárez

In [134]:

            
                Copied!
                
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.axes_grid1 import make_axes_locatable
pd.options.display.max_rows = 4000
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.axes_grid1 import make_axes_locatable
pd.options.display.max_rows = 4000

2 Analysis¶

2.1 Paintshop¶

2.1.1 Features¶

read Values.csv
create datetime type column
clean dataframe
set index frequency to 60s to find missing values
provide initial description of the feature columns

In [96]:

            
                Copied!
                
dir_loc = '#datasets/paintshop/Fireman_OnDemand_v1.4_082019/'
Values_loc = dir_loc + 'Values.csv'
WarningsFaults_loc = dir_loc + 'WarningsFaults.csv'

values_df = pd.read_csv(Values_loc)
values_df['TS'] = pd.to_datetime(values_df['Fecha'] + ' ' + values_df['Hora Completa 24'], format="%d/%m/%y %H:%M:%S")
values_df.drop(columns=['Fecha', 'Hora Completa 24', 'Descripción Proceso - Señal BiDrac', 'Valor - Global'], inplace=True )

values_df = values_df.pivot(index='TS', columns='Descripción Proceso - Señal PMC', values='Value')
values_df = values_df.asfreq('60s')
dir_loc = '#datasets/paintshop/Fireman_OnDemand_v1.4_082019/'
Values_loc = dir_loc + 'Values.csv'
WarningsFaults_loc = dir_loc + 'WarningsFaults.csv'

values_df = pd.read_csv(Values_loc)
values_df['TS'] = pd.to_datetime(values_df['Fecha'] + ' ' + values_df['Hora Completa 24'], format="%d/%m/%y %H:%M:%S")
values_df.drop(columns=['Fecha', 'Hora Completa 24', 'Descripción Proceso - Señal BiDrac', 'Valor - Global'], inplace=True )

values_df = values_df.pivot(index='TS', columns='Descripción Proceso - Señal PMC', values='Value')
values_df = values_df.asfreq('60s')

/home/palo/miniconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3145: DtypeWarning: Columns (5) have mixed types.Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,

In [97]:

            
                Copied!
                
values_desc = values_df.describe()
values_desc.loc['NA[%]'] = values_df.isnull().sum()/len(values_df)
values_desc
values_desc = values_df.describe()
values_desc.loc['NA[%]'] = values_df.isnull().sum()/len(values_df)
values_desc

Out[97]:

Descripción Proceso - Señal PMC	CARGA PENDULAR UNIDADES	CONTADOR KW/H	CONTADOR CARROCERIAS SALIDA PENDULAR TTS	E-P01 CONSUMO ELECTRICO (KW/H)	EFICIENCIA Y DIA	ENTRADA CORTADA /PROCESO	FALTA CARROCERIA ENTRADA	NUMERO CARROCERIAS PRODUCIDAS	NUMERO PENDULAR EN FALLO	POR PARO TRANSPORTE	SUMA DE PAUSAS	TOTAL ASUMIDOS	VALOR TENSORA LADO DERECHO	VALOR TENSORA LADO IZQUIERDO	VELOCIDAD PENDULAR M/MIN
count	33542.000000	33617.000000	33541.000000	33557.000000	33542.000000	33542.000000	33542.000000	33542.000000	33541.000000	33542.000000	33542.000000	33542.000000	33541.000000	33541.000000	33542.000000
mean	16.891450	31.570634	1.190185	132.695773	521.225747	9.290203	122.702164	290.771331	0.035300	52.308091	0.000000	122.702164	420.930980	430.783459	0.000000
std	11.620719	25.664578	1.141323	234.978251	254.035586	24.261606	83.939681	174.341474	1.771892	127.413034	0.000000	83.939681	3.537655	5.133577	0.000000
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	411.000000	418.000000	0.000000
25%	6.000000	10.000000	0.000000	2.280000	452.000000	0.000000	37.000000	147.000000	0.000000	1.000000	0.000000	37.000000	418.000000	427.000000	0.000000
50%	18.000000	30.000000	1.000000	3.600000	547.000000	0.000000	140.000000	340.000000	0.000000	4.000000	0.000000	140.000000	421.000000	431.000000	0.000000
75%	25.750000	50.000000	2.000000	170.000000	641.000000	5.000000	200.000000	380.000000	0.000000	11.000000	0.000000	200.000000	424.000000	433.000000	0.000000
max	34.000000	100.000000	5.000000	990.000000	997.000000	125.000000	348.000000	601.000000	126.000000	480.000000	0.000000	348.000000	464.000000	452.000000	0.000000
NA[%]	0.248611	0.246931	0.248634	0.248275	0.248611	0.248611	0.248611	0.248611	0.248634	0.248611	0.248611	0.248611	0.248634	0.248634	0.248611

2.1.2 Labels¶

read dataset to pandas dataframe
check fraction of NA values and drop the NA rows
create datetime type column of start/end of faults with rounded time to 1 minute intervals
check types of faults
go through all faults and assign 1 to label column in each row where the fault was active
show and save final dataset

In [108]:

            
                Copied!
                
WarningsFaults_df = pd.read_csv(WarningsFaults_loc)
WarningsFaults_df.isnull().sum()/len(WarningsFaults_df)
WarningsFaults_df = pd.read_csv(WarningsFaults_loc)
WarningsFaults_df.isnull().sum()/len(WarningsFaults_df)

Out[108]:

Id Proceso - Aviso                 0.000000
Descripción Proceso - Aviso PMC    0.000000
Clasificación BiDrac-Aviso         0.000000
Clasificación Proceso - Aviso      0.000000
Fecha Inicio - Fecha Llegada       0.000000
Hora Inicio - Hora Llegada         0.000000
Fecha Fin - Fecha Salida           0.101512
Hora fin - Hora Salida             0.000000
Duración Neto - Avisos             0.000000
dtype: float64

In [110]:

            
                Copied!
                
WarningsFaults_df.dropna(inplace=True)
# time is rounded to minutes for easier merging with Values
WarningsFaults_df['TS Inicio'] = pd.to_datetime(WarningsFaults_df['Fecha Inicio - Fecha Llegada'] + ' ' + WarningsFaults_df['Hora Inicio - Hora Llegada'], format="%d/%m/%y %H:%M:%S").round('min')
WarningsFaults_df['TS Fin'] = pd.to_datetime(WarningsFaults_df['Fecha Fin - Fecha Salida'] + ' ' + WarningsFaults_df['Hora fin - Hora Salida'], format="%d/%m/%y %H:%M:%S").round('min')
WarningsFaults_df.drop(columns=['Fecha Inicio - Fecha Llegada', 'Hora Inicio - Hora Llegada', 'Fecha Fin - Fecha Salida', 'Hora fin - Hora Salida', 'Duración Neto - Avisos'], inplace=True )
WarningsFaults_df.dropna(inplace=True)
# time is rounded to minutes for easier merging with Values
WarningsFaults_df['TS Inicio'] = pd.to_datetime(WarningsFaults_df['Fecha Inicio - Fecha Llegada'] + ' ' + WarningsFaults_df['Hora Inicio - Hora Llegada'], format="%d/%m/%y %H:%M:%S").round('min')
WarningsFaults_df['TS Fin'] = pd.to_datetime(WarningsFaults_df['Fecha Fin - Fecha Salida'] + ' ' + WarningsFaults_df['Hora fin - Hora Salida'], format="%d/%m/%y %H:%M:%S").round('min')
WarningsFaults_df.drop(columns=['Fecha Inicio - Fecha Llegada', 'Hora Inicio - Hora Llegada', 'Fecha Fin - Fecha Salida', 'Hora fin - Hora Salida', 'Duración Neto - Avisos'], inplace=True )

In [111]:

            
                Copied!
                
WarningsFaults_df.nunique()
WarningsFaults_df.nunique()

Out[111]:

Id Proceso - Aviso                   7
Descripción Proceso - Aviso PMC      7
Clasificación BiDrac-Aviso           4
Clasificación Proceso - Aviso        2
TS Inicio                          231
TS Fin                             259
dtype: int64

In [112]:

            
                Copied!
                
WarningsFaults_df['Descripción Proceso - Aviso PMC'].unique()
WarningsFaults_df['Descripción Proceso - Aviso PMC'].unique()

Out[112]:

array(['ENTRADA CORTADA', 'TTS NO PRODUCE  PRODUCCION',
       'PENDULAR 11.3.2.0 AVERIA GENERAL',
       'DEFECTO INCLINACION  ZONA CAJA 4',
       'DEFECTO INCLINACION  ZONA CAJA 8',
       'DEFECTO INCLINACION PENDULOS TTS',
       'DEFECTO INCLINACION  ZONA CAJA 7'], dtype=object)

In [114]:

            
                Copied!
                
                    
                    
                
                

        
values_df[WarningsFaults_df['Descripción Proceso - Aviso PMC'].unique()] = 0
for i in range(len(WarningsFaults_df)):
    start = WarningsFaults_df.iloc[i]['TS Inicio']
    end = WarningsFaults_df.iloc[i]['TS Fin']
    col_name = WarningsFaults_df.iloc[i]['Descripción Proceso - Aviso PMC']
    values_df.loc[start:end, col_name] = 1
values_df[WarningsFaults_df['Descripción Proceso - Aviso PMC'].unique()] = 0
for i in range(len(WarningsFaults_df)):
    start = WarningsFaults_df.iloc[i]['TS Inicio']
    end = WarningsFaults_df.iloc[i]['TS Fin']
    col_name = WarningsFaults_df.iloc[i]['Descripción Proceso - Aviso PMC']
    values_df.loc[start:end, col_name] = 1

In [115]:

            
                Copied!
                
values_df.to_csv(dir_loc + 'dataset.csv')
values_df.head()
values_df.to_csv(dir_loc + 'dataset.csv')
values_df.head()

Out[115]:

Descripción Proceso - Señal PMC	CARGA PENDULAR UNIDADES	CONTADOR KW/H	CONTADOR CARROCERIAS SALIDA PENDULAR TTS	E-P01 CONSUMO ELECTRICO (KW/H)	EFICIENCIA Y DIA	ENTRADA CORTADA /PROCESO	FALTA CARROCERIA ENTRADA	NUMERO CARROCERIAS PRODUCIDAS	NUMERO PENDULAR EN FALLO	POR PARO TRANSPORTE	SUMA DE PAUSAS	TOTAL ASUMIDOS	VALOR TENSORA LADO DERECHO	VALOR TENSORA LADO IZQUIERDO	VELOCIDAD PENDULAR M/MIN	ENTRADA CORTADA	TTS NO PRODUCE PRODUCCION	PENDULAR 11.3.2.0 AVERIA GENERAL	DEFECTO INCLINACION ZONA CAJA 4	DEFECTO INCLINACION ZONA CAJA 8	DEFECTO INCLINACION PENDULOS TTS	DEFECTO INCLINACION ZONA CAJA 7
TS
2019-08-01 00:00:00	18.0	20.0	3.0	1.20	532.0	5.0	56.0	98.0	0.0	0.0	0.0	56.0	416.0	428.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2019-08-01 00:01:00	18.0	20.0	3.0	1.20	537.0	5.0	56.0	100.0	0.0	0.0	0.0	56.0	418.0	429.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2019-08-01 00:02:00	18.0	20.0	2.0	1.21	552.0	5.0	56.0	101.0	0.0	0.0	0.0	56.0	417.0	430.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2019-08-01 00:03:00	20.0	20.0	1.0	1.22	543.0	5.0	56.0	102.0	0.0	0.0	0.0	56.0	415.0	430.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2019-08-01 00:04:00	20.0	20.0	0.0	1.23	548.0	5.0	56.0	103.0	0.0	0.0	0.0	56.0	415.0	428.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

2.1.3 Graphs¶

timeseries plot of each feature and label
sorted features by variance - in general, higher variance means higher probability that the feature will be usefull for classification
correlation heatmap - cross correlation between all features
summary of label data - we need more faults (request more data) so the classification would make sense

In [116]:

            
                Copied!
                
values_df.plot(subplots=True, figsize=(10,30));
values_df.plot(subplots=True, figsize=(10,30));

In [127]:

            
                Copied!
                
fig, ax = plt.subplots(1, 1, figsize=(16, 5))
fig.suptitle('Sorted features by their variance', size=20)

features = values_df.loc[:, ~values_df.columns.isin(WarningsFaults_df['Descripción Proceso - Aviso PMC'].unique())]
features_std = features.std().to_frame().rename(columns = {'index': 'feature', 0: 'std'})
features_std = features_std.sort_values('std', ascending = False)

bar_plot = features_std.plot.bar(ax=ax, legend=False, color ='black', width=0.6)
ax.tick_params(labelsize=12)
ax.set_xlabel('sorted features',size=12);
fig, ax = plt.subplots(1, 1, figsize=(16, 5))
fig.suptitle('Sorted features by their variance', size=20)

features = values_df.loc[:, ~values_df.columns.isin(WarningsFaults_df['Descripción Proceso - Aviso PMC'].unique())]
features_std = features.std().to_frame().rename(columns = {'index': 'feature', 0: 'std'})
features_std = features_std.sort_values('std', ascending = False)

bar_plot = features_std.plot.bar(ax=ax, legend=False, color ='black', width=0.6)
ax.tick_params(labelsize=12)
ax.set_xlabel('sorted features',size=12);

In [139]:

            
                Copied!
                
                    
                    
                
                

        
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
fig.suptitle('Correlation matrix heatmap', size=20, y=1.25)

features_corr = features.corr()

heatmap = ax.matshow(features_corr, cmap=plt.cm.seismic)
ax.set_xticks(range(features_corr.shape[1]))
ax.set_xticklabels(features_corr.columns.values, fontsize=10, rotation=90)
ax.set_yticks(range(features_corr.shape[0]))
ax.set_yticklabels(features_corr.index.values, fontsize=10)

ax.set_xticks(np.arange(-.5, features_corr.shape[1]-1, 1), minor=True);
ax.set_yticks(np.arange(-.5, features_corr.shape[0]-1, 1), minor=True);
ax.grid(which='minor', color='w', linestyle='-', linewidth=2.)
# create an axes on the right side of ax. The width of cax will be 5%
# of ax and the padding between cax and ax will be fixed at 0.05 inch.
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="2%", pad=0.1)
cb = fig.colorbar(heatmap, cax=cax)
cb.ax.tick_params(labelsize=12)
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
fig.suptitle('Correlation matrix heatmap', size=20, y=1.25)

features_corr = features.corr()

heatmap = ax.matshow(features_corr, cmap=plt.cm.seismic)
ax.set_xticks(range(features_corr.shape[1]))
ax.set_xticklabels(features_corr.columns.values, fontsize=10, rotation=90)
ax.set_yticks(range(features_corr.shape[0]))
ax.set_yticklabels(features_corr.index.values, fontsize=10)

ax.set_xticks(np.arange(-.5, features_corr.shape[1]-1, 1), minor=True);
ax.set_yticks(np.arange(-.5, features_corr.shape[0]-1, 1), minor=True);
ax.grid(which='minor', color='w', linestyle='-', linewidth=2.)
# create an axes on the right side of ax. The width of cax will be 5%
# of ax and the padding between cax and ax will be fixed at 0.05 inch.
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="2%", pad=0.1)
cb = fig.colorbar(heatmap, cax=cax)
cb.ax.tick_params(labelsize=12)

In [154]:

            
                Copied!
                
labels = values_df.loc[:, values_df.columns.isin(WarningsFaults_df['Descripción Proceso - Aviso PMC'].unique())]
print('Fraction of samples that has more than 1 label:\n' + str(len([x for x in labels.sum(axis=1) if x > 0])/len(labels)))
labels = values_df.loc[:, values_df.columns.isin(WarningsFaults_df['Descripción Proceso - Aviso PMC'].unique())]
print('Fraction of samples that has more than 1 label:\n' + str(len([x for x in labels.sum(axis=1) if x > 0])/len(labels)))

Fraction of samples that has more than 1 label:
0.022087813620071686

In [166]:

            
                Copied!
                
print('Fraction of normal measurements :\n' + str(len([x for x in labels.sum(axis=1) if x == 0])/len(labels)) +
      '\n\nFraction of measurements assigned to each label:\n' + str(labels.sum()/len(labels)))
print('Fraction of normal measurements :\n' + str(len([x for x in labels.sum(axis=1) if x == 0])/len(labels)) +
      '\n\nFraction of measurements assigned to each label:\n' + str(labels.sum()/len(labels)))

Fraction of normal measurements :
0.9779121863799283

Fraction of measurements assigned to each label:
Descripción Proceso - Señal PMC
ENTRADA CORTADA                     0.018884
TTS NO PRODUCE  PRODUCCION          0.016622
PENDULAR 11.3.2.0 AVERIA GENERAL    0.000448
DEFECTO INCLINACION  ZONA CAJA 4    0.000112
DEFECTO INCLINACION  ZONA CAJA 8    0.000246
DEFECTO INCLINACION PENDULOS TTS    0.000470
DEFECTO INCLINACION  ZONA CAJA 7    0.000112
dtype: float64

2.2 Spindle¶

run simple bash script that cleans the provided sql dumps and outputs .csv file
read the files to pandas dataframes
create datetime columns and set the appropriate frequency to find out amount of missing data
show initial dataset statistics with added fraction of NA data

In [52]:

            
                Copied!
                
%%bash
./spindle_dataset_cleaning.sh \#datasets/spindle/transientEnero.sql
./spindle_dataset_cleaning.sh \#datasets/spindle/rawdatadiciembre2019Enero2020.sql
%%bash
./spindle_dataset_cleaning.sh \#datasets/spindle/transientEnero.sql
./spindle_dataset_cleaning.sh \#datasets/spindle/rawdatadiciembre2019Enero2020.sql

done
done

In [64]:

            
                Copied!
                
                    
                    
                
                

        
dir_loc = '#datasets/spindle/'
transient_data_loc = 'transientEnero.csv'
raw_data_loc = 'rawdatadiciembre2019Enero2020.csv'

transient_df = pd.read_csv(dir_loc + transient_data_loc)
transient_df['TS'] = pd.to_datetime(transient_df['TS'])
transient_df.drop(columns=['DriverId'], inplace=True )
transient_df = transient_df.pivot(index='TS', columns='TagId', values='Value')
transient_df = transient_df.asfreq('500ms')

raw_df = pd.read_csv(dir_loc + raw_data_loc)
raw_df['TS'] = pd.to_datetime(raw_df['TS'])
raw_df.drop(columns=['DriverId'], inplace = True)
raw_df = raw_df.pivot(index='TS', columns='TagId', values='Value')
raw_df = raw_df.asfreq('10s')
dir_loc = '#datasets/spindle/'
transient_data_loc = 'transientEnero.csv'
raw_data_loc = 'rawdatadiciembre2019Enero2020.csv'

transient_df = pd.read_csv(dir_loc + transient_data_loc)
transient_df['TS'] = pd.to_datetime(transient_df['TS'])
transient_df.drop(columns=['DriverId'], inplace=True )
transient_df = transient_df.pivot(index='TS', columns='TagId', values='Value')
transient_df = transient_df.asfreq('500ms')

raw_df = pd.read_csv(dir_loc + raw_data_loc)
raw_df['TS'] = pd.to_datetime(raw_df['TS'])
raw_df.drop(columns=['DriverId'], inplace = True)
raw_df = raw_df.pivot(index='TS', columns='TagId', values='Value')
raw_df = raw_df.asfreq('10s')

In [90]:

            
                Copied!
                
transient_desc = transient_df.describe()
transient_desc.loc['NA[%]'] = transient_df.isnull().sum()/len(transient_df)
transient_desc = transient_df.describe()
transient_desc.loc['NA[%]'] = transient_df.isnull().sum()/len(transient_df)

In [91]:

            
                Copied!
                
transient_desc
transient_desc

Out[91]:

TagId	32	34	141	142	143	144	145	148	149	150	161	162	163	164	165	168	169	170	172
count	432282.000000	432282.000000	432285.000000	432279.000000	432284.000000	432286.000000	432286.000000	432277.000000	432280.000000	432279.000000	432286.000000	432283.000000	432284.000000	432286.000000	432286.000000	432286.000000	432285.000000	432283.000000	432286.000000
mean	25.897840	13.023716	0.000476	0.644662	0.003985	-0.000654	0.000000	-0.043002	0.031707	0.118441	0.002180	295.464447	0.907546	2.588104	0.000000	6.286893	0.355521	6.014409	-1055.447342
std	67.697704	48.703252	0.029142	1.513568	0.084584	0.309192	0.000000	0.484449	0.203790	0.273886	0.730687	677.620427	12.299569	81.868908	0.000000	71.884351	26.799863	63.556399	2038.643775
min	0.000000	0.000000	-0.029687	-25.231930	-1.752502	-8.551758	0.000000	-5.962830	-4.960205	-3.726196	-87.364040	-10572.590000	-296.329300	-1326.921000	0.000000	-1352.971000	-1121.115000	-798.290800	-5004.182000
25%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	3.000000	1.000000	0.000000	0.015259	0.000000	0.000000	0.000000	0.000000	0.000000	0.206787	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
max	241.000000	243.000000	2.389380	23.007200	2.597107	8.813232	0.000000	5.998535	6.157959	7.383911	224.621800	10316.320000	674.194900	2475.840000	0.000000	2108.840000	1789.496000	1482.262000	120.000000
NA[%]	0.696666	0.696666	0.696664	0.696668	0.696665	0.696664	0.696664	0.696670	0.696668	0.696668	0.696664	0.696666	0.696665	0.696664	0.696664	0.696664	0.696664	0.696666	0.696664

In [86]:

            
                Copied!
                
raw_desc = raw_df.describe()
raw_desc.loc['NA[%]'] = raw_df.isnull().sum()/len(raw_df)
raw_desc = raw_df.describe()
raw_desc.loc['NA[%]'] = raw_df.isnull().sum()/len(raw_df)

In [87]:

            
                Copied!
                
raw_desc
raw_desc

Out[87]:

TagId	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36	37	38	39	40	41	42	43	44	45	46	47	48	49	50	51	52	53	54	55	56	57	58	59	141	142	143	144	145	146	147	148	149	150	161	162	163	164	165	166	167	168	169	170	171	172	173	174	175	176	177	178	179	181	189
count	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	44494.000000	266953.000000	266954.000000	266954.000000	266953.000000	266954.000000	266954.000000	266954.000000	176801.000000	176801.000000	5884.000000	176681.0000
mean	317.770036	0.000000	165.832831	0.000000	145.581359	313.353081	271.833506	0.000000	1.549198	0.000000	46.626669	0.000000	101.064773	0.000000	101.442891	0.000000	2.630512	0.000000	62.949881	26.433969	270.454398	0.000000	271.430148	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.003261	2.873562	0.016156	0.000401	0.000000	0.000000	0.000000	-0.050703	0.054979	0.202935	0.016483	1320.746245	4.845993	5.260343	0.000000	0.000000	0.000000	9.319065	1.260258	6.059729	0.000000	-3023.839267	0.000000	157.002051	233.711810	126.551113	82.959335	0.000000	0.000000	0.603671	1.0000
std	228.702356	0.000000	112.882197	0.000000	1.927477	68.116307	185.680372	0.000000	16.453591	0.000000	10.646463	0.000000	100.886253	0.000000	100.965684	0.000000	17.620232	0.000000	64.867502	4.066286	184.729166	0.000000	185.365134	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.076364	3.734183	0.188616	0.490344	0.000000	0.000000	0.000000	0.685685	0.306500	0.325223	1.640005	1634.323919	47.024100	124.670687	0.000000	0.000000	0.000000	140.192264	37.418039	66.507754	0.000000	2434.524827	0.000000	89.522301	51.350859	31.566791	41.287074	0.000000	0.000000	0.489176	0.0000
min	0.000000	0.000000	0.000000	0.000000	140.000000	148.000000	0.000000	0.000000	0.000000	0.000000	7.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	14.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-0.095557	-31.631470	-1.824097	-8.389404	0.000000	0.000000	0.000000	-5.715637	-6.696411	-3.740967	-76.944480	-10097.060000	-289.636100	-1789.578000	0.000000	0.000000	0.000000	-1363.325000	-1272.459000	-730.088600	0.000000	-5005.869000	0.000000	50.552000	0.000000	0.005000	0.000000	0.000000	0.000000	0.000000	1.0000
25%	0.000000	0.000000	0.000000	0.000000	144.000000	237.000000	0.000000	0.000000	0.000000	0.000000	41.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	23.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-0.014526	0.000000	0.000000	0.000000	0.000000	-0.321350	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-5000.027000	0.000000	64.696000	244.987000	109.876000	70.604000	0.000000	0.000000	0.000000	1.0000
50%	482.000000	0.000000	242.000000	0.000000	145.000000	350.000000	397.000000	0.000000	0.000000	0.000000	50.000000	0.000000	6.000000	0.000000	7.000000	0.000000	2.000000	0.000000	0.000000	28.000000	395.000000	0.000000	396.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.351929	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.006714	0.120850	0.000000	615.049600	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-4999.904000	0.000000	244.964000	244.997000	112.881000	71.544000	0.000000	0.000000	1.000000	1.0000
75%	485.000000	0.000000	243.000000	0.000000	147.000000	365.000000	400.000000	0.000000	0.000000	0.000000	52.000000	0.000000	201.000000	0.000000	201.000000	0.000000	3.000000	0.000000	125.000000	29.000000	398.000000	0.000000	399.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	5.541992	0.000000	0.000000	0.000000	0.000000	0.000000	0.151062	0.087280	0.374634	0.000000	2522.843000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	245.000000	245.002000	152.935000	71.553000	0.000000	0.000000	1.000000	1.0000
max	492.000000	0.000000	250.000000	0.000000	151.000000	389.000000	437.000000	0.000000	269.000000	0.000000	267.000000	0.000000	245.000000	0.000000	245.000000	0.000000	395.000000	0.000000	150.000000	37.000000	437.000000	0.000000	436.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	2.275732	27.478030	3.529907	8.638916	0.000000	0.000000	0.000000	5.830994	6.868286	7.362427	160.501400	10948.450000	913.327200	2446.419000	0.000000	0.000000	0.000000	2122.646000	1100.979000	1531.627000	0.000000	150.000000	0.000000	245.005000	245.006000	220.023000	177.588000	0.000000	0.000000	1.000000	1.0000
NA[%]	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.889219	0.335341	0.335338	0.335338	0.335341	0.335338	0.335338	0.335338	0.559801	0.559801	0.985350	0.5601

3 Conclusion¶

3.1 Paintshop¶

we need inclication measuremnts to have inside knowledge of the system in order to predict rare events
we need as much data as possible to train our classifiers for each type of fault/label as in the analysed data 97.8% of the measuremnts did not have any fault

3.2 Spindle¶

we need faults
we need most recent measurements
we need to resolve the wast amount of missing measurements in the data as in the analysed data missing measurement feature values ranged from 33.5% to 98.5%