# ML-Trustworthy Evaluation Protocol

# Multi-Criteria Aggregation Methodology

The ML-Trustworthy evaluation of the submitted AI component follows a multi-criteria aggregation methodology designed to ensure a fair and reliable assessment of various trust attributes.
The table below illustrates the principle of metrics aggregation:

*(Image: illustration of the metrics aggregation principle)*

# Example of a comparative results table for four fictional submissions.

The following table gives a performance overview of four fictional submissions that were actually evaluated with the trustworthy AI pipeline, using manually constructed inference files. The indicator color codes are for illustrative purposes only.

Among these four submissions:

  • Solu-Perfect: The ideal solution, achieving perfect scores in both performance and all trust-related attributes.
  • Solu-No-Trust: A realistic solution without any dedicated mechanisms to address trustworthy AI concerns.
  • Solu-With-Trust: The same base solution as Solu-No-Trust, but enhanced with mechanisms for handling uncertainty, robustness, OOD monitoring, and drift management.
  • Solu-Random: A baseline solution that returns random predictions.

We observe that Solu-Perfect achieves a perfect score across all metrics. Solu-No-Trust and Solu-With-Trust show identical scores for Performance and Generalization; however, Solu-With-Trust obtains significantly better scores on the other trust attributes (Uncertainty, Robustness, Monitoring, and Drift Management).

*(Image: comparative results table for the four fictional submissions)*

# ML-Trustworthy Evaluation design

The evaluation protocol was designed to assess both performance and trustworthiness requirements, based on the Operational Design Domain (ODD) derived from operational needs linked to the AI component's automated function (i.e., assistance in weld validation).

After identifying the relevant trust attributes (e.g., robustness) associated with specific trust properties (e.g., output invariance under blur perturbation), the evaluation methodology was structured into the following stages:

  • Evaluation Specification
    What specific model behaviors do we want to assess and validate?

  • Evaluation Set Specification
    What kind of data must be used or constructed to test whether the model exhibits the expected behavior under specific conditions?

  • Evaluation Set Design
    What data should be selected or generated to build these evaluation sets?

  • Evaluation Set Validation
    How can we ensure that the evaluation datasets are reliable and representative of the scenarios being analyzed?

  • Criteria Specification
    What criteria should be defined to measure the presence or absence of the expected behavior?

  • Metrics Design
    What metrics can be used to quantify these criteria?

  • Trust-KPI Design
    How can these criteria be aggregated into a Trust-KPI for each trust attribute?

# Steps of the Metrics and Trust-KPI Computation

The aggregation process consists of several key steps.

In this section, $\alpha_i$, $\beta_i$, and $k_i$ are weighting or scaling coefficients used for the multi-criteria aggregation.

# 1. Computation of Attribute Metrics

  • Several metrics are computed for each attribute using specific evaluation datasets, in order to capture different aspects of the attribute's performance.
  • These evaluation datasets are either selected or synthetically generated to test distinct behavioral criteria.

# 2. Normalization of Attribute Metrics

  • All attribute-specific metrics are normalized to a score within the range [0, 1], where 1 represents the best possible performance.
  • Normalization is performed using appropriate transformations (e.g., sigmoid functions, exponential decay), depending on the nature of each metric.
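For illustration, here is a minimal Python sketch of two such transformations; the slope `k` and center `m0` parameters are hypothetical and would be chosen per metric:

```python
import math

def sigmoid_normalize(m: float, k: float = 1.0, m0: float = 0.0) -> float:
    """Map a raw metric m to (0, 1); higher m gives a higher score (center m0, slope k)."""
    return 1.0 / (1.0 + math.exp(-k * (m - m0)))

def exp_decay_normalize(m: float, k: float = 1.0) -> float:
    """Map a non-negative, cost-like metric m to (0, 1]; lower m gives a higher score."""
    return math.exp(-k * m)
```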

# 3. Trust-KPI Aggregation

  • For each attribute denoted X, a specific aggregation function combines the $k$ normalized X metrics into a single Trust-KPI denoted $I_X$.
  • This allows for a comprehensive representation of the model’s performance with respect to each trust attribute.

$$I_X = agg(X_{metric_1}, \dots, X_{metric_k})$$

For example, if X is the attribute "performance": $X_{metric_1} = OP$, $X_{metric_2} = ML$, and $X_{metric_3} = Time$.

# 4. Piecewise Linear Rescaling of Trust-KPIs

  • To ensure consistency and comparability across attributes, each KPI undergoes a piecewise linear rescaling.
  • This rescaling accounts for predefined performance and confidence requirements, aligning the raw KPI scores with the evaluation constraints.

$$f'(x) = \begin{cases} \frac{\beta_1}{\alpha_1} f(x), & 0 \leq f(x) < \alpha_1 \\ \frac{\beta_2 - \beta_1}{\alpha_2 - \alpha_1} \left(f(x) - \alpha_1\right) + \beta_1, & \alpha_1 \leq f(x) \leq \alpha_2 \\ \frac{1 - \beta_2}{1 - \alpha_2} \left(f(x) - \alpha_2\right) + \beta_2, & \alpha_2 < f(x) \leq 1 \end{cases}$$
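As a sketch, this rescaling could be implemented as follows, where `alpha1`, `alpha2` are the raw-score breakpoints and `beta1`, `beta2` the target levels fixed by the performance and confidence requirements (the values themselves come from the evaluation specification, not from this illustrative function):

```python
def piecewise_rescale(x: float, alpha1: float, alpha2: float, beta1: float, beta2: float) -> float:
    """Piecewise linear rescaling of a raw KPI x in [0, 1], as in the formula above:
    [0, alpha1) -> [0, beta1), [alpha1, alpha2] -> [beta1, beta2], (alpha2, 1] -> (beta2, 1]."""
    if x < alpha1:
        return (beta1 / alpha1) * x
    if x <= alpha2:
        return (beta2 - beta1) / (alpha2 - alpha1) * (x - alpha1) + beta1
    return (1.0 - beta2) / (1.0 - alpha2) * (x - alpha2) + beta2
```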

# 5. Weighted Aggregation of Trust-KPIs

  • The rescaled attribute KPIs are then aggregated into a final evaluation score using a weighted mean.
  • Each weight reflects the relative importance of its corresponding attribute within the overall trustworthy AI assessment.

$$score = \alpha_1 I_{perf} + \alpha_2 I_{U} + \alpha_3 I_{rob} + \alpha_4 I_{ood} + \alpha_5 I_{gen} + \alpha_6 I_{drift}$$
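A minimal sketch of this final aggregation, with the six rescaled KPIs and their weights stored in dictionaries (the numerical values below are purely illustrative, not the weights actually used by the protocol):

```python
def final_score(kpis: dict, weights: dict) -> float:
    """Weighted mean of the rescaled Trust-KPIs (weights are expected to sum to 1)."""
    return sum(weights[name] * kpis[name] for name in weights)

# Illustrative values only -- the actual weights are defined by the evaluation protocol.
kpis = {"perf": 0.82, "U": 0.74, "rob": 0.68, "ood": 0.90, "gen": 0.77, "drift": 0.71}
weights = {"perf": 0.30, "U": 0.15, "rob": 0.15, "ood": 0.15, "gen": 0.15, "drift": 0.10}
score = final_score(kpis, weights)
```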

# 6. Purpose of the Aggregation Protocol

The goal of this aggregation process is to produce a single, comprehensive trust score that captures the system’s performance across six key trust attributes. Each of these attributes is assessed through multiple criteria, measured with relevant metrics and normalized to reflect their practical impact.

# Trust-KPI and metrics by attribute.

# Performance attribute

Purpose: Measures the model's predictive accuracy and efficiency, ensuring it meets baseline expectations in a controlled environment.

Evaluation sets: Standard ML evaluation set based on a representative 20% split of the dataset.

Metrics:

  • OP-Perf (Operational Performance): Evaluates model performance through an operational view using confusion-matrix-based metrics that account for the cost of different error types and weld criticality.

    $$OP = \sum_{k}^{|N|} \sum_{i}^{true_{class}} \sum_{j}^{pred_{class}} \mathbb{1}_{Top_{class}(\hat{y}_k)=j} \cdot cost(i,j,k,k_{seam})$$

where $N$ is the number of samples in the evaluation dataset and $k_{seam}$ is the name of the welding seam.

  • ML-Perf (Machine Learning Performance): Assesses performance using standard ML metrics such as precision.

    $$ML = \frac{\sum_{i=1}^{N} \mathbb{1}(y_i = 1 \land \hat{y}_i = 1)}{\sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = 1)}$$

where $y_i$ is the ground truth and $\hat{y}_i$ is the AI component's prediction.

  • Inference Time (Time): Measures computational efficiency and runtime.

Performance-KPI: Combines OP-Perf and ML-Perf using a weighted average, penalized by inference time to reflect operational constraints.

$$I_{perf} = \frac{\alpha_{op}\, e^{-k_c\, OP} + \alpha_{ml}\, ML}{1 + k_t \ln(1+t)}$$

where $t$ is the inference time.
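A minimal sketch of the ML-Perf and Performance-KPI computations; the coefficient values `alpha_op`, `alpha_ml`, `k_c`, `k_t` are illustrative defaults, and the OP cost and inference time `t` are assumed to have been measured beforehand:

```python
import math

def ml_precision(y_true: list, y_pred: list) -> float:
    """ML-Perf: precision of the positive class (y = 1)."""
    true_positives = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
    predicted_positives = sum(1 for yp in y_pred if yp == 1)
    return true_positives / predicted_positives if predicted_positives else 0.0

def performance_kpi(op: float, ml: float, t: float,
                    alpha_op: float = 0.5, alpha_ml: float = 0.5,
                    k_c: float = 0.01, k_t: float = 0.1) -> float:
    """I_perf = (alpha_op * exp(-k_c * OP) + alpha_ml * ML) / (1 + k_t * ln(1 + t))."""
    return (alpha_op * math.exp(-k_c * op) + alpha_ml * ml) / (1.0 + k_t * math.log(1.0 + t))
```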

# Uncertainty assessment

Purpose: Evaluates the AI component's ability to express meaningful and calibrated uncertainty, helping assess the risk of decision errors.

Evaluation sets: Standard ML evaluation set based on a representative 20% split of the dataset.

Metrics:

  • U-OP (Uncertainty Operational Gain): Relative measure of the virtual operational gain obtained by using the probabilistic outputs instead of the hard predictions, expressed relative to the gap between the perfect solution and the current hard predictions.

$$c^{U} = \sum_{k}^{|N|} \sum_{i}^{true_{class}} \sum_{j}^{pred_{class}} \hat{y}_k(j) \cdot cost(i,j,k,k_{seam})$$

$$UOP = \frac{c^{U} - c^{op}}{c^{op} - c^{op}_{perfect}}$$

  • U-Calib (Calibration Quality): Evaluates how well predicted probabilities align with actual error rates (e.g., Expected Calibration Error).

    $$UCalib = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| acc(B_m) - conf(B_m) \right|$$

For more information, see the definition of the Expected Calibration Error (wiki link).
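A minimal sketch of an ECE computation with equal-width confidence bins (the bin count and the use of the top-class confidence per sample are assumptions of this illustration):

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """ECE: sum over bins of |B_m|/N * |acc(B_m) - conf(B_m)|, with equal-width confidence bins."""
    n = len(confidences)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > low) & (confidences <= high)
        if in_bin.any():
            acc = correct[in_bin].mean()        # empirical accuracy in the bin
            conf = confidences[in_bin].mean()   # mean predicted confidence in the bin
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return float(ece)
```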

Uncertainty-KPI: Combines Uncertainty Operational Gain with calibration error.

$$I_{U} = e^{k_{UOP}\, UOP} \cdot (1 - UCalib)^{k_{UCalib}}$$

# Robustness

Purpose: Assesses model stability under perturbations such as blur, lighting variation, rotation, and translation.

Evaluation sets: Generated by applying synthetic perturbations to a weld-balanced subset of the standard evaluation set.

*(Image: illustration of the robustness evaluation sets built with synthetic perturbations)*
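As an illustration, such perturbation levels could be generated with Pillow roughly as follows; the parameterization of each perturbation and the level values are assumptions of this sketch, not the ones used by the pipeline:

```python
from PIL import Image, ImageChops, ImageEnhance, ImageFilter

def perturb(img: Image.Image, kind: str, level: float) -> Image.Image:
    """Apply one synthetic perturbation at a given level (illustrative parameterization)."""
    if kind == "blur":
        return img.filter(ImageFilter.GaussianBlur(radius=level))   # blur radius in pixels
    if kind == "lum":
        return ImageEnhance.Brightness(img).enhance(1.0 + level)    # brightness change factor
    if kind == "rot":
        return img.rotate(level)                                    # rotation angle in degrees
    if kind == "trans":
        return ImageChops.offset(img, int(level), 0)                # horizontal shift in pixels
    raise ValueError(f"unknown perturbation kind: {kind}")

# Example: one robustness evaluation subset per blur level (levels are illustrative).
# blurred_sets = {d: [perturb(img, "blur", d) for img in subset] for d in (0.5, 1.0, 2.0, 4.0)}
```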

Metrics:

  • Blur Robustness: Aggregation (AUC) of the ML performance (precision score) across increasing perturbation levels.
  • Luminance Robustness: Aggregation (AUC) of the ML performance (precision score) across increasing perturbation levels.
  • Rotation Robustness: Aggregation (AUC) of the ML performance (precision score) across increasing perturbation levels.
  • Translation Robustness: Aggregation (AUC) of the ML performance (precision score) across increasing perturbation levels.

$$r^x = AUC(ML_{\delta_1}, \dots, ML_{\delta_k})$$

where $x \in \{blur, lum, rot, trans\}$ and $\delta_1, \dots, \delta_k$ are the different perturbation levels.

Robustness-KPI: Weighted aggregation of robustness scores across all perturbation types.

$$I_{rob} = \sum_{i \in \{blur,\, lum,\, rot,\, trans\}} \alpha_{r_i}\, r^i$$
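A minimal sketch of the robustness aggregation, assuming the precision values $ML_{\delta_1}, \dots, ML_{\delta_k}$ have already been computed per perturbation level; normalizing the AUC by the level range (so that $r^x$ stays in [0, 1]) is an assumption of this sketch:

```python
def robustness_score(levels: list, precisions: list) -> float:
    """r^x: trapezoidal area under the precision-vs-perturbation-level curve, normalized by the level range."""
    auc = sum(0.5 * (precisions[i] + precisions[i + 1]) * (levels[i + 1] - levels[i])
              for i in range(len(levels) - 1))
    return auc / (levels[-1] - levels[0])

def robustness_kpi(scores: dict, weights: dict) -> float:
    """I_rob: weighted sum of r^blur, r^lum, r^rot, r^trans."""
    return sum(weights[x] * scores[x] for x in ("blur", "lum", "rot", "trans"))
```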

# OOD-Monitoring

Purpose: Evaluates the model's ability to detect and handle out-of-distribution (OOD) inputs.

Evaluation sets: Includes both synthetic and real OOD datasets with a balanced mix of normal and OOD samples. Real OOD samples are manually selected, and synthetic OOD samples are generated through transformations.

*(Image: illustration of the OOD evaluation sets)*

Metrics:

  • Real-OOD score: AUROC on the real OOD evaluation set, denoted $OOD_{real}$.
  • Syn-OOD score: AUROC on the synthetic OOD evaluation set, denoted $OOD_{syn}$.

OOD-Monitoring KPI: Weighted average of real and synthetic OOD detection performance.

$$I_{ood} = \alpha_{syn}\, OOD_{syn} + \alpha_{real}\, OOD_{real}$$
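A minimal sketch using scikit-learn, where `real_scores`/`syn_scores` are the component's monitoring scores (higher meaning more likely OOD), `real_is_ood`/`syn_is_ood` the binary OOD labels, and the weights are illustrative:

```python
from sklearn.metrics import roc_auc_score

def ood_kpi(real_is_ood, real_scores, syn_is_ood, syn_scores,
            alpha_real: float = 0.5, alpha_syn: float = 0.5) -> float:
    """I_ood: weighted average of the AUROC on the real and synthetic OOD evaluation sets."""
    ood_real = roc_auc_score(real_is_ood, real_scores)   # AUROC on the real OOD set
    ood_syn = roc_auc_score(syn_is_ood, syn_scores)      # AUROC on the synthetic OOD set
    return alpha_syn * ood_syn + alpha_real * ood_real
```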

# Generalization

Purpose: Measures the model’s ability to generalize to unseen weld types that share characteristics with the training set.

Evaluation sets: Built using data from weld types excluded during training but with similar visual/structural traits.

*(Image: illustration of the generalization evaluation sets)*

Metrics:

  • OP-Perf-g: Operational performance on the generalization set.
  • ML-Perf-g: ML performance (e.g., precision) on the generalization set.

Generalization-KPI: Aggregated from OP-Perf-g and ML-Perf-g.

$$I_{gen} = \alpha_{op}\, e^{-k_{op}\, OP_g} + \alpha_{ml}\, ML_{g}$$

The subscript $g$ in $ML_g$ or $OP_g$ indicates that the metric is computed on the generalization dataset.
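A minimal sketch of the Generalization-KPI, assuming $OP_g$ and $ML_g$ have already been computed on the generalization set (coefficient values are illustrative):

```python
import math

def generalization_kpi(op_g: float, ml_g: float,
                       alpha_op: float = 0.5, alpha_ml: float = 0.5, k_op: float = 0.01) -> float:
    """I_gen = alpha_op * exp(-k_op * OP_g) + alpha_ml * ML_g."""
    return alpha_op * math.exp(-k_op * op_g) + alpha_ml * ml_g
```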

# Data-Drift handling

Purpose: Evaluates both the robustness and OOD detection of the model in response to gradual data drift.

Evaluation sets: Constructed by applying increasing levels of synthetic perturbations to a normal data sequence, simulating drift. Final segments are manually labeled as OOD.

*(Image: illustration of the data-drift evaluation sequence)*

Metrics:

  • Perf-OP-d: Operational performance under drift.
  • OOD-d (OOD detection score): AUROC on the drift-induced OOD subset.

Data-Drift-KPI: Combines performance and detection ability during simulated drift.

$$I_{drift} = \alpha_{OP_{d}}\, e^{-k_{op}\, OP_{d}} + \alpha_{OOD_{d}}\, OOD_{d}$$

where the subscript $d$ indicates that the metrics are computed only on the drifted dataset.
