Chunk 0
--- Page 1 ---
Probabilistic Verification of Neural Networks via
Efficient Probabilistic Hull Generation
Jingyang Li1, Xin Chen2, Hongfei Fu1, and Guoqiang Li1
1 Shanghai Jiao Tong University, Shanghai 200240, China
{lijjjjj, jt002845, li.g}@sjtu.edu.cn
2 University of New Mexico, Albuquerque, NM 87106, USA
chenxin@unm.edu
Abstract. The problem of probabilistic verification of a neural network
investigates the probability of satisfying the safe constraints in the out-
put space when the input is given by a probability distribution.
Chunk 1
It is
significant to answer this problem when the input is affected by distur-
bances often modeled by probabilistic variables. In the paper, we propose
a novel neural network probabilistic verification framework which com-
putes a guaranteed range for the safe probability by efficiently finding
safe and unsafe probabilistic hulls.
Chunk 2
Our approach consists of three main
innovations: (1) a state space subdivision strategy using regression trees
to produce probabilistic hulls, (2) a boundary-aware sampling method
which identifies the safety boundary in the input space using samples
that are later used for building regression trees, and (3) iterative refine-
ment with probabilistic prioritization for computing a guaranteed range
for the safe probability. The accuracy and efficiency of our approach are
evaluated on various benchmarks including ACAS Xu and a rocket lan-
der controller.
Chunk 3
The result shows an obvious advantage over the state of
the art. Keywords: Neural Network · Formal Verification · Probabilistic Verifi-
cation.
Chunk 4
1
Introduction
Formal verification of deep neural networks (DNNs) is crucial for ensuring the
trustworthiness and the reliability of learning-enabled systems in safety-critical
scenarios [10, 13]. While most existing techniques aim at solving the output range
analysis problem of DNNs [7, 20, 23], it is still difficult to know the safety of a
DNN when the input is affected by noises that are often modeled by random
variables.
Chunk 5
Since noises or disturbances are often described by Gaussian variables,
their values are not bounded and can hardly be handled by existing output range
analysis methods. We propose a new approach which generates a guaranteed
range for the safety of a DNN when its input is Gaussian variables.
Chunk 6
Probabilistic verification [17, 9, 16, 2, 14] involves analyzing the probability of
safety violations under specified input distributions instead of giving a determin-
istic answer to the safety query. In order to estimate the probability of satisfying
arXiv:2604.21556v1 [cs.AI] 23 Apr 2026
--- Page 2 ---
2
J.
Chunk 7
Li et al. the safe constraints in the output space of a DNN, a straightforward approach
could be to compute the output distribution and then integrate its density func-
tion over the safe output space.
Chunk 8
However, it is often not feasible due to the
complexity of DNNs. In this paper, we propose a method which conservatively
estimate the safety of a DNN based on a set of safe or unsafe probabilistic hulls
that at most intersect on facets.
Chunk 9
A probabilistic hull is a closed and bounded set
in the input space whose orientation is aligned with the Gaussian distribution of
the input, such that the probability of it can be efficiently computed based on
the error function [1]. An upper bound of safety can be obtained by subtracting
the probabilities of all probabilistic hulls which are purely unsafe, while a lower
bound can be derived by adding the probabilities of all safe hulls.
Chunk 10
The main
contribution of our method is an efficient search strategy for finding large safe
and unsafe hulls. Existing probabilistic neural network verification methods suffer from the
curse of dimensionality.
Chunk 11
Typically, the existing set-based techniques [23, 15] often
prioritize partitioning the input space uniformly that cannot recognize critical
regions. This leads to exponential amount of subdivisions and is time consuming.
Chunk 12
Other methods, such as ProbStar [14] symbolically propagates the input set to
the output space and then estimate their safe probability, it is able to more
accurately track the input-output dependency however cannot handle the DNNs
whose activations are not ReLU. We study the probabilistic verification problem on feedforward DNNs.
Chunk 13
We
propose a novel regression tree-guided probabilistic verification framework that
achieves both efficiency and generality. Our key insight is that neural network
decision/safe boundaries, i.e., the boundary dividing the safe and unsafe set in
the output space, can be effectively approximated through adaptive region par-
titioning guided by sampling-based regression trees, eliminating the unnecessary
exploration and subdivision in the regions that are purely safe or unsafe.
Chunk 14
Our
approach makes three primary contributions: (1) We introduce a regression tree-
guided state space splitting strategy for the input space that efficiently generates
purely safe or unsafe probabilistic hulls of large probabilities. It avoids the un-
necessary subdivisions in the safe or unsafe regions in most practical cases.
Chunk 15
(2)
We propose a boundary-aware sampling method which effectively identifies the
safety boundary in the input space by random samples which are later used to
build regression trees. (3) We develop an incremental sampling mechanism that
iteratively refines the undecided regions with the largest probability, avoiding
the prohibitive cost of dense sampling across the entire input space.
Chunk 16
Experimen-
tal evaluation on the challenging benchmarks demonstrates that our method
achieves superior accuracy against the state of the art and achieves up to 10×
speedup compared to basic branch-and-bound approaches. 2
Related Work
Linear bound propagation techniques, exemplified by CROWN [23], have rev-
olutionized neural network verification by extending traditional interval bound
--- Page 3 ---
Probabilistic Verification of Neural Networks
3
propagation to linear function-based bound propagation.
Chunk 17
This advancement sig-
nificantly improves scalability while maintaining reasonable tightness for veri-
fication tasks. Building upon CROWN, β-CROWN [15] integrates branch-and-
bound (BaB) techniques with GPU acceleration to achieve both completeness
and scalability in verification processes.
Chunk 18
Probabilistic verification has emerged as a critical paradigm for quantify-
ing neural network safety under stochastic inputs. One part of the research
focus on Bayesian neural network verification [3, 18, 19].
Chunk 19
Early approaches like
PROVEN [17] established distribution-aware robustness guarantees by integrat-
ing probabilistic measures with formal verification techniques. Subsequently, hy-
brid methods [9] combined abstract interpretation with Monte Carlo sampling to
balance computational efficiency with probabilistic assurances.
Chunk 20
Statistical evalu-
ation frameworks [16] further advanced the field by employing adaptive sampling
strategies to estimate perturbation tolerance while quantifying violation proba-
bilities. Besides, [12] proposes a characteristic-function-based probabilistic ver-
ification method which estimates the safe probability via frequency-domain.
Chunk 21
Recent developments have focused on scalability. The probabilistic branch-
and-bound approach [2] iteratively refines probability bounds through dynamic
region subdivision, achieving improved verification success rates but suffering
from exponential complexity in high-dimensional spaces.
Chunk 22
Most notably, Prob-
Star [14] introduces probabilistic star-based reachability analysis for efficient
uncertainty propagation in safety-critical systems. However, ProbStar is limited
to ReLU networks due to its reliance on linear region enumeration, restricting
its applicability to networks with generic activation functions.
Chunk 23
3
Preliminaries
A deterministic input to a DNN can be denoted by a column vector x =
(x1, . .
Chunk 24
. , xd).
Chunk 25
Therefore, a noise-affected input can be denoted by x + w where
w = (w1, . .
Chunk 26
. , wd) are the noises for each input variable.
Chunk 27
In this paper, we fo-
cus on Gaussian noises, i.e., w subject to a multivariate Gaussian distribution,
since they are the most popular models for noises in practice. Since Gaussian
distributions are closed under affine transformations, a noise-affected input can
be represented by Gaussian variables, i.e., x ∼N(µ, Σ) where µ is the mean and
Σ is the covariance matrix.
Chunk 28
Problem Formulation. Let f : Rd →Rm denote the input-output mapping of a
(feedforward) DNN which has d inputs and m outputs.
Chunk 29
Given a Gaussian input
x ∼N(µ, Σ) and a safety property in the output space: Γ(y) : Cy ≥a, such
that the property is defined by a finite set of affine constraints/inequalities. Here,
y is a point/vector in the output space, i.e., y = f(x), C ∈Rk×m is a matrix of
coefficients, and a ∈Rk is a constant vector.
Chunk 30
The DNN probabilistic verification
problem asks to estimate the probability
P(Γ(y) | x ∼N(µ, Σ)). (1)
--- Page 4 ---
4
J.
Chunk 31
Li et al. As we pointed out that an exact calculation of the probability (1) is in-
tractable.
Chunk 32
Therefore, we conservatively estimate an upper bound as well as a
lower bound for the actual probability via computing a finite set of safe and
unsafe probabilistic hulls. Given a set of Gaussian variables x ∼N(µ, Σ) such that Σ = diag(σ2
1, .
Chunk 33
. .
Chunk 34
, σ2
d),
i.e., all variables subject to independent distributions. The probability P(x ∈H),
or simply P(H), where the hull H is a box [a1, b1]×· · ·×[ad, bd] can be computed
by evaluating the expression Q
1≤i≤d P(ai ≤xi ≤bi) such that
P(ai ≤xi ≤bi) = 1
2
erf(bi −µ
√
2σi
) −erf(ai −µ
√
2σi
)
where erf is the error function [1].
Chunk 35
For any full rank covariance matrix Σ, a
hull H aligned with its orientation can be calculated in a similar way. That
is, we may transform Σ to a diagonal matrix, transform H to a box using the
same mapping and evaluate the probability of being in H with the transformed
distribution.
Chunk 36
Hence, for simple denotation, we assume that Σ is diagonal. In a
DNN verification problem, a hull H is safe (unsafe resp.) when all points in its
output set f(H) satisfying (violating resp.) the safe property Γ.
Chunk 37
Input
f, Γ, µ, Σ, ε
Initialize
Runknown = [Space]
Rsafe, Runsafe = ∅
P
R∈Runknown P(R) ≤ε? Output
Safe probability interval [Ls, Us]
Select Region
Region Rnext with
largest probability in Runknown
Boundary-Aware Subdivision
Extract probabilsitic hulls in Rnext
Verify and Update Region Sets
Add to Rsafe/Runsafe/Runk
No
Yes
Fig.
Chunk 38
1. Workflow of our regression tree-guided probabilistic verification framework.
Chunk 39
4
Main Approach
The main idea of our approach for estimating conservative upper and lower
bounds for the actual safe probability is to iteratively construct a set of safe hulls:
Hs
1, . .
Chunk 40
. , Hs
N and a set of unsafe hulls: Hu
1 , .
Chunk 41
. .
Chunk 42
, Hu
M, such that all hulls intersect
at most on the facets. Therefore, each hull represents an independent probability
from the others.
Chunk 43
Then, the upper bound is computed as Us = 1−P
1≤i≤M P(Hu
i ),
while the lower bound is obtained as Ls = P
1≤i≤N P(Hs
i ). Theorem 1.
Chunk 44
If Hs
1, . .
Chunk 45
. , Hs
N are safe and Hu
1 , .
Chunk 46
. .
Chunk 47
, Hu
M are unsafe, then we have
that Us ≥Ls, and the actual safe probability must be in the interval [Ls, Us]. --- Page 5 ---
Probabilistic Verification of Neural Networks
5
Obviously, if the probabilistic hulls can be efficiently found and have high
probabilities, an accurate range for safe probability can be obtained.
Chunk 48
Unlike the
related methods, the probabilistic hulls are not generated from random samples. Instead, we build (safe) boundary-ware regression trees which subdivide the
input space into box regions, then we use CROWN to verify the safety of them
and identify the safe and unsafe subregions as hulls.
Chunk 49
In the experiments, we show
that such a strategy greatly outperforms the method using uniform sampling. The high-level workflow of our framework is given in Figure 1.
Chunk 50
Given a DNN
f, a safety property Γ, a Gaussian distribution N(µ, Σ) for the input, and a ter-
mination threshold ε > 0. Here, we only need to use the DNN as a blackbox.
Chunk 51
The
main algorithm keeps three sets of probabilistic hulls: Rsafe, Runsafe, Runknown
that are used to keep the set of safe, unsafe and unknown hulls respectively. Ini-
tially, Runknown has a single element which represents the whole state space, while
Rsafe, Runsafe are both empty.
Chunk 52
The whole state space can be represented by a large
bounded set which is of a probability sufficiently close to 1 (e.g. ≥99.999%) for
the inputs, instead of an unbounded support range from Gaussian distributions.
Chunk 53
Then the main loop performs the following steps: (1) It chooses the hull Rnext
which has the highest probability in Runknown and delete it from Runknown; (2)
Rnext is subdivided using our boundary-aware subdivision strategy which builds
a regression tree and the subdivisions are new probabilistic hulls obtained as the
tree leafs; (3) We verify the output sets of the subdivisions. We add the safe
hulls to Rsafe, the unsafe hulls to Runsafe, and the hulls that are not purely safe
or unsafe to Runknown.
Chunk 54
The loop terminates when the summation of the proba-
bilities of the hulls in Runknown is < ε. The details are described in the following
subsections.
Chunk 55
4.1
Boundary-Aware Subdivision Strategy
Uniform Subdivision
Unk
Unk
Safe
Unk
Boundary-Aware Subdivision
Unk
Safe
Unsafe
Unk
— Boundary
- - Splits
• Samples
Fig. 2.
Chunk 56
Uniform vs. boundary-aware subdivision.
Chunk 57
Efficiently finding large safe/unsafe hulls plays a key role in our framework. A naive way could be uniformly subdivide the input state space and verify the
safety of each piece.
Chunk 58
However, it is often very expensive and ignores the decision
boundary. Directly considering the boundary in probabilistic hull search can be
--- Page 6 ---
6
J.
Chunk 59
Li et al. difficult, since the boundary is in the output space and the subdivision in per-
formed in the input space.
Chunk 60
However, it is still possible to “backpropagate” the
information of the boundary to the subdivision process. Figure 2 gives an intu-
itive comparison between uniform subdivision and boundary-aware subdivision.
Chunk 61
The red curve indicates the backpropagated boundary to the input space, it is
often not computable and unknown to our process. It can be seen from the figure
that, using boundary-aware subdivision can better produce large hulls which do
not intersect the boundary and are pure safe or unsafe.
Chunk 62
Our boundary-aware subdivision method is outlined in Algorithm 1. Given
a DNN f, a region R in the input space, number of samples n and the safety
specification defined by the parameters C and a.
Chunk 63
Our whole boundary-ware sub-
division process performs the following steps: (1) It samples a point set Sbase
with n points within the given region R and check the safety of their outputs; (2)
If both safe and unsafe outputs are found, it conducts boundary-aware sampling
by rejecting the samples whose outputs are sufficiently far from the boundary,
and we obtain a point set S; (3) Otherwise, we set S = Sbase since the boundary
might not be in R; (4) We use the samples in S to build a regression tree T
in the region R, where each leaf represents a subdivision of R. The details of
boundary-aware sampling are given as below.
Chunk 64
Combined Sampling Strategy Foundation. We introduce the setup of our base
sampling strategy.
Chunk 65
Our base sampling strategy is the combination of the distribu-
tional sampling of the input distribution N(µ, Σ) and the uniform distribution. Given an integer n, we sample ⌊n/2⌋points with distributional sampling and
sample n −⌊n/2⌋points with uniform sampling.
Chunk 66
We adopt the combination in-
stead of only one of them because (i) we need to take the distribution of the
input variables into account since it helps to find probabilistic hulls in the space
of high probability density and make the hulls of high probability, while (ii) ad-
ditionally considering a uniform distribution helps us to sample the margin of a
region, and have balanced safe and unsafe samples. Boundary Awareness Through Elimination.
Chunk 67
In order to better reflect the safe
boundary in R, we check our samples to see if both safe and unsafe points are
generated. If not, we do not start the rejection sampling step.
Chunk 68
Otherwise, we aim
at keeping the samples near the boundary. Although it is difficult to compute
the boundary in the input space, the distance of a sample x to the boundary can
be measured as the distance between f(x) and the boundary.
Chunk 69
Then we apply
an elimination mechanism to remove the samples that are sufficiently far from
the boundary. To do so, we associate a probability distribution for eliminating a
sample.
Chunk 70
For each sample x, we compute its output f(x) and the distance to the
safety boundary Cf(x) −a. Then, the elimination probability is given by
Pe(x) = 1 −exp(−rank(|Cf(x) −a|/σ))
where rank returns the corresponding index in the sorted distance from the
nearest to the farthest.
Chunk 71
--- Page 7 ---
Probabilistic Verification of Neural Networks
7
Algorithm 1: Boundary-Aware Subdivision
Input: Region R, number of samples n, network f, C, a
Output: Boundary-aware regression tree T
// Combined sampling foundation
1 Sbase ←CombinedSample(R, n);
// Check if rejection sampling is needed
2 if Sbase contains only safe or unsafe points then
// No boundary in region, return base samples
3
return Sbase
4 end
// Rejection sampling
5 S ←∅;
6 nattempts ←0;
7 while |S| < n and nattempts < max_attempts do
8
x ←CombinedSample(R, n)
9
xaccept ←EliminateSamples(C, f, x)
10
S ←S ∪{xaccept};
11
nattempts ←nattempts + 1;
12 end
// Fill remaining samples if not enough sample points
13 if |S| < n then
14
S ←S∪CombinedSample(R, n −|S|);
15 end
16 T ←BuildRegressionTree(S, R)
17 return T
Regression Tree Construction. We use the sample set after elimination to build
a regression tree which is expected to better split the region along the boundary.
Chunk 72
The tree construction follows the standard algorithm [11]. For each non-leaf
node, the impurity of the samples is measured as
Impurity = MSE(Sleft, Sright)
Lα
where Sleft, Sright are the split sets, L is the variation of the samples in the
dimension of subdivision, and α is a user-specified hyperparameter controlling
the geometric penalty.
Chunk 73
The MSE is computed as MSE(Sleft, Sright) = Var(yi|xi ∈
Sleft) · |Sleft| + Var(yi|xi ∈Sright) · |Sright|. Each subdivision is performed to
minimize the impurity of the two subdivisions.
Chunk 74
--- Page 8 ---
8
J. Li et al.
Chunk 75
Algorithm 2: The Main Algorithm
Input: Network f, C, a, distribution p(x), threshold ε
Output: Safe probability interval [Ls, Us]
1 Initialize: Runknown ←[entire_input_space];
2 Rsafe ←∅, Runsafe ←∅;
3 while P
R∈Runknown P(R) ≥ε do
// Select region with maximum probability mass
4
Rnext ←SelectLargestProbRegion(Runknown);
// Targeted sampling within selected region
5
S ←BoundaryAwareSample(Rnext, small_sample_size);
// Build regression tree and extract leaf regions
6
T ←BuildRegressionTree(S, Rnext);
7
{R1, ..., Rk} ←ExtractLeafRegions(T);
// Verify each leaf region
8
{s1, ..., sk} ←CROWNVerify({R1, ..., Rk}, C, a);
// Update region sets
9
Remove Rnext from Runknown;
10
for i ←1 to k do
11
if si = Safe then
12
Rsafe ←Rsafe ∪{Ri};
13
else if si = Unsafe then
14
Runsafe ←Runsafe ∪{Ri};
15
else
16
Runknown ←Runknown ∪{Ri};
17
end
18
end
19 end
20 Ls ←P
R∈Rsafe P(R);
21 Us ←1 −P
R∈Runsafe P(R);
22 return [Ls, Us]
Algorithm 2 presents the main algorithm that is a detailed treatment of
the workflow in Figure 1. It operates through an iterative refinement loop that
progressively subdivides the input space.
Chunk 76
Rather than attempting to resolve all
unknown hulls equally, we strategically select and refine only the candidates with
the highest probability mass among the unknown hulls. 4.2
Analysis of the Framework
Correctness of the Result.
Chunk 77
We investigate the correctness of the result from
Algorithm 2. First, all of the probabilistic hulls generated intersect at most
on the facets.
Chunk 78
To see that, each hull is obtained as a leaf of a regression tree
which subdivide a rougher hull into sub-hulls whose intersections are at most
the facets. Therefore, each probabilistic hull in the set Rsafe and Runsafe has a
probability which is independent from the probabilities represented by the other
--- Page 9 ---
Probabilistic Verification of Neural Networks
9
hulls.
Chunk 79
Next, we show that Us is an over-estimate of the actual safe probability
and Ls is an under-estimate of probability. Due to the conservativeness of the
tool CROWN, all hulls in Runsafe are unsafe and all hulls in P
R∈Rsafe are safe.
Chunk 80
Since the probabilities of the hulls in Runsafe are independent, we infer that the
DNN is at least 1 −Us = P
R∈Runsafe P(R) unsafe, and thereby at most Us safe. Analogously, the DNN is at least Ls = P
R∈Rsafe P(R) safe since the hulls are
safe and representing independent probability.
Chunk 81
We also have that Ls ≤Us. Termination of the Main Algorithm.
Chunk 82
We show that Algorithm 2 will eventually
terminate with Us−Ls < ε. Our main algorithm can be viewed as a new branch-
and-bound scheme with probability density and boundary-aware subdivision.
Chunk 83
During the iterations, the sum of the volumes of the hulls in the unknown set
Runknown becomes smaller and smaller due to the repeated refinement. A proof
for the termination can be directly obtained by adapting the termination proof of
the standard branch-and-bound method [6].
Chunk 84
The volume of the region defined by
the union of all undecided hulls, i.e., the hulls in the set of Runknown, converges to
zero when the number of iterations goes to infinity. Hence, the total probability
of all hulls in Runknown will eventually be below the given threshold ε after a
finite number of iterations.
Chunk 85
It is also the probability that lies out of the confirmed
safe and unsafe probability estimates, i.e., Us −Ls < ε. Theorem 2.
Chunk 86
Algorithm 2 terminates with an interval that contains the actual
safe probability of the DNN. When Algorithm 2 terminates, the union of the hulls in Runknown forms an
overapproximation of the image of the safety boundary via the inverse mapping
of the DNN.
Chunk 87
5
Experiments
5.1
Experimental Setup
Platform and Tools. Our experiment is conducted on a Linux server running
Ubuntu 18.04, equipped with an Intel Xeon E5-2678 v3 processor and a GPU of
Nvidia RTX 4090.
Chunk 88
We also use auto_LiRPA [21] library to compute the bounds,
and our implementation takes as input neural networks in PyTorch format. The
safety verification of probabilistic hulls are performed by CROWN [23].
Chunk 89
Our full source code, scripts, and necessary experimental data are publicly
available on GitHub at https://github.com/LIJYann/Regression-Tree-Guided-
Probabilistic-Hull-Generation. The repository corresponds to the code version
used for this submission, ensuring reproducibility.
Chunk 90
Comparison Baselines. We compare our approach with two state-of-the-art
tools, ProbStar [14] and the basic branch-and-bound (BaB) method [2] (binary
partition on the longest dimension).
Chunk 91
The comparison is done on three groups
of DNN benchmarks: ACAS Xu, the rocket lander controller benchmark for
SpaceEx Falcon9, and the tanh-based DNNs distilled from those in ACAS Xu
--- Page 10 ---
10
J. Li et al.
Chunk 92
(see [4]). We use a less restrictive condition, maxR∈Runknown P(R) ≤ε, for the
termination condition in order to reach a reasonable tradeoff between efficiency
and accuracy.
Chunk 93
This condition is used in both our method and the BaB in order
to give a fair comparison. The parallel computation in BaB and our approach
are conducted on GPU with auto_LiRPA.
Chunk 94
However, due to the implementation
of ProbStar, it can only be parallelized on CPU, and we use the default parallel
setting for ProbStar. The experimental results from the tools are compared based
on the runtime and the tightness of the upper and lower bounds for the actual
safe probability.
Chunk 95
Hyperparameter Settings. To systematically evaluate our regression-tree-
guided probabilistic verification method across different parameter configura-
tions, we conducted a comprehensive grid search on each benchmark.
Chunk 96
We test all
combinations of key parameters through their Cartesian product, enabling us to
identify optimal configurations. We examined three critical parameters: sampling
strategy, regression tree depth, and impurity hyperparameter settings.
Chunk 97
Overall,
the configurations can be described as ((wuniform, wdistribution), depth, (α, β)):
– Sampling strategy: Weight configurations balancing uniform and distribu-
tional sampling (wuniform, wdistribution);
– Regression tree depth: Depth limits on regression tree;
– Impurity scheduler: Combined the proposed α-weighted measure with longest-
dimension splitting. Initial splits follow the longest dimension until the hull
probability reaches threshold β, then transition to the soft-coded impurity
measure (α, β).
Chunk 98
Fig. 3.
Chunk 99
Boundary-aware subdivision for the toy example. Illustrative Toy Examples.
Chunk 100
Before we show the comparisons, Figure 3 il-
lustrates the probabilistic hulls computed for the toy example described in [14]. --- Page 11 ---
Probabilistic Verification of Neural Networks
11
The red boxes are the probabilistic hulls that are identified as unsafe.
Chunk 101
The green
boxes are the safe hulls. While the yellow ones are the boxes whose corresponding
outputs intersect the safety boundary.
Chunk 102
The union of the yellow boxes forms an
overapproximation of the image of the safety boundary under the inverse map-
ping of the neural network. It can be seen that the state far from the boundary
is often included by large probabilistic hulls, i.e., some unnecessary subdivisions
can be avoided using our boundary-aware method.
Chunk 103
The rest of the section ad-
dresses the experimental evaluation goals as mentioned previously. 5.2
ACAS Xu Benchmark
The ACAS Xu collision avoidance system benchmarks [8] consist of 45 networks
(5×9 configurations) with 5 inputs, 6 layers, and 50 neurons per layer.
Chunk 104
We test
Property P2 in the benchmarks which requires the network not to output “COC”
as the optimal action. This means the network output is unsafe when the first
output is the minimal.
Chunk 105
In this comparison, all methods filter regions (branches
in ProbStar’s case) smaller than 10−5. We use 1000 initial sampling points and
100 iterative sampling points in our approach.
Chunk 106
As for the hyperparameters, we performed the grid search on ACAS network
1-6 for specification 2 and determine the optimal configuration to be pure dis-
tributional sampling (0.0,1.0), depth 5, β = 0.75, α = 0.05. We therefore adopt
these hyperparameter settings for the ACAS benchmark.
Chunk 107
The full illustration of
the grid search can be viewed in Section A. Table 1 shows the results for all tools, where the best bounds and runtimes
are marked in bold (and for all subsequent tables).
Chunk 108
From the table, we observe
that BaB times out on all the benchmarks, and our method mostly generates
better bounds than ProbStar. Specifically, our bounds uniformly outperforms
ProbStar in the bounds for unknown probabilities Us −Ls, and our runtime
is comparable with ProbStar.
Chunk 109
Besides, the parallel paradigm originated from
CROWN yields substantial performance gains, with an average speedup of 4.8×
compared to our non-parallel version. 5.3
Rocket Lander Controller Benchmark
To further demonstrate the ability of our method to handle high-dimensional
neural networks, we conduct evaluation on the Rocket Lander [22], which sim-
ulates the vertical landing control of a SpaceX Falcon 9 first-stage rocket.
Chunk 110
The
objective is to safely land the rocket on a sea-based platform from a given height,
which requires precise control of both the rocket’s velocity and lateral angle. The control system outputs three components: a main engine thruster (FE ∈
[0, 1]) with an adjustable angle (ϕ), and two side nitrogen thrusters (FS ∈[−1, 1],
where -1 and 1 indicate full throttle of right and left thruster respectively).
Chunk 111
The
system state is represented by a 9-dimensional vector: (x, y, vx, vy, θ, ω, F ′
E, F ′
S, ϕ′),
where (x, y) is the position relative to the landing center; (vx, vy) are the hor-
izontal and vertical velocities; θ is the lateral angle; ω is the angular velocity;
and (F ′
E, F ′
S, ϕ′) are the previous control actions. --- Page 12 ---
12
J.
Chunk 112
Li et al. Table 1.
Chunk 113
Results on ACAS Xu Benchmark for Property P2. Legend - Net: DNN
benchmarks, Approach: our approach ("Ours"), ProbStar and BaB, Ls: the lower
bound of the safety probability interval, Us: the upper bound of the safety probability
interval, Us −Ls: upper bound for the unknown probability, Time (non parallel): the
runtime without parallel computing in seconds, Time (parallel): the runtime with par-
allel computing in seconds, T.O.: Time out, >2 hours.
Chunk 114
Net Approach
Ls
Us
Us −Ls
Time (non parallel) Time (parallel)
1-6
Ours
0.984714 1.000000 0.015286
124.21
46.58
1-6
ProbStar
0.933739 0.997192 0.063453
2840.35
315.75
1-6
BaB
-
-
-
T.O. T.O.
Chunk 115
2-2
Ours
0.924122 0.991622 0.067500
1248.70
240.79
2-2
ProbStar
0.892540 0.980435 0.087895
3869.90
430.20
2-2
BaB
-
-
-
T.O. T.O.
Chunk 116
2-9
Ours
0.954046 0.999941 0.045895
819.22
249.24
2-9
ProbStar
0.879616 0.999745 0.120129
6715.75
746.56
2-9
BaB
-
-
-
T.O. T.O.
Chunk 117
3-1
Ours
0.923320 0.972056 0.048735
949.94
189.76
3-1
ProbStar
0.913943 0.969500 0.055557
2462.36
273.73
3-1
BaB
-
-
-
T.O. T.O.
Chunk 118
3-6
Ours
0.914964 0.982248 0.067284
1808.11
359.85
3-6
ProbStar
0.879708 0.979217 0.099509
5191.90
577.16
3-6
BaB
-
-
-
T.O. T.O.
Chunk 119
3-7
Ours
0.923570 0.999888 0.076318
1817.89
386.33
3-7
ProbStar
0.911578 0.997681 0.086103
4187.35
465.49
3-7
BaB
-
-
-
T.O. T.O.
Chunk 120
4-1
Ours
0.961700 0.999953 0.038253
884.45
155.31
4-1
ProbStar
0.914955 0.998963 0.084008
3120.21
346.86
4-1
BaB
-
-
-
T.O. T.O.
Chunk 121
4-7
Ours
0.913432 0.979347 0.065914
1604.20
381.73
4-7
ProbStar
0.878505 0.979215 0.100710
3917.94
435.54
4-7
BaB
-
-
-
T.O. T.O.
Chunk 122
5-3
Ours
0.973373 1.000000 0.026627
336.51
81.21
5-3
ProbStar
0.953952 1.000000 0.046048
1306.79
145.27
5-3
BaB
-
-
-
T.O. T.O.
Chunk 123
--- Page 13 ---
Probabilistic Verification of Neural Networks
13
We directly use the control policy network provided in ProbStar. The network
has 9 inputs, 3 outputs, and 5 hidden layers containing 20 ReLU neurons each.
Chunk 124
We focus on two critical safety properties that ensure proper tilt control:
– P1 (Right Tilt Prevention): For states where θ ∈[−20◦, −6◦], ω < 0, ϕ′ ≤0,
and F ′
S ≤0, the network must output either ϕ < 0 or FS < 0. – P2 (Left Tilt Prevention): For states where θ ∈[6◦, 20◦], ω ≥0, ϕ′ ≥0, and
F ′
S ≥0, the network must output either ϕ > 0 or FS > 0.
Chunk 125
As for the hyperparameters, we performed the grid search on Net 1 for
specification 1 and determine the optimal configuration to be mixed sampling
(wuniform = 0.25, wdistribution = 0.75), depth 5, β = 0.0, α = 0.05. We there-
fore adopt these hyperparameter settings for this benchmark.
Chunk 126
The scatter point
figure of the grid search can be viewed in Section A. We conduct experiments on the Rocket Lander controller to compare our
method with ProbStar.
Chunk 127
In this comparison, all methods filter regions (branches
in ProbStar’s case) smaller than 10−3. We do not run BaB since it cannot handle
high-dimensional input due to inherent combinatorial explosion.
Chunk 128
For our method,
we use 9000 initial sampling points and 900 iterative sampling points to adapt to
the high input dimension of this benchmark. Similar to the setting in Table 1, we
compare both the bounds and the runtime (including parallel and non-parallel).
Chunk 129
The results are shown in Table 2 (non-parallel) and Table 3 (parallel). From these
tables, we observe that our method consistently generates significantly tighter
bounds for the unknown probability (Us −Ls) across all configurations.
Chunk 130
Further-
more, our method demonstrates superior time efficiency, achieving substantial
runtime reductions in both non-parallel and parallel settings. Table 2.
Chunk 131
Experiments on Rocket Lander Controller (non-parallel). Net & Prop means
different configurations in the original benchmark.
Chunk 132
Others are the same as Table 1. Net & Prop Approach Us −Ls
Time (s)
0 & 1
ProbStar
0.715046
418.183
0 & 1
Ours
0.650548 171.933
0 & 2
ProbStar
0.831752
597.775
0 & 2
Ours
0.262993
85.482
1 & 1
ProbStar
0.554032
438.278
1 & 1
Ours
0.229874
70.723
1 & 2
ProbStar
0.356913
319.571
1 & 2
Ours
0.079113
23.287
5.4
Efficiency of Finding Probabilistic Hulls
We compare the efficiency of our method and the BaB on finding safe and unsafe
probabilistic hulls, since both of the methods use probabilistic hulls to estimate
--- Page 14 ---
14
J.
Chunk 133
Li et al. Table 3.
Chunk 134
Experiments on Rocket Lander Controller (parallel). The legend is the same
as Table 2.
Chunk 135
Net & Prop Approach Us −Ls
Time (s)
0 & 1
ProbStar
0.715046
141.047
0 & 1
Ours
0.651642
33.767
0 & 2
ProbStar
0.831752
225.331
0 & 2
Ours
0.265237
23.573
1 & 1
ProbStar
0.554032
225.331
1 & 1
Ours
0.230962
18.900
1 & 2
ProbStar
0.356913
117.494
1 & 2
Ours
0.079063
12.947
the safe probability. We use the distilled DNNs [4] of the ones in the ACAS Xu
set.
Chunk 136
All of the new DNNs have tanh activation function. Moreover, we consider
the parallel setting for both our method and BaB.
Chunk 137
The hyperparameters are the
same as in the original ACAS Xu experiments. The comparison is made in the following way.
Chunk 138
We set up an end time of 1800
seconds to let both of the methods find probabilistic hulls. Each of the methods
may terminate either at the end time or when the early termination condition
holds (i.e., the highest probability of the undetermined sets is less than the given
threshold ε, similar to maxR∈Runknown P(R) ≤ε).
Chunk 139
Then we compare the runtime
and the estimated upper and lower bounds for the safe probability. Clearly, the
tighter the bounds, the better the result.
Chunk 140
We use ε = 10−5, and the experimental results are given in Table 4. It can be
seen that our method always stops before the end time and the runtime are much
lower than the basic BaB method.
Chunk 141
On the other hand, the bounds obtained from
our method are also much tighter than those from the basic BaB ones. Hence,
our method is much more efficient in finding safe and unsafe probabilistic hulls.
Chunk 142
5.5
Discussion
Basically, our method can handle most of the feedforward DNNs with various
activation functions, since we use the given DNN as a blackbox to produce the
samples. This is a clear advantage over the state-of-the-art tools such as ProbStar
which requires neural networks only to have ReLU activation functions.
Chunk 143
On the
other hand, our boundary-aware subdivision demonstrate a great performance
improvement in not only time efficiency but also accuracy comparing to the basic
branch-and-bound method. In the experiments, we also find two weaknesses of our method.
Chunk 144
First, the
total runtime highly depends on the time spent by CROWN which is used to
verify the safety of a hull in the input space. It can be improved by designing a
more efficient output range analysis technique which is particularly for the hulls
--- Page 15 ---
Probabilistic Verification of Neural Networks
15
Table 4.
Chunk 145
Efficiency of Finding Probabilistic Hulls (parallel). Legend - Time: runtime
in seconds, End: end time = 1800 seconds, Others are the same as those in Table 1.
Chunk 146
Net Approach
Ls
Us
Us −Ls Time (s)
1-6
Ours
0.9815
0.9995 0.0180
264.00
1-6
BaB
0.9397
1.0000
0.0603
End
2-2
Ours
0.9819 0.9997 0.0178
142.30
2-2
BaB
0.9616
1.0000
0.0384
End
2-9
Ours
0.9502
0.9994 0.0492
686.66
2-9
BaB
0.8896
1.0000
0.1104
End
3-1
Ours
0.9865
1.0000
0.0135
68.06
3-1
BaB
0.9845
1.0000
0.0155
1413.22
3-6
Ours
0.9293
0.9976 0.0682 1188.63
3-6
BaB
0.7738
1.0000
0.2262
End
3-7
Ours
0.8673
0.9448 0.0775 1338.55
3-7
BaB
0.7682
1.0000
0.2318
End
4-1
Ours
0.9864
1.0000 0.0136
59.52
4-1
BaB
0.9837
1.0000
0.0163
1019.67
4-7
Ours
0.9778
1.0000 0.0222
397.20
4-7
BaB
0.8993
1.0000
0.1007
End
5-3
Ours
0.9823
1.0000 0.0177
303.09
5-3
BaB
0.8886
1.0000
0.1114
End
in our problem. Second, our method still suffers from the curse of dimensional-
ity in the worst case, although the boundary-aware subdivision helps a lot on
avoiding subdividing the non-critical regions.
Chunk 147
This aspect can be improved by
considering symbolic representations of the hulls, such as Taylor models [5]. Besides, we observed that some hulls only have a very small intersection
with the boundary but requires many subdivisions to find purely safe or unsafe
subsets.
Chunk 148
For such a hull, we consider to develop shrinking methods that can fast
find subsets by contracting the original hull. 6
Conclusion
We present a novel approach of probabilistic verification of DNNs and the main
idea is to efficiently find safe or unsafe probabilistic hulls using a boundary-aware
subdivision strategy.
Chunk 149
Our method does not have a limitation to particular acti-
vation functions and shows a much better performance than the similar existing
techniques. In the future, we seek to improve the method by (i) developing more ef-
ficient output range computation techniques particularly for the probabilistic
hulls, to make the verification step less time-consuming; (ii) designing new sym-
bolic representations for the hulls such that they are not limited to boxes or
parallelotopes.
Chunk 150
--- Page 16 ---
16
J. Li et al.
Chunk 151
References
1. Bain, L.J., Engelhardt, M.: Introduction to Probability and Mathematical Statis-
tics (2nd ed.).
Chunk 152
Cengage Learning (1992)
2. Boetius, D., Leue, S., Sutter, T.: Probabilistic verification of neural networks using
branch and bound.
Chunk 153
CoRR abs/2405.17556 (2024)
3. Cardelli, L., Kwiatkowska, M., Laurenti, L., Paoletti, N., Patane, A., Wicker, M.:
Statistical guarantees for the robustness of bayesian neural networks.
Chunk 154
In: Proceed-
ings of the Twenty-Eighth International Joint Conference on Artificial Intelligence,
IJCAI 2019, Macao, China, August 10-16, 2019. pp.
Chunk 155
5693–5700 (2019)
4. Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey.
Chunk 156
Int. J.
Chunk 157
Comput. Vis.
Chunk 158
129(6), 1789–1819 (2021)
5. Huang, C., Fan, J., Chen, X., Li, W., Zhu, Q.: POLAR: A polynomial arithmetic
framework for verifying neural-network controlled systems.
Chunk 159
In: Bouajjani, A., Holík,
L., Wu, Z. (eds.) Proceedings of the 20th International Symposium on Automated
Technology for Verification and Analysis.
Chunk 160
Lecture Notes in Computer Science, vol. 13505, pp.
Chunk 161
414–430. Springer (2022)
6.
Chunk 162
Jaulin, L., Kieffer, M., Didrit, O., Walter, E.: Applied Interval Analysis. Springer
(2001)
7.
Chunk 163
Katz, G., Barrett, C.W., Dill, D.L., Julian, K.D., Kochenderfer, M.J.: Reluplex: An
efficient SMT solver for verifying deep neural networks. In: The 29th International
Conference on Computer Aided Verification (CAV 2017).
Chunk 164
vol. 10426, pp.
Chunk 165
97–117. Springer (2017)
8.
Chunk 166
Manfredi, G., Jestin, Y.: An introduction to acas xu and the challenges ahead. In: 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC).
Chunk 167
pp. 1–9
(2016)
9.
Chunk 168
Mangal, R., Nori, A.V., Orso, A.: Robustness of neural networks: a probabilis-
tic and practical approach. In: Proceedings of the 41st International Conference
on Software Engineering: New Ideas and Emerging Results, ICSE (NIER) 2019,
Canada.
Chunk 169
pp. 93–96 (2019)
10.
Chunk 170
Meng, M.H., Bai, G., Teo, S.G., Hou, Z., Xiao, Y., Lin, Y., Dong, J.S.: Adversarial
robustness of deep neural networks: A survey from a formal verification perspective. CoRR abs/2206.12227 (2022)
11.
Chunk 171
Mitchell, T.M.: Machine Learning. McGraw-Hill Education (1997)
12.
Chunk 172
Pilipovsky, J., Sivaramakrishnan, V., Oishi, M., Tsiotras, P.: Probabilistic verifica-
tion of relu neural networks via characteristic functions. In: Learning for Dynamics
and Control Conference, L4DC 2023, 15-16 June 2023, Philadelphia, PA, USA.
Chunk 173
Proceedings of Machine Learning Research, vol. 211, pp.
Chunk 174
966–979 (2023)
13. Tambon, F., Laberge, G., An, L., Nikanjam, A., Mindom, P.S.N., Pequignot, Y.,
Khomh, F., Antoniol, G., Merlo, E., Laviolette, F.: How to certify machine learning
based safety-critical systems?
Chunk 175
A systematic literature review. Autom.
Chunk 176
Softw. Eng.
Chunk 177
29(2), 38 (2022)
14. Tran, H., Choi, S., Okamoto, H., Hoxha, B., Fainekos, G., Prokhorov, D.V.: Quan-
titative verification for neural networks using probstars.
Chunk 178
In: Proceedings of the
26th ACM International Conference on Hybrid Systems: Computation and Con-
trol, HSCC 2023, USA,. pp.
Chunk 179
4:1–4:12 (2023)
15. Wang, S., Zhang, H., Xu, K., Lin, X., Jana, S.S., Hsieh, C.J., Kolter, J.Z.: Beta-
crown: Efficient bound propagation with per-neuron split constraints for neural
network robustness verification.
Chunk 180
In: 34th Annual Conference on Neural Information
Processing Systems, (NeurIPS 2021). pp.
Chunk 181
29909–29921 (2021)
--- Page 17 ---
Probabilistic Verification of Neural Networks
17
16. Webb, S., Rainforth, T., Teh, Y.W., Kumar, M.P.: A statistical approach to as-
sessing neural network robustness.
Chunk 182
In: 7th International Conference on Learning
Representations, USA (2019)
17. Weng, T.W., Chen, P.Y., Nguyen, L.M., Squillante, M.S., Boopathy, A., Oseledets,
I., Daniel, L.: Proven: Certifying robustness of neural networks with a probabilistic
approach 97 (2019)
18.
Chunk 183
Wicker, M., Laurenti, L., Patane, A., Paoletti, N., Abate, A., Kwiatkowska, M.:
Probabilistic reach-avoid for bayesian neural networks. Artif.
Chunk 184
Intell. 334, 104132
(2024)
19.
Chunk 185
Wicker, M., Patane, A., Laurenti, L., Kwiatkowska, M.: Adversarial robustness
certification for bayesian neural networks. In: Formal Methods - 26th International
Symposium, FM 2024, Milan, Italy, September 9-13, 2024, Proceedings, Part I.
Chunk 186
Lecture Notes in Computer Science, vol. 14933, pp.
Chunk 187
3–28 (2024)
20. Wu, H., Isac, O., Zeljic, A., Tagomori, T., Daggitt, M.L., Kokke, W., Refaeli,
I., Amir, G., Julian, K., Bassan, S., Huang, P., Lahav, O., Wu, M., Zhang, M.,
Komendantskaya, E., Katz, G., Barrett, C.W.: Marabou 2.0: A versatile formal
analyzer of neural networks.
Chunk 188
In: Computer Aided Verification - 36th International
Conference, CAV 2024, Montreal, QC, Canada, July 24-27, 2024, Proceedings, Part
II. Lecture Notes in Computer Science, vol.
Chunk 189
14682, pp. 249–264 (2024)
21.
Chunk 190
Xu, K., Shi, Z., Zhang, H., Wang, Y., Chang, K., Huang, M., Kailkhura, B., Lin,
X., Hsieh, C.: Automatic perturbation analysis for scalable certified robustness and
beyond. In: Advances in Neural Information Processing Systems 33: Annual Con-
ference on Neural Information Processing Systems 2020, NeurIPS 2020, December
6-12, 2020, virtual (2020)
22.
Chunk 191
Yang, X., Yamaguchi, T., Tran, H., Hoxha, B., Johnson, T.T., Prokhorov, D.V.:
Neural network repair with reachability analysis. In: Formal Modeling and Analy-
sis of Timed Systems - 20th International Conference, FORMATS 2022, Warsaw,
Poland, September 13-15, 2022, Proceedings.
Chunk 192
Lecture Notes in Computer Science,
vol. 13465, pp.
Chunk 193
221–236. Springer (2022)
23.
Chunk 194
Zhang, H., Weng, T.W., Chen, P.Y., Hsieh, C.J., Daniel, L.: Efficient neural net-
work robustness certification with general activation functions. In: 31st Annual
Conference on Neural Information Processing Systems 2018, (NeurIPS 2018).
Chunk 195
pp. 4944–4953 (2018)
A
Hyperparameter and Grid Search
We performed an extensive grid search to determine the best configuration for the
core components of our approach: the sampling strategy, the impurity scheduler,
and the regression tree depth.
Chunk 196
A.1
ACAS Xu Benchmark
For the ACAS Xu benchmark, the search space was defined as follows. – Sampling Weights (wuniform, wdistribution): Five configurations were tested.
Chunk 197
(0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), and (1.0, 0.0). – Regression Tree Depth: Two values were considered.
Chunk 198
5 and 10. – Impurity Splitting (β): Five values were tested.
Chunk 199
0, 0.25, 0.5, 0.75, and 1.0. --- Page 18 ---
18
J.
Chunk 200
Li et al. – Impurity Coefficient (α): Six values were evaluated.
Chunk 201
0.05, 0.1, 0.3, 0.5, 1.0,
and 2.0. This parameter space resulted in 25 impurity configurations and a total of 250
parameter combinations.
Chunk 202
The evaluation metrics for each combination were the
remaining unknown region probability (Us −Ls) and the corresponding runtime. The grid search was performed on ACAS network 1-6 for specification
2.
Chunk 203
Figure 4 presents the results, mapping the trade-off between Us −Ls (vertical
axis) and runtime (horizontal axis). Each point represents a distinct configura-
tion.
Chunk 204
The red star highlights the Pareto optimal configuration, which outper-
forms all other combinations in both metrics. Gray points indicate configurations
that are dominated by the Pareto solution in at least one metric.
Chunk 205
The search identified the optimal configuration for the ACAS benchmark as
pure distributional sampling (0.0, 1.0), a regression tree depth of 5, β = 0.75, and
α = 0.05. These settings were subsequently adopted for the main experiments
on the ACAS Xu benchmark.
Chunk 206
0
200
400
600
800
1000
1200
Runtime (seconds)
0.05
0.10
0.15
0.20
Us-Ls
Pareto Optimal
((0.00, 1.00), 5, (0.05, 0.75))
Other Configurations
Pareto Optimal (Best & Fastest)
Fig. 4.
Chunk 207
Scatter plot showing the relationship between Us-Ls and runtime for 250 dif-
ferent parameter configurations tested in the grid search. A.2
Rocket Lander Benchmark
The same systematic grid search methodology was applied to the Rocket Lander
controller.
Chunk 208
The search space for this benchmark was defined as follows. – Sampling Weights (wuniform, wdistribution).
Chunk 209
The same five configurations
were tested. (0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), and (1.0, 0.0).
Chunk 210
--- Page 19 ---
Probabilistic Verification of Neural Networks
19
20
40
60
80
100
Runtime (seconds)
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
Us-Ls
Pareto Optimal
((0.25, 0.75), 5, (0.05, 0.00))
Other Configurations
Pareto Optimal (Best & Fastest)
Fig. 5.
Chunk 211
Scatter plot showing the relationship between Us-Ls and runtime for 300 dif-
ferent parameter configurations tested in the grid search. – Regression Tree Depth.
Chunk 212
Three values were considered. 5, 10, and 20.
Chunk 213
– Impurity Splitting (β). Five values were tested.
Chunk 214
0, 0.25, 0.5, 0.75, and 1.0. – Impurity Coefficient (α).
Chunk 215
Five values were evaluated. 0.05, 0.1, 0.3, 0.5,
and 1.0.
Chunk 216
This yielded a total of 300 parameter combinations. The grid search was per-
formed on Rocket Net 1 for specification 1.
Chunk 217
Figure 5 presents the corresponding Pareto front analysis. Consistent with
the ACAS results, the Us −Ls is plotted against the runtime, and the red star
marks the Pareto optimal configuration.
Chunk 218
The optimal configuration determined for the Rocket Lander benchmark was
a *mixed sampling strategy (0.25, 0.75), a regression tree depth of 5, β = 0.0,
and α = 0.05**. These settings were adopted for the main experiments on the
Rocket Lander benchmark.
Chunk 219
B
Ablation Studies
We conduct a comprehensive ablation study to analyze the independent con-
tribution and combinatorial effects of the key components in our proposed ap-
proach. The components investigated include the weighted mixed sampling and
the impurity scheduler defined by α and β in our hyperparameter settings.
Chunk 220
The
ablation studies are conducted on both the high-dimensional Rocket Lander and
the moderate-dimensional ACAS Xu benchmarks. The following sections detail
the findings, quantifying the impact of each mechanism on verification tightness
(Us −Ls) and time efficiency across these diverse benchmarks.
Chunk 221
--- Page 20 ---
20
J. Li et al.
Chunk 222
Table 5. Ablation Study on Rocket Lander and ACAS Xu (Average Results).
Chunk 223
The
table compares the average unknown probability (Us −Ls) and average runtime for
different component configurations on two benchmarks. The weights are wu (uniform)
and wd (distributional).
Chunk 224
Configuration
(wu, wd)
Rocket Lander
ACAS Xu
Avg. Us −Ls Avg.
Chunk 225
Time (s) Avg. Us −Ls Avg.
Chunk 226
Time (s)
Baseline Configurations (Impurity Scheduler Disabled)
Baseline
(1.00, 0.00)
0.9194
30.26
0.0650
279.46
(0.00, 1.00)
0.9242
24.86
0.0681
279.05
(0.75, 0.25)
0.9134
30.74
0.0652
277.95
(0.50, 0.50)
0.9124
29.22
0.0664
279.09
(0.25, 0.75)
0.9124
27.80
0.0662
278.14
Configurations with Impurity Scheduler Enabled
+ Impurity Scheduler
(1.00, 0.00)∗
0.2466
23.17
0.0341
211.61
(0.00, 1.00)
0.3294
25.23
0.0368
221.05
(0.75, 0.25)
0.2371
21.25
0.0340
210.33
(0.50, 0.50)
0.2705
22.02
0.0344
214.28
(0.25, 0.75)
0.2576
21.71
0.0350
218.96
∗For ACAS Xu, the impurity scheduler uses β = 0.75 and α = 0.05 as the
representative Impurity-only configuration for comparison. For Rocket Lander, the
impurity scheduler uses β = 0.0 and α = 0.05 as the representative Impurity-only
configuration for comparison.
Chunk 227
B.1
Contribution of Sampling and Impurity Scheduling
The analysis of results presented in Table 5 focuses on the core question of
whether sampling alone is effective and how the Impurity Scheduler contributes
to the final performance. The data reveals that the Impurity Scheduler is the
critical component for achieving high verification accuracy (Us −Ls).
Chunk 228
When the Impurity Scheduler is disabled, varying the sampling ratio only
provides marginal improvements in Us −Ls. For the Rocket Lander, the best
sampling ratio (wu = 0.25 or wu = 0.50) achieves 0.9124, barely better than
the worst sampling ratio (0.9194).
Chunk 229
For ACAS Xu, the sampling variance shows
negligible impact on accuracy. However, enabling the Impurity Scheduler changes the performance profile.
Chunk 230
Firstly, for the pure uniform sampling case (wu = 1.00), Us −Ls is reduced
from 0.9194 to 0.2466 for Rocket Lander and from 0.0650 to 0.0341 for ACAS
Xu. This confirms that the scheduler can significantly tighten the verification
bounds.
Chunk 231
Furthermore, when the Impurity Scheduler is active, optimizing the mixed
sampling ratio becomes relevant, providing a measurable gain. For both bench-
marks, the (0.75, 0.25) mixed ratio achieves the tightest overall bound (0.2371
for Rocket Lander and 0.0340 for ACAS Xu).
Chunk 232
This indicates that while sampling
is not the main driver of accuracy, an optimal mix of uniform and distributional
sampling is necessary for the final refinement. --- Page 21 ---
Probabilistic Verification of Neural Networks
21
Table 6.
Chunk 233
Ablation Study on Impurity Scheduler Threshold (β). The table compares
the performance (Us −Ls and Runtime) of the Impurity Scheduler by varying the
threshold β, which controls the transition from longest-dimension splitting (β = 1.0)
to soft-coded impurity splitting (β < 1.0).
Chunk 234
Only pure uniform sampling (wu = 1.00) is
used. Threshold β
Splitting Mode
Rocket Lander
ACAS Xu
Avg.
Chunk 235
Us −Ls Avg. Time (s) Avg.
Chunk 236
Us −Ls Avg. Time (s)
Baseline (β Implied)
Longest-Dimension
0.9194
30.26
0.0650
279.46
0.75
Late Activation
0.9194
30.24
0.0341
211.61
0.50
Mid Activation
0.9194
30.53
0.0342
195.23
0.25
Early Activation
0.9194
30.29
0.0378
210.71
0.00
Immediate Soft Impurity
0.2466
23.17
0.0610
349.11
Regarding runtime efficiency, the runtime for nearly all configurations, re-
mains within a similar order of magnitude on both benchmarks.
Chunk 237
For instance,
the Rocket Lander configurations generally fall between 21s and 31s, and the
ACAS Xu configurations are mostly between 210s and 280s. This suggests that
the major performance contribution of the Impurity Scheduler lies in the massive
improvement of accuracy (Us −Ls) convergence, while overall execution time
remains relatively constant across effective configurations.
Chunk 238
B.2
Ablation on Impurity Scheduler Threshold (β)
Given the critical role of the Impurity Scheduler, we conduct a focused ablation
study by varying the threshold β, which controls the strategy for transitioning
from standard longest-dimension splitting to the proposed soft-coded impurity
measure (α-weighted splitting). The results, using pure uniform sampling (wu =
1.00), are summarized in Table 6.
Chunk 239
The optimal transition strategy is found to be
benchmark-dependent. For Rocket Lander benchmark, the critical finding is that the Immediate
Soft Impurity (β = 0.00) mode is optimal.
Chunk 240
This configuration achieves the
tightest bound of 0.2466 and the best runtime of 23.17s within this set. This
suggests that the longest-dimension splitting provides minimal structural benefit,
and immediate application of the boundary-aware measure is key.
Chunk 241
As for ACAS Xu, a hybrid splitting strategy is essential. The Immediate
Soft Impurity (β = 0.00) yields a poor Us −Ls of 0.0610 and the longest runtime
of 349.11s.
Chunk 242
In contrast, the best accuracy (0.0341) is achieved when β = 0.75
(Late Activation), allowing the longest-dimension splitting to run for the initial
portion of the verification. The best runtime (195.23s) is achieved at β = 0.50
(Mid Activation).
Chunk 243
This confirms that for ACAS Xu, initial global refinement via
standard splitting significantly accelerates the process before the fine-grained,
boundary-aware splits are activated. The distinct optimal β values can be explained by the fundamental differences
in input dimensionality and the subsequent geometric challenges presented to the
verification method (which relies on convex hull approximations):
--- Page 22 ---
22
J.
Chunk 244
Li et al. 1.
Chunk 245
Geometric Control: In methods utilizing convex hull approximations, the
Longest-Dimension splitting is employed primarily to maintain a low inter-
dimension ratio for the divided regions. Preventing regions from becoming
highly elongated is vital, as this mitigates the over-approximation error that
occurs when hull methods are applied to non-square regions.
Chunk 246
2. High Dimensionality (Rocket Lander): The 9-dimensional space of
Rocket Lander limits the effectiveness of pure Longest-Dimension splitting.
Chunk 247
The soft impurity measure from the beginning is necessary because its ge-
ometric penalty term (Lα) simultaneously guides the split toward reducing
boundary uncertainty (via MSE) and acts to regularize the region’s aspect
ratio by penalizing splits across large dimensions (L). This combined effi-
ciency makes immediate activation optimal, as the structural splitting stage
provides little benefit.
Chunk 248
3. Initial Refinement Need (ACAS Xu): In the lower 5-dimensional ACAS
Xu space, the Longest-Dimension splitting remains an efficient method for
achieving an initial, coarse, global subdivision.
Chunk 249
This initial phase (enabled by
β = 0.75 or β = 0.50) establishes regions with more uniform aspect ratios. This preprocessing ensures that when the fine-grained soft impurity measure
is activated later, it operates on geometrically well-behaved regions, thereby
improving both accuracy and overall convergence speed.
Chunk 250
This analysis confirms that our impurity scheduler is a highly effective compo-
nent, but its optimal configuration requires careful tuning based on the specific
dimensional and structural characteristics of the benchmark.