Research Papers

Probabilistic Verification Of Neural Networks Via Efficient Probabilistic Hull Generation

Document ID: research-papers-probabilistic-verification-of-neural-networks-via-efficient-probabilistic-hull-generation

Full content

--- Page 1 --- Probabilistic Verification of Neural Networks via Efficient Probabilistic Hull Generation Jingyang Li1, Xin Chen2, Hongfei Fu1, and Guoqiang Li1 1 Shanghai Jiao Tong University, Shanghai 200240, China {lijjjjj, jt002845, li.g}@sjtu.edu.cn 2 University of New Mexico, Albuquerque, NM 87106, USA chenxin@unm.edu Abstract. The problem of probabilistic verification of a neural network investigates the probability of satisfying the safe constraints in the out- put space when the input is given by a probability distribution. It is significant to answer this problem when the input is affected by distur- bances often modeled by probabilistic variables. In the paper, we propose a novel neural network probabilistic verification framework which com- putes a guaranteed range for the safe probability by efficiently finding safe and unsafe probabilistic hulls. Our approach consists of three main innovations: (1) a state space subdivision strategy using regression trees to produce probabilistic hulls, (2) a boundary-aware sampling method which identifies the safety boundary in the input space using samples that are later used for building regression trees, and (3) iterative refine- ment with probabilistic prioritization for computing a guaranteed range for the safe probability. The accuracy and efficiency of our approach are evaluated on various benchmarks including ACAS Xu and a rocket lan- der controller. The result shows an obvious advantage over the state of the art. Keywords: Neural Network · Formal Verification · Probabilistic Verifi- cation. 1 Introduction Formal verification of deep neural networks (DNNs) is crucial for ensuring the trustworthiness and the reliability of learning-enabled systems in safety-critical scenarios [10, 13]. While most existing techniques aim at solving the output range analysis problem of DNNs [7, 20, 23], it is still difficult to know the safety of a DNN when the input is affected by noises that are often modeled by random variables. Since noises or disturbances are often described by Gaussian variables, their values are not bounded and can hardly be handled by existing output range analysis methods. We propose a new approach which generates a guaranteed range for the safety of a DNN when its input is Gaussian variables. Probabilistic verification [17, 9, 16, 2, 14] involves analyzing the probability of safety violations under specified input distributions instead of giving a determin- istic answer to the safety query. In order to estimate the probability of satisfying arXiv:2604.21556v1 [cs.AI] 23 Apr 2026 --- Page 2 --- 2 J. Li et al. the safe constraints in the output space of a DNN, a straightforward approach could be to compute the output distribution and then integrate its density func- tion over the safe output space. However, it is often not feasible due to the complexity of DNNs. In this paper, we propose a method which conservatively estimate the safety of a DNN based on a set of safe or unsafe probabilistic hulls that at most intersect on facets. A probabilistic hull is a closed and bounded set in the input space whose orientation is aligned with the Gaussian distribution of the input, such that the probability of it can be efficiently computed based on the error function [1]. An upper bound of safety can be obtained by subtracting the probabilities of all probabilistic hulls which are purely unsafe, while a lower bound can be derived by adding the probabilities of all safe hulls. The main contribution of our method is an efficient search strategy for finding large safe and unsafe hulls. Existing probabilistic neural network verification methods suffer from the curse of dimensionality. Typically, the existing set-based techniques [23, 15] often prioritize partitioning the input space uniformly that cannot recognize critical regions. This leads to exponential amount of subdivisions and is time consuming. Other methods, such as ProbStar [14] symbolically propagates the input set to the output space and then estimate their safe probability, it is able to more accurately track the input-output dependency however cannot handle the DNNs whose activations are not ReLU. We study the probabilistic verification problem on feedforward DNNs. We propose a novel regression tree-guided probabilistic verification framework that achieves both efficiency and generality. Our key insight is that neural network decision/safe boundaries, i.e., the boundary dividing the safe and unsafe set in the output space, can be effectively approximated through adaptive region par- titioning guided by sampling-based regression trees, eliminating the unnecessary exploration and subdivision in the regions that are purely safe or unsafe. Our approach makes three primary contributions: (1) We introduce a regression tree- guided state space splitting strategy for the input space that efficiently generates purely safe or unsafe probabilistic hulls of large probabilities. It avoids the un- necessary subdivisions in the safe or unsafe regions in most practical cases. (2) We propose a boundary-aware sampling method which effectively identifies the safety boundary in the input space by random samples which are later used to build regression trees. (3) We develop an incremental sampling mechanism that iteratively refines the undecided regions with the largest probability, avoiding the prohibitive cost of dense sampling across the entire input space. Experimen- tal evaluation on the challenging benchmarks demonstrates that our method achieves superior accuracy against the state of the art and achieves up to 10× speedup compared to basic branch-and-bound approaches. 2 Related Work Linear bound propagation techniques, exemplified by CROWN [23], have rev- olutionized neural network verification by extending traditional interval bound --- Page 3 --- Probabilistic Verification of Neural Networks 3 propagation to linear function-based bound propagation. This advancement sig- nificantly improves scalability while maintaining reasonable tightness for veri- fication tasks. Building upon CROWN, β-CROWN [15] integrates branch-and- bound (BaB) techniques with GPU acceleration to achieve both completeness and scalability in verification processes. Probabilistic verification has emerged as a critical paradigm for quantify- ing neural network safety under stochastic inputs. One part of the research focus on Bayesian neural network verification [3, 18, 19]. Early approaches like PROVEN [17] established distribution-aware robustness guarantees by integrat- ing probabilistic measures with formal verification techniques. Subsequently, hy- brid methods [9] combined abstract interpretation with Monte Carlo sampling to balance computational efficiency with probabilistic assurances. Statistical evalu- ation frameworks [16] further advanced the field by employing adaptive sampling strategies to estimate perturbation tolerance while quantifying violation proba- bilities. Besides, [12] proposes a characteristic-function-based probabilistic ver- ification method which estimates the safe probability via frequency-domain. Recent developments have focused on scalability. The probabilistic branch- and-bound approach [2] iteratively refines probability bounds through dynamic region subdivision, achieving improved verification success rates but suffering from exponential complexity in high-dimensional spaces. Most notably, Prob- Star [14] introduces probabilistic star-based reachability analysis for efficient uncertainty propagation in safety-critical systems. However, ProbStar is limited to ReLU networks due to its reliance on linear region enumeration, restricting its applicability to networks with generic activation functions. 3 Preliminaries A deterministic input to a DNN can be denoted by a column vector x = (x1, . . . , xd). Therefore, a noise-affected input can be denoted by x + w where w = (w1, . . . , wd) are the noises for each input variable. In this paper, we fo- cus on Gaussian noises, i.e., w subject to a multivariate Gaussian distribution, since they are the most popular models for noises in practice. Since Gaussian distributions are closed under affine transformations, a noise-affected input can be represented by Gaussian variables, i.e., x ∼N(µ, Σ) where µ is the mean and Σ is the covariance matrix. Problem Formulation. Let f : Rd →Rm denote the input-output mapping of a (feedforward) DNN which has d inputs and m outputs. Given a Gaussian input x ∼N(µ, Σ) and a safety property in the output space: Γ(y) : Cy ≥a, such that the property is defined by a finite set of affine constraints/inequalities. Here, y is a point/vector in the output space, i.e., y = f(x), C ∈Rk×m is a matrix of coefficients, and a ∈Rk is a constant vector. The DNN probabilistic verification problem asks to estimate the probability P(Γ(y) | x ∼N(µ, Σ)). (1) --- Page 4 --- 4 J. Li et al. As we pointed out that an exact calculation of the probability (1) is in- tractable. Therefore, we conservatively estimate an upper bound as well as a lower bound for the actual probability via computing a finite set of safe and unsafe probabilistic hulls. Given a set of Gaussian variables x ∼N(µ, Σ) such that Σ = diag(σ2 1, . . . , σ2 d), i.e., all variables subject to independent distributions. The probability P(x ∈H), or simply P(H), where the hull H is a box [a1, b1]×· · ·×[ad, bd] can be computed by evaluating the expression Q 1≤i≤d P(ai ≤xi ≤bi) such that P(ai ≤xi ≤bi) = 1 2 erf(bi −µ √ 2σi ) −erf(ai −µ √ 2σi ) where erf is the error function [1]. For any full rank covariance matrix Σ, a hull H aligned with its orientation can be calculated in a similar way. That is, we may transform Σ to a diagonal matrix, transform H to a box using the same mapping and evaluate the probability of being in H with the transformed distribution. Hence, for simple denotation, we assume that Σ is diagonal. In a DNN verification problem, a hull H is safe (unsafe resp.) when all points in its output set f(H) satisfying (violating resp.) the safe property Γ. Input f, Γ, µ, Σ, ε Initialize Runknown = [Space] Rsafe, Runsafe = ∅ P R∈Runknown P(R) ≤ε? Output Safe probability interval [Ls, Us] Select Region Region Rnext with largest probability in Runknown Boundary-Aware Subdivision Extract probabilsitic hulls in Rnext Verify and Update Region Sets Add to Rsafe/Runsafe/Runk No Yes Fig. 1. Workflow of our regression tree-guided probabilistic verification framework. 4 Main Approach The main idea of our approach for estimating conservative upper and lower bounds for the actual safe probability is to iteratively construct a set of safe hulls: Hs 1, . . . , Hs N and a set of unsafe hulls: Hu 1 , . . . , Hu M, such that all hulls intersect at most on the facets. Therefore, each hull represents an independent probability from the others. Then, the upper bound is computed as Us = 1−P 1≤i≤M P(Hu i ), while the lower bound is obtained as Ls = P 1≤i≤N P(Hs i ). Theorem 1. If Hs 1, . . . , Hs N are safe and Hu 1 , . . . , Hu M are unsafe, then we have that Us ≥Ls, and the actual safe probability must be in the interval [Ls, Us]. --- Page 5 --- Probabilistic Verification of Neural Networks 5 Obviously, if the probabilistic hulls can be efficiently found and have high probabilities, an accurate range for safe probability can be obtained. Unlike the related methods, the probabilistic hulls are not generated from random samples. Instead, we build (safe) boundary-ware regression trees which subdivide the input space into box regions, then we use CROWN to verify the safety of them and identify the safe and unsafe subregions as hulls. In the experiments, we show that such a strategy greatly outperforms the method using uniform sampling. The high-level workflow of our framework is given in Figure 1. Given a DNN f, a safety property Γ, a Gaussian distribution N(µ, Σ) for the input, and a ter- mination threshold ε > 0. Here, we only need to use the DNN as a blackbox. The main algorithm keeps three sets of probabilistic hulls: Rsafe, Runsafe, Runknown that are used to keep the set of safe, unsafe and unknown hulls respectively. Ini- tially, Runknown has a single element which represents the whole state space, while Rsafe, Runsafe are both empty. The whole state space can be represented by a large bounded set which is of a probability sufficiently close to 1 (e.g. ≥99.999%) for the inputs, instead of an unbounded support range from Gaussian distributions. Then the main loop performs the following steps: (1) It chooses the hull Rnext which has the highest probability in Runknown and delete it from Runknown; (2) Rnext is subdivided using our boundary-aware subdivision strategy which builds a regression tree and the subdivisions are new probabilistic hulls obtained as the tree leafs; (3) We verify the output sets of the subdivisions. We add the safe hulls to Rsafe, the unsafe hulls to Runsafe, and the hulls that are not purely safe or unsafe to Runknown. The loop terminates when the summation of the proba- bilities of the hulls in Runknown is < ε. The details are described in the following subsections. 4.1 Boundary-Aware Subdivision Strategy Uniform Subdivision Unk Unk Safe Unk Boundary-Aware Subdivision Unk Safe Unsafe Unk — Boundary - - Splits • Samples Fig. 2. Uniform vs. boundary-aware subdivision. Efficiently finding large safe/unsafe hulls plays a key role in our framework. A naive way could be uniformly subdivide the input state space and verify the safety of each piece. However, it is often very expensive and ignores the decision boundary. Directly considering the boundary in probabilistic hull search can be --- Page 6 --- 6 J. Li et al. difficult, since the boundary is in the output space and the subdivision in per- formed in the input space. However, it is still possible to “backpropagate” the information of the boundary to the subdivision process. Figure 2 gives an intu- itive comparison between uniform subdivision and boundary-aware subdivision. The red curve indicates the backpropagated boundary to the input space, it is often not computable and unknown to our process. It can be seen from the figure that, using boundary-aware subdivision can better produce large hulls which do not intersect the boundary and are pure safe or unsafe. Our boundary-aware subdivision method is outlined in Algorithm 1. Given a DNN f, a region R in the input space, number of samples n and the safety specification defined by the parameters C and a. Our whole boundary-ware sub- division process performs the following steps: (1) It samples a point set Sbase with n points within the given region R and check the safety of their outputs; (2) If both safe and unsafe outputs are found, it conducts boundary-aware sampling by rejecting the samples whose outputs are sufficiently far from the boundary, and we obtain a point set S; (3) Otherwise, we set S = Sbase since the boundary might not be in R; (4) We use the samples in S to build a regression tree T in the region R, where each leaf represents a subdivision of R. The details of boundary-aware sampling are given as below. Combined Sampling Strategy Foundation. We introduce the setup of our base sampling strategy. Our base sampling strategy is the combination of the distribu- tional sampling of the input distribution N(µ, Σ) and the uniform distribution. Given an integer n, we sample ⌊n/2⌋points with distributional sampling and sample n −⌊n/2⌋points with uniform sampling. We adopt the combination in- stead of only one of them because (i) we need to take the distribution of the input variables into account since it helps to find probabilistic hulls in the space of high probability density and make the hulls of high probability, while (ii) ad- ditionally considering a uniform distribution helps us to sample the margin of a region, and have balanced safe and unsafe samples. Boundary Awareness Through Elimination. In order to better reflect the safe boundary in R, we check our samples to see if both safe and unsafe points are generated. If not, we do not start the rejection sampling step. Otherwise, we aim at keeping the samples near the boundary. Although it is difficult to compute the boundary in the input space, the distance of a sample x to the boundary can be measured as the distance between f(x) and the boundary. Then we apply an elimination mechanism to remove the samples that are sufficiently far from the boundary. To do so, we associate a probability distribution for eliminating a sample. For each sample x, we compute its output f(x) and the distance to the safety boundary Cf(x) −a. Then, the elimination probability is given by Pe(x) = 1 −exp(−rank(|Cf(x) −a|/σ)) where rank returns the corresponding index in the sorted distance from the nearest to the farthest. --- Page 7 --- Probabilistic Verification of Neural Networks 7 Algorithm 1: Boundary-Aware Subdivision Input: Region R, number of samples n, network f, C, a Output: Boundary-aware regression tree T // Combined sampling foundation 1 Sbase ←CombinedSample(R, n); // Check if rejection sampling is needed 2 if Sbase contains only safe or unsafe points then // No boundary in region, return base samples 3 return Sbase 4 end // Rejection sampling 5 S ←∅; 6 nattempts ←0; 7 while |S| < n and nattempts < max_attempts do 8 x ←CombinedSample(R, n) 9 xaccept ←EliminateSamples(C, f, x) 10 S ←S ∪{xaccept}; 11 nattempts ←nattempts + 1; 12 end // Fill remaining samples if not enough sample points 13 if |S| < n then 14 S ←S∪CombinedSample(R, n −|S|); 15 end 16 T ←BuildRegressionTree(S, R) 17 return T Regression Tree Construction. We use the sample set after elimination to build a regression tree which is expected to better split the region along the boundary. The tree construction follows the standard algorithm [11]. For each non-leaf node, the impurity of the samples is measured as Impurity = MSE(Sleft, Sright) Lα where Sleft, Sright are the split sets, L is the variation of the samples in the dimension of subdivision, and α is a user-specified hyperparameter controlling the geometric penalty. The MSE is computed as MSE(Sleft, Sright) = Var(yi|xi ∈ Sleft) · |Sleft| + Var(yi|xi ∈Sright) · |Sright|. Each subdivision is performed to minimize the impurity of the two subdivisions. --- Page 8 --- 8 J. Li et al. Algorithm 2: The Main Algorithm Input: Network f, C, a, distribution p(x), threshold ε Output: Safe probability interval [Ls, Us] 1 Initialize: Runknown ←[entire_input_space]; 2 Rsafe ←∅, Runsafe ←∅; 3 while P R∈Runknown P(R) ≥ε do // Select region with maximum probability mass 4 Rnext ←SelectLargestProbRegion(Runknown); // Targeted sampling within selected region 5 S ←BoundaryAwareSample(Rnext, small_sample_size); // Build regression tree and extract leaf regions 6 T ←BuildRegressionTree(S, Rnext); 7 {R1, ..., Rk} ←ExtractLeafRegions(T); // Verify each leaf region 8 {s1, ..., sk} ←CROWNVerify({R1, ..., Rk}, C, a); // Update region sets 9 Remove Rnext from Runknown; 10 for i ←1 to k do 11 if si = Safe then 12 Rsafe ←Rsafe ∪{Ri}; 13 else if si = Unsafe then 14 Runsafe ←Runsafe ∪{Ri}; 15 else 16 Runknown ←Runknown ∪{Ri}; 17 end 18 end 19 end 20 Ls ←P R∈Rsafe P(R); 21 Us ←1 −P R∈Runsafe P(R); 22 return [Ls, Us] Algorithm 2 presents the main algorithm that is a detailed treatment of the workflow in Figure 1. It operates through an iterative refinement loop that progressively subdivides the input space. Rather than attempting to resolve all unknown hulls equally, we strategically select and refine only the candidates with the highest probability mass among the unknown hulls. 4.2 Analysis of the Framework Correctness of the Result. We investigate the correctness of the result from Algorithm 2. First, all of the probabilistic hulls generated intersect at most on the facets. To see that, each hull is obtained as a leaf of a regression tree which subdivide a rougher hull into sub-hulls whose intersections are at most the facets. Therefore, each probabilistic hull in the set Rsafe and Runsafe has a probability which is independent from the probabilities represented by the other --- Page 9 --- Probabilistic Verification of Neural Networks 9 hulls. Next, we show that Us is an over-estimate of the actual safe probability and Ls is an under-estimate of probability. Due to the conservativeness of the tool CROWN, all hulls in Runsafe are unsafe and all hulls in P R∈Rsafe are safe. Since the probabilities of the hulls in Runsafe are independent, we infer that the DNN is at least 1 −Us = P R∈Runsafe P(R) unsafe, and thereby at most Us safe. Analogously, the DNN is at least Ls = P R∈Rsafe P(R) safe since the hulls are safe and representing independent probability. We also have that Ls ≤Us. Termination of the Main Algorithm. We show that Algorithm 2 will eventually terminate with Us−Ls < ε. Our main algorithm can be viewed as a new branch- and-bound scheme with probability density and boundary-aware subdivision. During the iterations, the sum of the volumes of the hulls in the unknown set Runknown becomes smaller and smaller due to the repeated refinement. A proof for the termination can be directly obtained by adapting the termination proof of the standard branch-and-bound method [6]. The volume of the region defined by the union of all undecided hulls, i.e., the hulls in the set of Runknown, converges to zero when the number of iterations goes to infinity. Hence, the total probability of all hulls in Runknown will eventually be below the given threshold ε after a finite number of iterations. It is also the probability that lies out of the confirmed safe and unsafe probability estimates, i.e., Us −Ls < ε. Theorem 2. Algorithm 2 terminates with an interval that contains the actual safe probability of the DNN. When Algorithm 2 terminates, the union of the hulls in Runknown forms an overapproximation of the image of the safety boundary via the inverse mapping of the DNN. 5 Experiments 5.1 Experimental Setup Platform and Tools. Our experiment is conducted on a Linux server running Ubuntu 18.04, equipped with an Intel Xeon E5-2678 v3 processor and a GPU of Nvidia RTX 4090. We also use auto_LiRPA [21] library to compute the bounds, and our implementation takes as input neural networks in PyTorch format. The safety verification of probabilistic hulls are performed by CROWN [23]. Our full source code, scripts, and necessary experimental data are publicly available on GitHub at https://github.com/LIJYann/Regression-Tree-Guided- Probabilistic-Hull-Generation. The repository corresponds to the code version used for this submission, ensuring reproducibility. Comparison Baselines. We compare our approach with two state-of-the-art tools, ProbStar [14] and the basic branch-and-bound (BaB) method [2] (binary partition on the longest dimension). The comparison is done on three groups of DNN benchmarks: ACAS Xu, the rocket lander controller benchmark for SpaceEx Falcon9, and the tanh-based DNNs distilled from those in ACAS Xu --- Page 10 --- 10 J. Li et al. (see [4]). We use a less restrictive condition, maxR∈Runknown P(R) ≤ε, for the termination condition in order to reach a reasonable tradeoff between efficiency and accuracy. This condition is used in both our method and the BaB in order to give a fair comparison. The parallel computation in BaB and our approach are conducted on GPU with auto_LiRPA. However, due to the implementation of ProbStar, it can only be parallelized on CPU, and we use the default parallel setting for ProbStar. The experimental results from the tools are compared based on the runtime and the tightness of the upper and lower bounds for the actual safe probability. Hyperparameter Settings. To systematically evaluate our regression-tree- guided probabilistic verification method across different parameter configura- tions, we conducted a comprehensive grid search on each benchmark. We test all combinations of key parameters through their Cartesian product, enabling us to identify optimal configurations. We examined three critical parameters: sampling strategy, regression tree depth, and impurity hyperparameter settings. Overall, the configurations can be described as ((wuniform, wdistribution), depth, (α, β)): – Sampling strategy: Weight configurations balancing uniform and distribu- tional sampling (wuniform, wdistribution); – Regression tree depth: Depth limits on regression tree; – Impurity scheduler: Combined the proposed α-weighted measure with longest- dimension splitting. Initial splits follow the longest dimension until the hull probability reaches threshold β, then transition to the soft-coded impurity measure (α, β). Fig. 3. Boundary-aware subdivision for the toy example. Illustrative Toy Examples. Before we show the comparisons, Figure 3 il- lustrates the probabilistic hulls computed for the toy example described in [14]. --- Page 11 --- Probabilistic Verification of Neural Networks 11 The red boxes are the probabilistic hulls that are identified as unsafe. The green boxes are the safe hulls. While the yellow ones are the boxes whose corresponding outputs intersect the safety boundary. The union of the yellow boxes forms an overapproximation of the image of the safety boundary under the inverse map- ping of the neural network. It can be seen that the state far from the boundary is often included by large probabilistic hulls, i.e., some unnecessary subdivisions can be avoided using our boundary-aware method. The rest of the section ad- dresses the experimental evaluation goals as mentioned previously. 5.2 ACAS Xu Benchmark The ACAS Xu collision avoidance system benchmarks [8] consist of 45 networks (5×9 configurations) with 5 inputs, 6 layers, and 50 neurons per layer. We test Property P2 in the benchmarks which requires the network not to output “COC” as the optimal action. This means the network output is unsafe when the first output is the minimal. In this comparison, all methods filter regions (branches in ProbStar’s case) smaller than 10−5. We use 1000 initial sampling points and 100 iterative sampling points in our approach. As for the hyperparameters, we performed the grid search on ACAS network 1-6 for specification 2 and determine the optimal configuration to be pure dis- tributional sampling (0.0,1.0), depth 5, β = 0.75, α = 0.05. We therefore adopt these hyperparameter settings for the ACAS benchmark. The full illustration of the grid search can be viewed in Section A. Table 1 shows the results for all tools, where the best bounds and runtimes are marked in bold (and for all subsequent tables). From the table, we observe that BaB times out on all the benchmarks, and our method mostly generates better bounds than ProbStar. Specifically, our bounds uniformly outperforms ProbStar in the bounds for unknown probabilities Us −Ls, and our runtime is comparable with ProbStar. Besides, the parallel paradigm originated from CROWN yields substantial performance gains, with an average speedup of 4.8× compared to our non-parallel version. 5.3 Rocket Lander Controller Benchmark To further demonstrate the ability of our method to handle high-dimensional neural networks, we conduct evaluation on the Rocket Lander [22], which sim- ulates the vertical landing control of a SpaceX Falcon 9 first-stage rocket. The objective is to safely land the rocket on a sea-based platform from a given height, which requires precise control of both the rocket’s velocity and lateral angle. The control system outputs three components: a main engine thruster (FE ∈ [0, 1]) with an adjustable angle (ϕ), and two side nitrogen thrusters (FS ∈[−1, 1], where -1 and 1 indicate full throttle of right and left thruster respectively). The system state is represented by a 9-dimensional vector: (x, y, vx, vy, θ, ω, F ′ E, F ′ S, ϕ′), where (x, y) is the position relative to the landing center; (vx, vy) are the hor- izontal and vertical velocities; θ is the lateral angle; ω is the angular velocity; and (F ′ E, F ′ S, ϕ′) are the previous control actions. --- Page 12 --- 12 J. Li et al. Table 1. Results on ACAS Xu Benchmark for Property P2. Legend - Net: DNN benchmarks, Approach: our approach ("Ours"), ProbStar and BaB, Ls: the lower bound of the safety probability interval, Us: the upper bound of the safety probability interval, Us −Ls: upper bound for the unknown probability, Time (non parallel): the runtime without parallel computing in seconds, Time (parallel): the runtime with par- allel computing in seconds, T.O.: Time out, >2 hours. Net Approach Ls Us Us −Ls Time (non parallel) Time (parallel) 1-6 Ours 0.984714 1.000000 0.015286 124.21 46.58 1-6 ProbStar 0.933739 0.997192 0.063453 2840.35 315.75 1-6 BaB - - - T.O. T.O. 2-2 Ours 0.924122 0.991622 0.067500 1248.70 240.79 2-2 ProbStar 0.892540 0.980435 0.087895 3869.90 430.20 2-2 BaB - - - T.O. T.O. 2-9 Ours 0.954046 0.999941 0.045895 819.22 249.24 2-9 ProbStar 0.879616 0.999745 0.120129 6715.75 746.56 2-9 BaB - - - T.O. T.O. 3-1 Ours 0.923320 0.972056 0.048735 949.94 189.76 3-1 ProbStar 0.913943 0.969500 0.055557 2462.36 273.73 3-1 BaB - - - T.O. T.O. 3-6 Ours 0.914964 0.982248 0.067284 1808.11 359.85 3-6 ProbStar 0.879708 0.979217 0.099509 5191.90 577.16 3-6 BaB - - - T.O. T.O. 3-7 Ours 0.923570 0.999888 0.076318 1817.89 386.33 3-7 ProbStar 0.911578 0.997681 0.086103 4187.35 465.49 3-7 BaB - - - T.O. T.O. 4-1 Ours 0.961700 0.999953 0.038253 884.45 155.31 4-1 ProbStar 0.914955 0.998963 0.084008 3120.21 346.86 4-1 BaB - - - T.O. T.O. 4-7 Ours 0.913432 0.979347 0.065914 1604.20 381.73 4-7 ProbStar 0.878505 0.979215 0.100710 3917.94 435.54 4-7 BaB - - - T.O. T.O. 5-3 Ours 0.973373 1.000000 0.026627 336.51 81.21 5-3 ProbStar 0.953952 1.000000 0.046048 1306.79 145.27 5-3 BaB - - - T.O. T.O. --- Page 13 --- Probabilistic Verification of Neural Networks 13 We directly use the control policy network provided in ProbStar. The network has 9 inputs, 3 outputs, and 5 hidden layers containing 20 ReLU neurons each. We focus on two critical safety properties that ensure proper tilt control: – P1 (Right Tilt Prevention): For states where θ ∈[−20◦, −6◦], ω < 0, ϕ′ ≤0, and F ′ S ≤0, the network must output either ϕ < 0 or FS < 0. – P2 (Left Tilt Prevention): For states where θ ∈[6◦, 20◦], ω ≥0, ϕ′ ≥0, and F ′ S ≥0, the network must output either ϕ > 0 or FS > 0. As for the hyperparameters, we performed the grid search on Net 1 for specification 1 and determine the optimal configuration to be mixed sampling (wuniform = 0.25, wdistribution = 0.75), depth 5, β = 0.0, α = 0.05. We there- fore adopt these hyperparameter settings for this benchmark. The scatter point figure of the grid search can be viewed in Section A. We conduct experiments on the Rocket Lander controller to compare our method with ProbStar. In this comparison, all methods filter regions (branches in ProbStar’s case) smaller than 10−3. We do not run BaB since it cannot handle high-dimensional input due to inherent combinatorial explosion. For our method, we use 9000 initial sampling points and 900 iterative sampling points to adapt to the high input dimension of this benchmark. Similar to the setting in Table 1, we compare both the bounds and the runtime (including parallel and non-parallel). The results are shown in Table 2 (non-parallel) and Table 3 (parallel). From these tables, we observe that our method consistently generates significantly tighter bounds for the unknown probability (Us −Ls) across all configurations. Further- more, our method demonstrates superior time efficiency, achieving substantial runtime reductions in both non-parallel and parallel settings. Table 2. Experiments on Rocket Lander Controller (non-parallel). Net & Prop means different configurations in the original benchmark. Others are the same as Table 1. Net & Prop Approach Us −Ls Time (s) 0 & 1 ProbStar 0.715046 418.183 0 & 1 Ours 0.650548 171.933 0 & 2 ProbStar 0.831752 597.775 0 & 2 Ours 0.262993 85.482 1 & 1 ProbStar 0.554032 438.278 1 & 1 Ours 0.229874 70.723 1 & 2 ProbStar 0.356913 319.571 1 & 2 Ours 0.079113 23.287 5.4 Efficiency of Finding Probabilistic Hulls We compare the efficiency of our method and the BaB on finding safe and unsafe probabilistic hulls, since both of the methods use probabilistic hulls to estimate --- Page 14 --- 14 J. Li et al. Table 3. Experiments on Rocket Lander Controller (parallel). The legend is the same as Table 2. Net & Prop Approach Us −Ls Time (s) 0 & 1 ProbStar 0.715046 141.047 0 & 1 Ours 0.651642 33.767 0 & 2 ProbStar 0.831752 225.331 0 & 2 Ours 0.265237 23.573 1 & 1 ProbStar 0.554032 225.331 1 & 1 Ours 0.230962 18.900 1 & 2 ProbStar 0.356913 117.494 1 & 2 Ours 0.079063 12.947 the safe probability. We use the distilled DNNs [4] of the ones in the ACAS Xu set. All of the new DNNs have tanh activation function. Moreover, we consider the parallel setting for both our method and BaB. The hyperparameters are the same as in the original ACAS Xu experiments. The comparison is made in the following way. We set up an end time of 1800 seconds to let both of the methods find probabilistic hulls. Each of the methods may terminate either at the end time or when the early termination condition holds (i.e., the highest probability of the undetermined sets is less than the given threshold ε, similar to maxR∈Runknown P(R) ≤ε). Then we compare the runtime and the estimated upper and lower bounds for the safe probability. Clearly, the tighter the bounds, the better the result. We use ε = 10−5, and the experimental results are given in Table 4. It can be seen that our method always stops before the end time and the runtime are much lower than the basic BaB method. On the other hand, the bounds obtained from our method are also much tighter than those from the basic BaB ones. Hence, our method is much more efficient in finding safe and unsafe probabilistic hulls. 5.5 Discussion Basically, our method can handle most of the feedforward DNNs with various activation functions, since we use the given DNN as a blackbox to produce the samples. This is a clear advantage over the state-of-the-art tools such as ProbStar which requires neural networks only to have ReLU activation functions. On the other hand, our boundary-aware subdivision demonstrate a great performance improvement in not only time efficiency but also accuracy comparing to the basic branch-and-bound method. In the experiments, we also find two weaknesses of our method. First, the total runtime highly depends on the time spent by CROWN which is used to verify the safety of a hull in the input space. It can be improved by designing a more efficient output range analysis technique which is particularly for the hulls --- Page 15 --- Probabilistic Verification of Neural Networks 15 Table 4. Efficiency of Finding Probabilistic Hulls (parallel). Legend - Time: runtime in seconds, End: end time = 1800 seconds, Others are the same as those in Table 1. Net Approach Ls Us Us −Ls Time (s) 1-6 Ours 0.9815 0.9995 0.0180 264.00 1-6 BaB 0.9397 1.0000 0.0603 End 2-2 Ours 0.9819 0.9997 0.0178 142.30 2-2 BaB 0.9616 1.0000 0.0384 End 2-9 Ours 0.9502 0.9994 0.0492 686.66 2-9 BaB 0.8896 1.0000 0.1104 End 3-1 Ours 0.9865 1.0000 0.0135 68.06 3-1 BaB 0.9845 1.0000 0.0155 1413.22 3-6 Ours 0.9293 0.9976 0.0682 1188.63 3-6 BaB 0.7738 1.0000 0.2262 End 3-7 Ours 0.8673 0.9448 0.0775 1338.55 3-7 BaB 0.7682 1.0000 0.2318 End 4-1 Ours 0.9864 1.0000 0.0136 59.52 4-1 BaB 0.9837 1.0000 0.0163 1019.67 4-7 Ours 0.9778 1.0000 0.0222 397.20 4-7 BaB 0.8993 1.0000 0.1007 End 5-3 Ours 0.9823 1.0000 0.0177 303.09 5-3 BaB 0.8886 1.0000 0.1114 End in our problem. Second, our method still suffers from the curse of dimensional- ity in the worst case, although the boundary-aware subdivision helps a lot on avoiding subdividing the non-critical regions. This aspect can be improved by considering symbolic representations of the hulls, such as Taylor models [5]. Besides, we observed that some hulls only have a very small intersection with the boundary but requires many subdivisions to find purely safe or unsafe subsets. For such a hull, we consider to develop shrinking methods that can fast find subsets by contracting the original hull. 6 Conclusion We present a novel approach of probabilistic verification of DNNs and the main idea is to efficiently find safe or unsafe probabilistic hulls using a boundary-aware subdivision strategy. Our method does not have a limitation to particular acti- vation functions and shows a much better performance than the similar existing techniques. In the future, we seek to improve the method by (i) developing more ef- ficient output range computation techniques particularly for the probabilistic hulls, to make the verification step less time-consuming; (ii) designing new sym- bolic representations for the hulls such that they are not limited to boxes or parallelotopes. --- Page 16 --- 16 J. Li et al. References 1. Bain, L.J., Engelhardt, M.: Introduction to Probability and Mathematical Statis- tics (2nd ed.). Cengage Learning (1992) 2. Boetius, D., Leue, S., Sutter, T.: Probabilistic verification of neural networks using branch and bound. CoRR abs/2405.17556 (2024) 3. Cardelli, L., Kwiatkowska, M., Laurenti, L., Paoletti, N., Patane, A., Wicker, M.: Statistical guarantees for the robustness of bayesian neural networks. In: Proceed- ings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019. pp. 5693–5700 (2019) 4. Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey. Int. J. Comput. Vis. 129(6), 1789–1819 (2021) 5. Huang, C., Fan, J., Chen, X., Li, W., Zhu, Q.: POLAR: A polynomial arithmetic framework for verifying neural-network controlled systems. In: Bouajjani, A., Holík, L., Wu, Z. (eds.) Proceedings of the 20th International Symposium on Automated Technology for Verification and Analysis. Lecture Notes in Computer Science, vol. 13505, pp. 414–430. Springer (2022) 6. Jaulin, L., Kieffer, M., Didrit, O., Walter, E.: Applied Interval Analysis. Springer (2001) 7. Katz, G., Barrett, C.W., Dill, D.L., Julian, K.D., Kochenderfer, M.J.: Reluplex: An efficient SMT solver for verifying deep neural networks. In: The 29th International Conference on Computer Aided Verification (CAV 2017). vol. 10426, pp. 97–117. Springer (2017) 8. Manfredi, G., Jestin, Y.: An introduction to acas xu and the challenges ahead. In: 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC). pp. 1–9 (2016) 9. Mangal, R., Nori, A.V., Orso, A.: Robustness of neural networks: a probabilis- tic and practical approach. In: Proceedings of the 41st International Conference on Software Engineering: New Ideas and Emerging Results, ICSE (NIER) 2019, Canada. pp. 93–96 (2019) 10. Meng, M.H., Bai, G., Teo, S.G., Hou, Z., Xiao, Y., Lin, Y., Dong, J.S.: Adversarial robustness of deep neural networks: A survey from a formal verification perspective. CoRR abs/2206.12227 (2022) 11. Mitchell, T.M.: Machine Learning. McGraw-Hill Education (1997) 12. Pilipovsky, J., Sivaramakrishnan, V., Oishi, M., Tsiotras, P.: Probabilistic verifica- tion of relu neural networks via characteristic functions. In: Learning for Dynamics and Control Conference, L4DC 2023, 15-16 June 2023, Philadelphia, PA, USA. Proceedings of Machine Learning Research, vol. 211, pp. 966–979 (2023) 13. Tambon, F., Laberge, G., An, L., Nikanjam, A., Mindom, P.S.N., Pequignot, Y., Khomh, F., Antoniol, G., Merlo, E., Laviolette, F.: How to certify machine learning based safety-critical systems? A systematic literature review. Autom. Softw. Eng. 29(2), 38 (2022) 14. Tran, H., Choi, S., Okamoto, H., Hoxha, B., Fainekos, G., Prokhorov, D.V.: Quan- titative verification for neural networks using probstars. In: Proceedings of the 26th ACM International Conference on Hybrid Systems: Computation and Con- trol, HSCC 2023, USA,. pp. 4:1–4:12 (2023) 15. Wang, S., Zhang, H., Xu, K., Lin, X., Jana, S.S., Hsieh, C.J., Kolter, J.Z.: Beta- crown: Efficient bound propagation with per-neuron split constraints for neural network robustness verification. In: 34th Annual Conference on Neural Information Processing Systems, (NeurIPS 2021). pp. 29909–29921 (2021) --- Page 17 --- Probabilistic Verification of Neural Networks 17 16. Webb, S., Rainforth, T., Teh, Y.W., Kumar, M.P.: A statistical approach to as- sessing neural network robustness. In: 7th International Conference on Learning Representations, USA (2019) 17. Weng, T.W., Chen, P.Y., Nguyen, L.M., Squillante, M.S., Boopathy, A., Oseledets, I., Daniel, L.: Proven: Certifying robustness of neural networks with a probabilistic approach 97 (2019) 18. Wicker, M., Laurenti, L., Patane, A., Paoletti, N., Abate, A., Kwiatkowska, M.: Probabilistic reach-avoid for bayesian neural networks. Artif. Intell. 334, 104132 (2024) 19. Wicker, M., Patane, A., Laurenti, L., Kwiatkowska, M.: Adversarial robustness certification for bayesian neural networks. In: Formal Methods - 26th International Symposium, FM 2024, Milan, Italy, September 9-13, 2024, Proceedings, Part I. Lecture Notes in Computer Science, vol. 14933, pp. 3–28 (2024) 20. Wu, H., Isac, O., Zeljic, A., Tagomori, T., Daggitt, M.L., Kokke, W., Refaeli, I., Amir, G., Julian, K., Bassan, S., Huang, P., Lahav, O., Wu, M., Zhang, M., Komendantskaya, E., Katz, G., Barrett, C.W.: Marabou 2.0: A versatile formal analyzer of neural networks. In: Computer Aided Verification - 36th International Conference, CAV 2024, Montreal, QC, Canada, July 24-27, 2024, Proceedings, Part II. Lecture Notes in Computer Science, vol. 14682, pp. 249–264 (2024) 21. Xu, K., Shi, Z., Zhang, H., Wang, Y., Chang, K., Huang, M., Kailkhura, B., Lin, X., Hsieh, C.: Automatic perturbation analysis for scalable certified robustness and beyond. In: Advances in Neural Information Processing Systems 33: Annual Con- ference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (2020) 22. Yang, X., Yamaguchi, T., Tran, H., Hoxha, B., Johnson, T.T., Prokhorov, D.V.: Neural network repair with reachability analysis. In: Formal Modeling and Analy- sis of Timed Systems - 20th International Conference, FORMATS 2022, Warsaw, Poland, September 13-15, 2022, Proceedings. Lecture Notes in Computer Science, vol. 13465, pp. 221–236. Springer (2022) 23. Zhang, H., Weng, T.W., Chen, P.Y., Hsieh, C.J., Daniel, L.: Efficient neural net- work robustness certification with general activation functions. In: 31st Annual Conference on Neural Information Processing Systems 2018, (NeurIPS 2018). pp. 4944–4953 (2018) A Hyperparameter and Grid Search We performed an extensive grid search to determine the best configuration for the core components of our approach: the sampling strategy, the impurity scheduler, and the regression tree depth. A.1 ACAS Xu Benchmark For the ACAS Xu benchmark, the search space was defined as follows. – Sampling Weights (wuniform, wdistribution): Five configurations were tested. (0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), and (1.0, 0.0). – Regression Tree Depth: Two values were considered. 5 and 10. – Impurity Splitting (β): Five values were tested. 0, 0.25, 0.5, 0.75, and 1.0. --- Page 18 --- 18 J. Li et al. – Impurity Coefficient (α): Six values were evaluated. 0.05, 0.1, 0.3, 0.5, 1.0, and 2.0. This parameter space resulted in 25 impurity configurations and a total of 250 parameter combinations. The evaluation metrics for each combination were the remaining unknown region probability (Us −Ls) and the corresponding runtime. The grid search was performed on ACAS network 1-6 for specification 2. Figure 4 presents the results, mapping the trade-off between Us −Ls (vertical axis) and runtime (horizontal axis). Each point represents a distinct configura- tion. The red star highlights the Pareto optimal configuration, which outper- forms all other combinations in both metrics. Gray points indicate configurations that are dominated by the Pareto solution in at least one metric. The search identified the optimal configuration for the ACAS benchmark as pure distributional sampling (0.0, 1.0), a regression tree depth of 5, β = 0.75, and α = 0.05. These settings were subsequently adopted for the main experiments on the ACAS Xu benchmark. 0 200 400 600 800 1000 1200 Runtime (seconds) 0.05 0.10 0.15 0.20 Us-Ls Pareto Optimal ((0.00, 1.00), 5, (0.05, 0.75)) Other Configurations Pareto Optimal (Best & Fastest) Fig. 4. Scatter plot showing the relationship between Us-Ls and runtime for 250 dif- ferent parameter configurations tested in the grid search. A.2 Rocket Lander Benchmark The same systematic grid search methodology was applied to the Rocket Lander controller. The search space for this benchmark was defined as follows. – Sampling Weights (wuniform, wdistribution). The same five configurations were tested. (0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), and (1.0, 0.0). --- Page 19 --- Probabilistic Verification of Neural Networks 19 20 40 60 80 100 Runtime (seconds) 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Us-Ls Pareto Optimal ((0.25, 0.75), 5, (0.05, 0.00)) Other Configurations Pareto Optimal (Best & Fastest) Fig. 5. Scatter plot showing the relationship between Us-Ls and runtime for 300 dif- ferent parameter configurations tested in the grid search. – Regression Tree Depth. Three values were considered. 5, 10, and 20. – Impurity Splitting (β). Five values were tested. 0, 0.25, 0.5, 0.75, and 1.0. – Impurity Coefficient (α). Five values were evaluated. 0.05, 0.1, 0.3, 0.5, and 1.0. This yielded a total of 300 parameter combinations. The grid search was per- formed on Rocket Net 1 for specification 1. Figure 5 presents the corresponding Pareto front analysis. Consistent with the ACAS results, the Us −Ls is plotted against the runtime, and the red star marks the Pareto optimal configuration. The optimal configuration determined for the Rocket Lander benchmark was a *mixed sampling strategy (0.25, 0.75), a regression tree depth of 5, β = 0.0, and α = 0.05**. These settings were adopted for the main experiments on the Rocket Lander benchmark. B Ablation Studies We conduct a comprehensive ablation study to analyze the independent con- tribution and combinatorial effects of the key components in our proposed ap- proach. The components investigated include the weighted mixed sampling and the impurity scheduler defined by α and β in our hyperparameter settings. The ablation studies are conducted on both the high-dimensional Rocket Lander and the moderate-dimensional ACAS Xu benchmarks. The following sections detail the findings, quantifying the impact of each mechanism on verification tightness (Us −Ls) and time efficiency across these diverse benchmarks. --- Page 20 --- 20 J. Li et al. Table 5. Ablation Study on Rocket Lander and ACAS Xu (Average Results). The table compares the average unknown probability (Us −Ls) and average runtime for different component configurations on two benchmarks. The weights are wu (uniform) and wd (distributional). Configuration (wu, wd) Rocket Lander ACAS Xu Avg. Us −Ls Avg. Time (s) Avg. Us −Ls Avg. Time (s) Baseline Configurations (Impurity Scheduler Disabled) Baseline (1.00, 0.00) 0.9194 30.26 0.0650 279.46 (0.00, 1.00) 0.9242 24.86 0.0681 279.05 (0.75, 0.25) 0.9134 30.74 0.0652 277.95 (0.50, 0.50) 0.9124 29.22 0.0664 279.09 (0.25, 0.75) 0.9124 27.80 0.0662 278.14 Configurations with Impurity Scheduler Enabled + Impurity Scheduler (1.00, 0.00)∗ 0.2466 23.17 0.0341 211.61 (0.00, 1.00) 0.3294 25.23 0.0368 221.05 (0.75, 0.25) 0.2371 21.25 0.0340 210.33 (0.50, 0.50) 0.2705 22.02 0.0344 214.28 (0.25, 0.75) 0.2576 21.71 0.0350 218.96 ∗For ACAS Xu, the impurity scheduler uses β = 0.75 and α = 0.05 as the representative Impurity-only configuration for comparison. For Rocket Lander, the impurity scheduler uses β = 0.0 and α = 0.05 as the representative Impurity-only configuration for comparison. B.1 Contribution of Sampling and Impurity Scheduling The analysis of results presented in Table 5 focuses on the core question of whether sampling alone is effective and how the Impurity Scheduler contributes to the final performance. The data reveals that the Impurity Scheduler is the critical component for achieving high verification accuracy (Us −Ls). When the Impurity Scheduler is disabled, varying the sampling ratio only provides marginal improvements in Us −Ls. For the Rocket Lander, the best sampling ratio (wu = 0.25 or wu = 0.50) achieves 0.9124, barely better than the worst sampling ratio (0.9194). For ACAS Xu, the sampling variance shows negligible impact on accuracy. However, enabling the Impurity Scheduler changes the performance profile. Firstly, for the pure uniform sampling case (wu = 1.00), Us −Ls is reduced from 0.9194 to 0.2466 for Rocket Lander and from 0.0650 to 0.0341 for ACAS Xu. This confirms that the scheduler can significantly tighten the verification bounds. Furthermore, when the Impurity Scheduler is active, optimizing the mixed sampling ratio becomes relevant, providing a measurable gain. For both bench- marks, the (0.75, 0.25) mixed ratio achieves the tightest overall bound (0.2371 for Rocket Lander and 0.0340 for ACAS Xu). This indicates that while sampling is not the main driver of accuracy, an optimal mix of uniform and distributional sampling is necessary for the final refinement. --- Page 21 --- Probabilistic Verification of Neural Networks 21 Table 6. Ablation Study on Impurity Scheduler Threshold (β). The table compares the performance (Us −Ls and Runtime) of the Impurity Scheduler by varying the threshold β, which controls the transition from longest-dimension splitting (β = 1.0) to soft-coded impurity splitting (β < 1.0). Only pure uniform sampling (wu = 1.00) is used. Threshold β Splitting Mode Rocket Lander ACAS Xu Avg. Us −Ls Avg. Time (s) Avg. Us −Ls Avg. Time (s) Baseline (β Implied) Longest-Dimension 0.9194 30.26 0.0650 279.46 0.75 Late Activation 0.9194 30.24 0.0341 211.61 0.50 Mid Activation 0.9194 30.53 0.0342 195.23 0.25 Early Activation 0.9194 30.29 0.0378 210.71 0.00 Immediate Soft Impurity 0.2466 23.17 0.0610 349.11 Regarding runtime efficiency, the runtime for nearly all configurations, re- mains within a similar order of magnitude on both benchmarks. For instance, the Rocket Lander configurations generally fall between 21s and 31s, and the ACAS Xu configurations are mostly between 210s and 280s. This suggests that the major performance contribution of the Impurity Scheduler lies in the massive improvement of accuracy (Us −Ls) convergence, while overall execution time remains relatively constant across effective configurations. B.2 Ablation on Impurity Scheduler Threshold (β) Given the critical role of the Impurity Scheduler, we conduct a focused ablation study by varying the threshold β, which controls the strategy for transitioning from standard longest-dimension splitting to the proposed soft-coded impurity measure (α-weighted splitting). The results, using pure uniform sampling (wu = 1.00), are summarized in Table 6. The optimal transition strategy is found to be benchmark-dependent. For Rocket Lander benchmark, the critical finding is that the Immediate Soft Impurity (β = 0.00) mode is optimal. This configuration achieves the tightest bound of 0.2466 and the best runtime of 23.17s within this set. This suggests that the longest-dimension splitting provides minimal structural benefit, and immediate application of the boundary-aware measure is key. As for ACAS Xu, a hybrid splitting strategy is essential. The Immediate Soft Impurity (β = 0.00) yields a poor Us −Ls of 0.0610 and the longest runtime of 349.11s. In contrast, the best accuracy (0.0341) is achieved when β = 0.75 (Late Activation), allowing the longest-dimension splitting to run for the initial portion of the verification. The best runtime (195.23s) is achieved at β = 0.50 (Mid Activation). This confirms that for ACAS Xu, initial global refinement via standard splitting significantly accelerates the process before the fine-grained, boundary-aware splits are activated. The distinct optimal β values can be explained by the fundamental differences in input dimensionality and the subsequent geometric challenges presented to the verification method (which relies on convex hull approximations): --- Page 22 --- 22 J. Li et al. 1. Geometric Control: In methods utilizing convex hull approximations, the Longest-Dimension splitting is employed primarily to maintain a low inter- dimension ratio for the divided regions. Preventing regions from becoming highly elongated is vital, as this mitigates the over-approximation error that occurs when hull methods are applied to non-square regions. 2. High Dimensionality (Rocket Lander): The 9-dimensional space of Rocket Lander limits the effectiveness of pure Longest-Dimension splitting. The soft impurity measure from the beginning is necessary because its ge- ometric penalty term (Lα) simultaneously guides the split toward reducing boundary uncertainty (via MSE) and acts to regularize the region’s aspect ratio by penalizing splits across large dimensions (L). This combined effi- ciency makes immediate activation optimal, as the structural splitting stage provides little benefit. 3. Initial Refinement Need (ACAS Xu): In the lower 5-dimensional ACAS Xu space, the Longest-Dimension splitting remains an efficient method for achieving an initial, coarse, global subdivision. This initial phase (enabled by β = 0.75 or β = 0.50) establishes regions with more uniform aspect ratios. This preprocessing ensures that when the fine-grained soft impurity measure is activated later, it operates on geometrically well-behaved regions, thereby improving both accuracy and overall convergence speed. This analysis confirms that our impurity scheduler is a highly effective compo- nent, but its optimal configuration requires careful tuning based on the specific dimensional and structural characteristics of the benchmark.

Chunks

Chunk 0

Chunk 1

It is significant to answer this problem when the input is affected by distur- bances often modeled by probabilistic variables. In the paper, we propose a novel neural network probabilistic verification framework which com- putes a guaranteed range for the safe probability by efficiently finding safe and unsafe probabilistic hulls.

Chunk 2

Our approach consists of three main innovations: (1) a state space subdivision strategy using regression trees to produce probabilistic hulls, (2) a boundary-aware sampling method which identifies the safety boundary in the input space using samples that are later used for building regression trees, and (3) iterative refine- ment with probabilistic prioritization for computing a guaranteed range for the safe probability. The accuracy and efficiency of our approach are evaluated on various benchmarks including ACAS Xu and a rocket lan- der controller.

Chunk 3

The result shows an obvious advantage over the state of the art. Keywords: Neural Network · Formal Verification · Probabilistic Verifi- cation.

Chunk 4

1 Introduction Formal verification of deep neural networks (DNNs) is crucial for ensuring the trustworthiness and the reliability of learning-enabled systems in safety-critical scenarios [10, 13]. While most existing techniques aim at solving the output range analysis problem of DNNs [7, 20, 23], it is still difficult to know the safety of a DNN when the input is affected by noises that are often modeled by random variables.

Chunk 5

Since noises or disturbances are often described by Gaussian variables, their values are not bounded and can hardly be handled by existing output range analysis methods. We propose a new approach which generates a guaranteed range for the safety of a DNN when its input is Gaussian variables.

Chunk 6

Probabilistic verification [17, 9, 16, 2, 14] involves analyzing the probability of safety violations under specified input distributions instead of giving a determin- istic answer to the safety query. In order to estimate the probability of satisfying arXiv:2604.21556v1 [cs.AI] 23 Apr 2026 --- Page 2 --- 2 J.

Chunk 7

Li et al. the safe constraints in the output space of a DNN, a straightforward approach could be to compute the output distribution and then integrate its density func- tion over the safe output space.

Chunk 8

However, it is often not feasible due to the complexity of DNNs. In this paper, we propose a method which conservatively estimate the safety of a DNN based on a set of safe or unsafe probabilistic hulls that at most intersect on facets.

Chunk 9

A probabilistic hull is a closed and bounded set in the input space whose orientation is aligned with the Gaussian distribution of the input, such that the probability of it can be efficiently computed based on the error function [1]. An upper bound of safety can be obtained by subtracting the probabilities of all probabilistic hulls which are purely unsafe, while a lower bound can be derived by adding the probabilities of all safe hulls.

Chunk 10

The main contribution of our method is an efficient search strategy for finding large safe and unsafe hulls. Existing probabilistic neural network verification methods suffer from the curse of dimensionality.

Chunk 11

Typically, the existing set-based techniques [23, 15] often prioritize partitioning the input space uniformly that cannot recognize critical regions. This leads to exponential amount of subdivisions and is time consuming.

Chunk 12

Other methods, such as ProbStar [14] symbolically propagates the input set to the output space and then estimate their safe probability, it is able to more accurately track the input-output dependency however cannot handle the DNNs whose activations are not ReLU. We study the probabilistic verification problem on feedforward DNNs.

Chunk 13

We propose a novel regression tree-guided probabilistic verification framework that achieves both efficiency and generality. Our key insight is that neural network decision/safe boundaries, i.e., the boundary dividing the safe and unsafe set in the output space, can be effectively approximated through adaptive region par- titioning guided by sampling-based regression trees, eliminating the unnecessary exploration and subdivision in the regions that are purely safe or unsafe.

Chunk 14

Our approach makes three primary contributions: (1) We introduce a regression tree- guided state space splitting strategy for the input space that efficiently generates purely safe or unsafe probabilistic hulls of large probabilities. It avoids the un- necessary subdivisions in the safe or unsafe regions in most practical cases.

Chunk 15

(2) We propose a boundary-aware sampling method which effectively identifies the safety boundary in the input space by random samples which are later used to build regression trees. (3) We develop an incremental sampling mechanism that iteratively refines the undecided regions with the largest probability, avoiding the prohibitive cost of dense sampling across the entire input space.

Chunk 16

Experimen- tal evaluation on the challenging benchmarks demonstrates that our method achieves superior accuracy against the state of the art and achieves up to 10× speedup compared to basic branch-and-bound approaches. 2 Related Work Linear bound propagation techniques, exemplified by CROWN [23], have rev- olutionized neural network verification by extending traditional interval bound --- Page 3 --- Probabilistic Verification of Neural Networks 3 propagation to linear function-based bound propagation.

Chunk 17

This advancement sig- nificantly improves scalability while maintaining reasonable tightness for veri- fication tasks. Building upon CROWN, β-CROWN [15] integrates branch-and- bound (BaB) techniques with GPU acceleration to achieve both completeness and scalability in verification processes.

Chunk 18

Probabilistic verification has emerged as a critical paradigm for quantify- ing neural network safety under stochastic inputs. One part of the research focus on Bayesian neural network verification [3, 18, 19].

Chunk 19

Early approaches like PROVEN [17] established distribution-aware robustness guarantees by integrat- ing probabilistic measures with formal verification techniques. Subsequently, hy- brid methods [9] combined abstract interpretation with Monte Carlo sampling to balance computational efficiency with probabilistic assurances.

Chunk 20

Statistical evalu- ation frameworks [16] further advanced the field by employing adaptive sampling strategies to estimate perturbation tolerance while quantifying violation proba- bilities. Besides, [12] proposes a characteristic-function-based probabilistic ver- ification method which estimates the safe probability via frequency-domain.

Chunk 21

Recent developments have focused on scalability. The probabilistic branch- and-bound approach [2] iteratively refines probability bounds through dynamic region subdivision, achieving improved verification success rates but suffering from exponential complexity in high-dimensional spaces.

Chunk 22

Most notably, Prob- Star [14] introduces probabilistic star-based reachability analysis for efficient uncertainty propagation in safety-critical systems. However, ProbStar is limited to ReLU networks due to its reliance on linear region enumeration, restricting its applicability to networks with generic activation functions.

Chunk 23

3 Preliminaries A deterministic input to a DNN can be denoted by a column vector x = (x1, . .

Chunk 24

. , xd).

Chunk 25

Therefore, a noise-affected input can be denoted by x + w where w = (w1, . .

Chunk 26

. , wd) are the noises for each input variable.

Chunk 27

In this paper, we fo- cus on Gaussian noises, i.e., w subject to a multivariate Gaussian distribution, since they are the most popular models for noises in practice. Since Gaussian distributions are closed under affine transformations, a noise-affected input can be represented by Gaussian variables, i.e., x ∼N(µ, Σ) where µ is the mean and Σ is the covariance matrix.

Chunk 28

Problem Formulation. Let f : Rd →Rm denote the input-output mapping of a (feedforward) DNN which has d inputs and m outputs.

Chunk 29

Given a Gaussian input x ∼N(µ, Σ) and a safety property in the output space: Γ(y) : Cy ≥a, such that the property is defined by a finite set of affine constraints/inequalities. Here, y is a point/vector in the output space, i.e., y = f(x), C ∈Rk×m is a matrix of coefficients, and a ∈Rk is a constant vector.

Chunk 30

The DNN probabilistic verification problem asks to estimate the probability P(Γ(y) | x ∼N(µ, Σ)). (1) --- Page 4 --- 4 J.

Chunk 31

Li et al. As we pointed out that an exact calculation of the probability (1) is in- tractable.

Chunk 32

Therefore, we conservatively estimate an upper bound as well as a lower bound for the actual probability via computing a finite set of safe and unsafe probabilistic hulls. Given a set of Gaussian variables x ∼N(µ, Σ) such that Σ = diag(σ2 1, .

Chunk 33

. .

Chunk 34

, σ2 d), i.e., all variables subject to independent distributions. The probability P(x ∈H), or simply P(H), where the hull H is a box [a1, b1]×· · ·×[ad, bd] can be computed by evaluating the expression Q 1≤i≤d P(ai ≤xi ≤bi) such that P(ai ≤xi ≤bi) = 1 2 erf(bi −µ √ 2σi ) −erf(ai −µ √ 2σi ) where erf is the error function [1].

Chunk 35

For any full rank covariance matrix Σ, a hull H aligned with its orientation can be calculated in a similar way. That is, we may transform Σ to a diagonal matrix, transform H to a box using the same mapping and evaluate the probability of being in H with the transformed distribution.

Chunk 36

Hence, for simple denotation, we assume that Σ is diagonal. In a DNN verification problem, a hull H is safe (unsafe resp.) when all points in its output set f(H) satisfying (violating resp.) the safe property Γ.

Chunk 37

Input f, Γ, µ, Σ, ε Initialize Runknown = [Space] Rsafe, Runsafe = ∅ P R∈Runknown P(R) ≤ε? Output Safe probability interval [Ls, Us] Select Region Region Rnext with largest probability in Runknown Boundary-Aware Subdivision Extract probabilsitic hulls in Rnext Verify and Update Region Sets Add to Rsafe/Runsafe/Runk No Yes Fig.

Chunk 38

1. Workflow of our regression tree-guided probabilistic verification framework.

Chunk 39

4 Main Approach The main idea of our approach for estimating conservative upper and lower bounds for the actual safe probability is to iteratively construct a set of safe hulls: Hs 1, . .

Chunk 40

. , Hs N and a set of unsafe hulls: Hu 1 , .

Chunk 41

. .

Chunk 42

, Hu M, such that all hulls intersect at most on the facets. Therefore, each hull represents an independent probability from the others.

Chunk 43

Then, the upper bound is computed as Us = 1−P 1≤i≤M P(Hu i ), while the lower bound is obtained as Ls = P 1≤i≤N P(Hs i ). Theorem 1.

Chunk 44

If Hs 1, . .

Chunk 45

. , Hs N are safe and Hu 1 , .

Chunk 46

. .

Chunk 47

, Hu M are unsafe, then we have that Us ≥Ls, and the actual safe probability must be in the interval [Ls, Us]. --- Page 5 --- Probabilistic Verification of Neural Networks 5 Obviously, if the probabilistic hulls can be efficiently found and have high probabilities, an accurate range for safe probability can be obtained.

Chunk 48

Unlike the related methods, the probabilistic hulls are not generated from random samples. Instead, we build (safe) boundary-ware regression trees which subdivide the input space into box regions, then we use CROWN to verify the safety of them and identify the safe and unsafe subregions as hulls.

Chunk 49

In the experiments, we show that such a strategy greatly outperforms the method using uniform sampling. The high-level workflow of our framework is given in Figure 1.

Chunk 50

Given a DNN f, a safety property Γ, a Gaussian distribution N(µ, Σ) for the input, and a ter- mination threshold ε > 0. Here, we only need to use the DNN as a blackbox.

Chunk 51

The main algorithm keeps three sets of probabilistic hulls: Rsafe, Runsafe, Runknown that are used to keep the set of safe, unsafe and unknown hulls respectively. Ini- tially, Runknown has a single element which represents the whole state space, while Rsafe, Runsafe are both empty.

Chunk 52

The whole state space can be represented by a large bounded set which is of a probability sufficiently close to 1 (e.g. ≥99.999%) for the inputs, instead of an unbounded support range from Gaussian distributions.

Chunk 53

Then the main loop performs the following steps: (1) It chooses the hull Rnext which has the highest probability in Runknown and delete it from Runknown; (2) Rnext is subdivided using our boundary-aware subdivision strategy which builds a regression tree and the subdivisions are new probabilistic hulls obtained as the tree leafs; (3) We verify the output sets of the subdivisions. We add the safe hulls to Rsafe, the unsafe hulls to Runsafe, and the hulls that are not purely safe or unsafe to Runknown.

Chunk 54

The loop terminates when the summation of the proba- bilities of the hulls in Runknown is < ε. The details are described in the following subsections.

Chunk 55

4.1 Boundary-Aware Subdivision Strategy Uniform Subdivision Unk Unk Safe Unk Boundary-Aware Subdivision Unk Safe Unsafe Unk — Boundary - - Splits • Samples Fig. 2.

Chunk 56

Uniform vs. boundary-aware subdivision.

Chunk 57

Efficiently finding large safe/unsafe hulls plays a key role in our framework. A naive way could be uniformly subdivide the input state space and verify the safety of each piece.

Chunk 58

However, it is often very expensive and ignores the decision boundary. Directly considering the boundary in probabilistic hull search can be --- Page 6 --- 6 J.

Chunk 59

Li et al. difficult, since the boundary is in the output space and the subdivision in per- formed in the input space.

Chunk 60

However, it is still possible to “backpropagate” the information of the boundary to the subdivision process. Figure 2 gives an intu- itive comparison between uniform subdivision and boundary-aware subdivision.

Chunk 61

The red curve indicates the backpropagated boundary to the input space, it is often not computable and unknown to our process. It can be seen from the figure that, using boundary-aware subdivision can better produce large hulls which do not intersect the boundary and are pure safe or unsafe.

Chunk 62

Our boundary-aware subdivision method is outlined in Algorithm 1. Given a DNN f, a region R in the input space, number of samples n and the safety specification defined by the parameters C and a.

Chunk 63

Our whole boundary-ware sub- division process performs the following steps: (1) It samples a point set Sbase with n points within the given region R and check the safety of their outputs; (2) If both safe and unsafe outputs are found, it conducts boundary-aware sampling by rejecting the samples whose outputs are sufficiently far from the boundary, and we obtain a point set S; (3) Otherwise, we set S = Sbase since the boundary might not be in R; (4) We use the samples in S to build a regression tree T in the region R, where each leaf represents a subdivision of R. The details of boundary-aware sampling are given as below.

Chunk 64

Combined Sampling Strategy Foundation. We introduce the setup of our base sampling strategy.

Chunk 65

Our base sampling strategy is the combination of the distribu- tional sampling of the input distribution N(µ, Σ) and the uniform distribution. Given an integer n, we sample ⌊n/2⌋points with distributional sampling and sample n −⌊n/2⌋points with uniform sampling.

Chunk 66

We adopt the combination in- stead of only one of them because (i) we need to take the distribution of the input variables into account since it helps to find probabilistic hulls in the space of high probability density and make the hulls of high probability, while (ii) ad- ditionally considering a uniform distribution helps us to sample the margin of a region, and have balanced safe and unsafe samples. Boundary Awareness Through Elimination.

Chunk 67

In order to better reflect the safe boundary in R, we check our samples to see if both safe and unsafe points are generated. If not, we do not start the rejection sampling step.

Chunk 68

Otherwise, we aim at keeping the samples near the boundary. Although it is difficult to compute the boundary in the input space, the distance of a sample x to the boundary can be measured as the distance between f(x) and the boundary.

Chunk 69

Then we apply an elimination mechanism to remove the samples that are sufficiently far from the boundary. To do so, we associate a probability distribution for eliminating a sample.

Chunk 70

For each sample x, we compute its output f(x) and the distance to the safety boundary Cf(x) −a. Then, the elimination probability is given by Pe(x) = 1 −exp(−rank(|Cf(x) −a|/σ)) where rank returns the corresponding index in the sorted distance from the nearest to the farthest.

Chunk 71

--- Page 7 --- Probabilistic Verification of Neural Networks 7 Algorithm 1: Boundary-Aware Subdivision Input: Region R, number of samples n, network f, C, a Output: Boundary-aware regression tree T // Combined sampling foundation 1 Sbase ←CombinedSample(R, n); // Check if rejection sampling is needed 2 if Sbase contains only safe or unsafe points then // No boundary in region, return base samples 3 return Sbase 4 end // Rejection sampling 5 S ←∅; 6 nattempts ←0; 7 while |S| < n and nattempts < max_attempts do 8 x ←CombinedSample(R, n) 9 xaccept ←EliminateSamples(C, f, x) 10 S ←S ∪{xaccept}; 11 nattempts ←nattempts + 1; 12 end // Fill remaining samples if not enough sample points 13 if |S| < n then 14 S ←S∪CombinedSample(R, n −|S|); 15 end 16 T ←BuildRegressionTree(S, R) 17 return T Regression Tree Construction. We use the sample set after elimination to build a regression tree which is expected to better split the region along the boundary.

Chunk 72

The tree construction follows the standard algorithm [11]. For each non-leaf node, the impurity of the samples is measured as Impurity = MSE(Sleft, Sright) Lα where Sleft, Sright are the split sets, L is the variation of the samples in the dimension of subdivision, and α is a user-specified hyperparameter controlling the geometric penalty.

Chunk 73

Chunk 74

--- Page 8 --- 8 J. Li et al.

Chunk 75

Algorithm 2: The Main Algorithm Input: Network f, C, a, distribution p(x), threshold ε Output: Safe probability interval [Ls, Us] 1 Initialize: Runknown ←[entire_input_space]; 2 Rsafe ←∅, Runsafe ←∅; 3 while P R∈Runknown P(R) ≥ε do // Select region with maximum probability mass 4 Rnext ←SelectLargestProbRegion(Runknown); // Targeted sampling within selected region 5 S ←BoundaryAwareSample(Rnext, small_sample_size); // Build regression tree and extract leaf regions 6 T ←BuildRegressionTree(S, Rnext); 7 {R1, ..., Rk} ←ExtractLeafRegions(T); // Verify each leaf region 8 {s1, ..., sk} ←CROWNVerify({R1, ..., Rk}, C, a); // Update region sets 9 Remove Rnext from Runknown; 10 for i ←1 to k do 11 if si = Safe then 12 Rsafe ←Rsafe ∪{Ri}; 13 else if si = Unsafe then 14 Runsafe ←Runsafe ∪{Ri}; 15 else 16 Runknown ←Runknown ∪{Ri}; 17 end 18 end 19 end 20 Ls ←P R∈Rsafe P(R); 21 Us ←1 −P R∈Runsafe P(R); 22 return [Ls, Us] Algorithm 2 presents the main algorithm that is a detailed treatment of the workflow in Figure 1. It operates through an iterative refinement loop that progressively subdivides the input space.

Chunk 76

Rather than attempting to resolve all unknown hulls equally, we strategically select and refine only the candidates with the highest probability mass among the unknown hulls. 4.2 Analysis of the Framework Correctness of the Result.

Chunk 77

We investigate the correctness of the result from Algorithm 2. First, all of the probabilistic hulls generated intersect at most on the facets.

Chunk 78

To see that, each hull is obtained as a leaf of a regression tree which subdivide a rougher hull into sub-hulls whose intersections are at most the facets. Therefore, each probabilistic hull in the set Rsafe and Runsafe has a probability which is independent from the probabilities represented by the other --- Page 9 --- Probabilistic Verification of Neural Networks 9 hulls.

Chunk 79

Next, we show that Us is an over-estimate of the actual safe probability and Ls is an under-estimate of probability. Due to the conservativeness of the tool CROWN, all hulls in Runsafe are unsafe and all hulls in P R∈Rsafe are safe.

Chunk 80

Since the probabilities of the hulls in Runsafe are independent, we infer that the DNN is at least 1 −Us = P R∈Runsafe P(R) unsafe, and thereby at most Us safe. Analogously, the DNN is at least Ls = P R∈Rsafe P(R) safe since the hulls are safe and representing independent probability.

Chunk 81

We also have that Ls ≤Us. Termination of the Main Algorithm.

Chunk 82

We show that Algorithm 2 will eventually terminate with Us−Ls < ε. Our main algorithm can be viewed as a new branch- and-bound scheme with probability density and boundary-aware subdivision.

Chunk 83

During the iterations, the sum of the volumes of the hulls in the unknown set Runknown becomes smaller and smaller due to the repeated refinement. A proof for the termination can be directly obtained by adapting the termination proof of the standard branch-and-bound method [6].

Chunk 84

The volume of the region defined by the union of all undecided hulls, i.e., the hulls in the set of Runknown, converges to zero when the number of iterations goes to infinity. Hence, the total probability of all hulls in Runknown will eventually be below the given threshold ε after a finite number of iterations.

Chunk 85

It is also the probability that lies out of the confirmed safe and unsafe probability estimates, i.e., Us −Ls < ε. Theorem 2.

Chunk 86

Algorithm 2 terminates with an interval that contains the actual safe probability of the DNN. When Algorithm 2 terminates, the union of the hulls in Runknown forms an overapproximation of the image of the safety boundary via the inverse mapping of the DNN.

Chunk 87

5 Experiments 5.1 Experimental Setup Platform and Tools. Our experiment is conducted on a Linux server running Ubuntu 18.04, equipped with an Intel Xeon E5-2678 v3 processor and a GPU of Nvidia RTX 4090.

Chunk 88

We also use auto_LiRPA [21] library to compute the bounds, and our implementation takes as input neural networks in PyTorch format. The safety verification of probabilistic hulls are performed by CROWN [23].

Chunk 89

Our full source code, scripts, and necessary experimental data are publicly available on GitHub at https://github.com/LIJYann/Regression-Tree-Guided- Probabilistic-Hull-Generation. The repository corresponds to the code version used for this submission, ensuring reproducibility.

Chunk 90

Comparison Baselines. We compare our approach with two state-of-the-art tools, ProbStar [14] and the basic branch-and-bound (BaB) method [2] (binary partition on the longest dimension).

Chunk 91

The comparison is done on three groups of DNN benchmarks: ACAS Xu, the rocket lander controller benchmark for SpaceEx Falcon9, and the tanh-based DNNs distilled from those in ACAS Xu --- Page 10 --- 10 J. Li et al.

Chunk 92

(see [4]). We use a less restrictive condition, maxR∈Runknown P(R) ≤ε, for the termination condition in order to reach a reasonable tradeoff between efficiency and accuracy.

Chunk 93

This condition is used in both our method and the BaB in order to give a fair comparison. The parallel computation in BaB and our approach are conducted on GPU with auto_LiRPA.

Chunk 94

However, due to the implementation of ProbStar, it can only be parallelized on CPU, and we use the default parallel setting for ProbStar. The experimental results from the tools are compared based on the runtime and the tightness of the upper and lower bounds for the actual safe probability.

Chunk 95

Hyperparameter Settings. To systematically evaluate our regression-tree- guided probabilistic verification method across different parameter configura- tions, we conducted a comprehensive grid search on each benchmark.

Chunk 96

We test all combinations of key parameters through their Cartesian product, enabling us to identify optimal configurations. We examined three critical parameters: sampling strategy, regression tree depth, and impurity hyperparameter settings.

Chunk 97

Overall, the configurations can be described as ((wuniform, wdistribution), depth, (α, β)): – Sampling strategy: Weight configurations balancing uniform and distribu- tional sampling (wuniform, wdistribution); – Regression tree depth: Depth limits on regression tree; – Impurity scheduler: Combined the proposed α-weighted measure with longest- dimension splitting. Initial splits follow the longest dimension until the hull probability reaches threshold β, then transition to the soft-coded impurity measure (α, β).

Chunk 98

Fig. 3.

Chunk 99

Boundary-aware subdivision for the toy example. Illustrative Toy Examples.

Chunk 100

Before we show the comparisons, Figure 3 il- lustrates the probabilistic hulls computed for the toy example described in [14]. --- Page 11 --- Probabilistic Verification of Neural Networks 11 The red boxes are the probabilistic hulls that are identified as unsafe.

Chunk 101

The green boxes are the safe hulls. While the yellow ones are the boxes whose corresponding outputs intersect the safety boundary.

Chunk 102

The union of the yellow boxes forms an overapproximation of the image of the safety boundary under the inverse map- ping of the neural network. It can be seen that the state far from the boundary is often included by large probabilistic hulls, i.e., some unnecessary subdivisions can be avoided using our boundary-aware method.

Chunk 103

The rest of the section ad- dresses the experimental evaluation goals as mentioned previously. 5.2 ACAS Xu Benchmark The ACAS Xu collision avoidance system benchmarks [8] consist of 45 networks (5×9 configurations) with 5 inputs, 6 layers, and 50 neurons per layer.

Chunk 104

We test Property P2 in the benchmarks which requires the network not to output “COC” as the optimal action. This means the network output is unsafe when the first output is the minimal.

Chunk 105

In this comparison, all methods filter regions (branches in ProbStar’s case) smaller than 10−5. We use 1000 initial sampling points and 100 iterative sampling points in our approach.

Chunk 106

As for the hyperparameters, we performed the grid search on ACAS network 1-6 for specification 2 and determine the optimal configuration to be pure dis- tributional sampling (0.0,1.0), depth 5, β = 0.75, α = 0.05. We therefore adopt these hyperparameter settings for the ACAS benchmark.

Chunk 107

The full illustration of the grid search can be viewed in Section A. Table 1 shows the results for all tools, where the best bounds and runtimes are marked in bold (and for all subsequent tables).

Chunk 108

From the table, we observe that BaB times out on all the benchmarks, and our method mostly generates better bounds than ProbStar. Specifically, our bounds uniformly outperforms ProbStar in the bounds for unknown probabilities Us −Ls, and our runtime is comparable with ProbStar.

Chunk 109

Besides, the parallel paradigm originated from CROWN yields substantial performance gains, with an average speedup of 4.8× compared to our non-parallel version. 5.3 Rocket Lander Controller Benchmark To further demonstrate the ability of our method to handle high-dimensional neural networks, we conduct evaluation on the Rocket Lander [22], which sim- ulates the vertical landing control of a SpaceX Falcon 9 first-stage rocket.

Chunk 110

The objective is to safely land the rocket on a sea-based platform from a given height, which requires precise control of both the rocket’s velocity and lateral angle. The control system outputs three components: a main engine thruster (FE ∈ [0, 1]) with an adjustable angle (ϕ), and two side nitrogen thrusters (FS ∈[−1, 1], where -1 and 1 indicate full throttle of right and left thruster respectively).

Chunk 111

The system state is represented by a 9-dimensional vector: (x, y, vx, vy, θ, ω, F ′ E, F ′ S, ϕ′), where (x, y) is the position relative to the landing center; (vx, vy) are the hor- izontal and vertical velocities; θ is the lateral angle; ω is the angular velocity; and (F ′ E, F ′ S, ϕ′) are the previous control actions. --- Page 12 --- 12 J.

Chunk 112

Li et al. Table 1.

Chunk 113

Results on ACAS Xu Benchmark for Property P2. Legend - Net: DNN benchmarks, Approach: our approach ("Ours"), ProbStar and BaB, Ls: the lower bound of the safety probability interval, Us: the upper bound of the safety probability interval, Us −Ls: upper bound for the unknown probability, Time (non parallel): the runtime without parallel computing in seconds, Time (parallel): the runtime with par- allel computing in seconds, T.O.: Time out, >2 hours.

Chunk 114

Net Approach Ls Us Us −Ls Time (non parallel) Time (parallel) 1-6 Ours 0.984714 1.000000 0.015286 124.21 46.58 1-6 ProbStar 0.933739 0.997192 0.063453 2840.35 315.75 1-6 BaB - - - T.O. T.O.

Chunk 115

2-2 Ours 0.924122 0.991622 0.067500 1248.70 240.79 2-2 ProbStar 0.892540 0.980435 0.087895 3869.90 430.20 2-2 BaB - - - T.O. T.O.

Chunk 116

2-9 Ours 0.954046 0.999941 0.045895 819.22 249.24 2-9 ProbStar 0.879616 0.999745 0.120129 6715.75 746.56 2-9 BaB - - - T.O. T.O.

Chunk 117

3-1 Ours 0.923320 0.972056 0.048735 949.94 189.76 3-1 ProbStar 0.913943 0.969500 0.055557 2462.36 273.73 3-1 BaB - - - T.O. T.O.

Chunk 118

3-6 Ours 0.914964 0.982248 0.067284 1808.11 359.85 3-6 ProbStar 0.879708 0.979217 0.099509 5191.90 577.16 3-6 BaB - - - T.O. T.O.

Chunk 119

3-7 Ours 0.923570 0.999888 0.076318 1817.89 386.33 3-7 ProbStar 0.911578 0.997681 0.086103 4187.35 465.49 3-7 BaB - - - T.O. T.O.

Chunk 120

4-1 Ours 0.961700 0.999953 0.038253 884.45 155.31 4-1 ProbStar 0.914955 0.998963 0.084008 3120.21 346.86 4-1 BaB - - - T.O. T.O.

Chunk 121

4-7 Ours 0.913432 0.979347 0.065914 1604.20 381.73 4-7 ProbStar 0.878505 0.979215 0.100710 3917.94 435.54 4-7 BaB - - - T.O. T.O.

Chunk 122

5-3 Ours 0.973373 1.000000 0.026627 336.51 81.21 5-3 ProbStar 0.953952 1.000000 0.046048 1306.79 145.27 5-3 BaB - - - T.O. T.O.

Chunk 123

--- Page 13 --- Probabilistic Verification of Neural Networks 13 We directly use the control policy network provided in ProbStar. The network has 9 inputs, 3 outputs, and 5 hidden layers containing 20 ReLU neurons each.

Chunk 124

We focus on two critical safety properties that ensure proper tilt control: – P1 (Right Tilt Prevention): For states where θ ∈[−20◦, −6◦], ω < 0, ϕ′ ≤0, and F ′ S ≤0, the network must output either ϕ < 0 or FS < 0. – P2 (Left Tilt Prevention): For states where θ ∈[6◦, 20◦], ω ≥0, ϕ′ ≥0, and F ′ S ≥0, the network must output either ϕ > 0 or FS > 0.

Chunk 125

As for the hyperparameters, we performed the grid search on Net 1 for specification 1 and determine the optimal configuration to be mixed sampling (wuniform = 0.25, wdistribution = 0.75), depth 5, β = 0.0, α = 0.05. We there- fore adopt these hyperparameter settings for this benchmark.

Chunk 126

The scatter point figure of the grid search can be viewed in Section A. We conduct experiments on the Rocket Lander controller to compare our method with ProbStar.

Chunk 127

In this comparison, all methods filter regions (branches in ProbStar’s case) smaller than 10−3. We do not run BaB since it cannot handle high-dimensional input due to inherent combinatorial explosion.

Chunk 128

For our method, we use 9000 initial sampling points and 900 iterative sampling points to adapt to the high input dimension of this benchmark. Similar to the setting in Table 1, we compare both the bounds and the runtime (including parallel and non-parallel).

Chunk 129

The results are shown in Table 2 (non-parallel) and Table 3 (parallel). From these tables, we observe that our method consistently generates significantly tighter bounds for the unknown probability (Us −Ls) across all configurations.

Chunk 130

Further- more, our method demonstrates superior time efficiency, achieving substantial runtime reductions in both non-parallel and parallel settings. Table 2.

Chunk 131

Experiments on Rocket Lander Controller (non-parallel). Net & Prop means different configurations in the original benchmark.

Chunk 132

Others are the same as Table 1. Net & Prop Approach Us −Ls Time (s) 0 & 1 ProbStar 0.715046 418.183 0 & 1 Ours 0.650548 171.933 0 & 2 ProbStar 0.831752 597.775 0 & 2 Ours 0.262993 85.482 1 & 1 ProbStar 0.554032 438.278 1 & 1 Ours 0.229874 70.723 1 & 2 ProbStar 0.356913 319.571 1 & 2 Ours 0.079113 23.287 5.4 Efficiency of Finding Probabilistic Hulls We compare the efficiency of our method and the BaB on finding safe and unsafe probabilistic hulls, since both of the methods use probabilistic hulls to estimate --- Page 14 --- 14 J.

Chunk 133

Li et al. Table 3.

Chunk 134

Experiments on Rocket Lander Controller (parallel). The legend is the same as Table 2.

Chunk 135

Net & Prop Approach Us −Ls Time (s) 0 & 1 ProbStar 0.715046 141.047 0 & 1 Ours 0.651642 33.767 0 & 2 ProbStar 0.831752 225.331 0 & 2 Ours 0.265237 23.573 1 & 1 ProbStar 0.554032 225.331 1 & 1 Ours 0.230962 18.900 1 & 2 ProbStar 0.356913 117.494 1 & 2 Ours 0.079063 12.947 the safe probability. We use the distilled DNNs [4] of the ones in the ACAS Xu set.

Chunk 136

All of the new DNNs have tanh activation function. Moreover, we consider the parallel setting for both our method and BaB.

Chunk 137

The hyperparameters are the same as in the original ACAS Xu experiments. The comparison is made in the following way.

Chunk 138

We set up an end time of 1800 seconds to let both of the methods find probabilistic hulls. Each of the methods may terminate either at the end time or when the early termination condition holds (i.e., the highest probability of the undetermined sets is less than the given threshold ε, similar to maxR∈Runknown P(R) ≤ε).

Chunk 139

Then we compare the runtime and the estimated upper and lower bounds for the safe probability. Clearly, the tighter the bounds, the better the result.

Chunk 140

We use ε = 10−5, and the experimental results are given in Table 4. It can be seen that our method always stops before the end time and the runtime are much lower than the basic BaB method.

Chunk 141

On the other hand, the bounds obtained from our method are also much tighter than those from the basic BaB ones. Hence, our method is much more efficient in finding safe and unsafe probabilistic hulls.

Chunk 142

5.5 Discussion Basically, our method can handle most of the feedforward DNNs with various activation functions, since we use the given DNN as a blackbox to produce the samples. This is a clear advantage over the state-of-the-art tools such as ProbStar which requires neural networks only to have ReLU activation functions.

Chunk 143

On the other hand, our boundary-aware subdivision demonstrate a great performance improvement in not only time efficiency but also accuracy comparing to the basic branch-and-bound method. In the experiments, we also find two weaknesses of our method.

Chunk 144

First, the total runtime highly depends on the time spent by CROWN which is used to verify the safety of a hull in the input space. It can be improved by designing a more efficient output range analysis technique which is particularly for the hulls --- Page 15 --- Probabilistic Verification of Neural Networks 15 Table 4.

Chunk 145

Efficiency of Finding Probabilistic Hulls (parallel). Legend - Time: runtime in seconds, End: end time = 1800 seconds, Others are the same as those in Table 1.

Chunk 146

Net Approach Ls Us Us −Ls Time (s) 1-6 Ours 0.9815 0.9995 0.0180 264.00 1-6 BaB 0.9397 1.0000 0.0603 End 2-2 Ours 0.9819 0.9997 0.0178 142.30 2-2 BaB 0.9616 1.0000 0.0384 End 2-9 Ours 0.9502 0.9994 0.0492 686.66 2-9 BaB 0.8896 1.0000 0.1104 End 3-1 Ours 0.9865 1.0000 0.0135 68.06 3-1 BaB 0.9845 1.0000 0.0155 1413.22 3-6 Ours 0.9293 0.9976 0.0682 1188.63 3-6 BaB 0.7738 1.0000 0.2262 End 3-7 Ours 0.8673 0.9448 0.0775 1338.55 3-7 BaB 0.7682 1.0000 0.2318 End 4-1 Ours 0.9864 1.0000 0.0136 59.52 4-1 BaB 0.9837 1.0000 0.0163 1019.67 4-7 Ours 0.9778 1.0000 0.0222 397.20 4-7 BaB 0.8993 1.0000 0.1007 End 5-3 Ours 0.9823 1.0000 0.0177 303.09 5-3 BaB 0.8886 1.0000 0.1114 End in our problem. Second, our method still suffers from the curse of dimensional- ity in the worst case, although the boundary-aware subdivision helps a lot on avoiding subdividing the non-critical regions.

Chunk 147

This aspect can be improved by considering symbolic representations of the hulls, such as Taylor models [5]. Besides, we observed that some hulls only have a very small intersection with the boundary but requires many subdivisions to find purely safe or unsafe subsets.

Chunk 148

For such a hull, we consider to develop shrinking methods that can fast find subsets by contracting the original hull. 6 Conclusion We present a novel approach of probabilistic verification of DNNs and the main idea is to efficiently find safe or unsafe probabilistic hulls using a boundary-aware subdivision strategy.

Chunk 149

Our method does not have a limitation to particular acti- vation functions and shows a much better performance than the similar existing techniques. In the future, we seek to improve the method by (i) developing more ef- ficient output range computation techniques particularly for the probabilistic hulls, to make the verification step less time-consuming; (ii) designing new sym- bolic representations for the hulls such that they are not limited to boxes or parallelotopes.

Chunk 150

--- Page 16 --- 16 J. Li et al.

Chunk 151

References 1. Bain, L.J., Engelhardt, M.: Introduction to Probability and Mathematical Statis- tics (2nd ed.).

Chunk 152

Cengage Learning (1992) 2. Boetius, D., Leue, S., Sutter, T.: Probabilistic verification of neural networks using branch and bound.

Chunk 153

CoRR abs/2405.17556 (2024) 3. Cardelli, L., Kwiatkowska, M., Laurenti, L., Paoletti, N., Patane, A., Wicker, M.: Statistical guarantees for the robustness of bayesian neural networks.

Chunk 154

In: Proceed- ings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019. pp.

Chunk 155

5693–5700 (2019) 4. Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: A survey.

Chunk 156

Int. J.

Chunk 157

Comput. Vis.

Chunk 158

129(6), 1789–1819 (2021) 5. Huang, C., Fan, J., Chen, X., Li, W., Zhu, Q.: POLAR: A polynomial arithmetic framework for verifying neural-network controlled systems.

Chunk 159

In: Bouajjani, A., Holík, L., Wu, Z. (eds.) Proceedings of the 20th International Symposium on Automated Technology for Verification and Analysis.

Chunk 160

Lecture Notes in Computer Science, vol. 13505, pp.

Chunk 161

414–430. Springer (2022) 6.

Chunk 162

Jaulin, L., Kieffer, M., Didrit, O., Walter, E.: Applied Interval Analysis. Springer (2001) 7.

Chunk 163

Katz, G., Barrett, C.W., Dill, D.L., Julian, K.D., Kochenderfer, M.J.: Reluplex: An efficient SMT solver for verifying deep neural networks. In: The 29th International Conference on Computer Aided Verification (CAV 2017).

Chunk 164

vol. 10426, pp.

Chunk 165

97–117. Springer (2017) 8.

Chunk 166

Manfredi, G., Jestin, Y.: An introduction to acas xu and the challenges ahead. In: 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC).

Chunk 167

pp. 1–9 (2016) 9.

Chunk 168

Mangal, R., Nori, A.V., Orso, A.: Robustness of neural networks: a probabilis- tic and practical approach. In: Proceedings of the 41st International Conference on Software Engineering: New Ideas and Emerging Results, ICSE (NIER) 2019, Canada.

Chunk 169

pp. 93–96 (2019) 10.

Chunk 170

Meng, M.H., Bai, G., Teo, S.G., Hou, Z., Xiao, Y., Lin, Y., Dong, J.S.: Adversarial robustness of deep neural networks: A survey from a formal verification perspective. CoRR abs/2206.12227 (2022) 11.

Chunk 171

Mitchell, T.M.: Machine Learning. McGraw-Hill Education (1997) 12.

Chunk 172

Pilipovsky, J., Sivaramakrishnan, V., Oishi, M., Tsiotras, P.: Probabilistic verifica- tion of relu neural networks via characteristic functions. In: Learning for Dynamics and Control Conference, L4DC 2023, 15-16 June 2023, Philadelphia, PA, USA.

Chunk 173

Proceedings of Machine Learning Research, vol. 211, pp.

Chunk 174

966–979 (2023) 13. Tambon, F., Laberge, G., An, L., Nikanjam, A., Mindom, P.S.N., Pequignot, Y., Khomh, F., Antoniol, G., Merlo, E., Laviolette, F.: How to certify machine learning based safety-critical systems?

Chunk 175

A systematic literature review. Autom.

Chunk 176

Softw. Eng.

Chunk 177

29(2), 38 (2022) 14. Tran, H., Choi, S., Okamoto, H., Hoxha, B., Fainekos, G., Prokhorov, D.V.: Quan- titative verification for neural networks using probstars.

Chunk 178

In: Proceedings of the 26th ACM International Conference on Hybrid Systems: Computation and Con- trol, HSCC 2023, USA,. pp.

Chunk 179

4:1–4:12 (2023) 15. Wang, S., Zhang, H., Xu, K., Lin, X., Jana, S.S., Hsieh, C.J., Kolter, J.Z.: Beta- crown: Efficient bound propagation with per-neuron split constraints for neural network robustness verification.

Chunk 180

In: 34th Annual Conference on Neural Information Processing Systems, (NeurIPS 2021). pp.

Chunk 181

29909–29921 (2021) --- Page 17 --- Probabilistic Verification of Neural Networks 17 16. Webb, S., Rainforth, T., Teh, Y.W., Kumar, M.P.: A statistical approach to as- sessing neural network robustness.

Chunk 182

In: 7th International Conference on Learning Representations, USA (2019) 17. Weng, T.W., Chen, P.Y., Nguyen, L.M., Squillante, M.S., Boopathy, A., Oseledets, I., Daniel, L.: Proven: Certifying robustness of neural networks with a probabilistic approach 97 (2019) 18.

Chunk 183

Wicker, M., Laurenti, L., Patane, A., Paoletti, N., Abate, A., Kwiatkowska, M.: Probabilistic reach-avoid for bayesian neural networks. Artif.

Chunk 184

Intell. 334, 104132 (2024) 19.

Chunk 185

Wicker, M., Patane, A., Laurenti, L., Kwiatkowska, M.: Adversarial robustness certification for bayesian neural networks. In: Formal Methods - 26th International Symposium, FM 2024, Milan, Italy, September 9-13, 2024, Proceedings, Part I.

Chunk 186

Lecture Notes in Computer Science, vol. 14933, pp.

Chunk 187

3–28 (2024) 20. Wu, H., Isac, O., Zeljic, A., Tagomori, T., Daggitt, M.L., Kokke, W., Refaeli, I., Amir, G., Julian, K., Bassan, S., Huang, P., Lahav, O., Wu, M., Zhang, M., Komendantskaya, E., Katz, G., Barrett, C.W.: Marabou 2.0: A versatile formal analyzer of neural networks.

Chunk 188

In: Computer Aided Verification - 36th International Conference, CAV 2024, Montreal, QC, Canada, July 24-27, 2024, Proceedings, Part II. Lecture Notes in Computer Science, vol.

Chunk 189

14682, pp. 249–264 (2024) 21.

Chunk 190

Xu, K., Shi, Z., Zhang, H., Wang, Y., Chang, K., Huang, M., Kailkhura, B., Lin, X., Hsieh, C.: Automatic perturbation analysis for scalable certified robustness and beyond. In: Advances in Neural Information Processing Systems 33: Annual Con- ference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (2020) 22.

Chunk 191

Yang, X., Yamaguchi, T., Tran, H., Hoxha, B., Johnson, T.T., Prokhorov, D.V.: Neural network repair with reachability analysis. In: Formal Modeling and Analy- sis of Timed Systems - 20th International Conference, FORMATS 2022, Warsaw, Poland, September 13-15, 2022, Proceedings.

Chunk 192

Lecture Notes in Computer Science, vol. 13465, pp.

Chunk 193

221–236. Springer (2022) 23.

Chunk 194

Zhang, H., Weng, T.W., Chen, P.Y., Hsieh, C.J., Daniel, L.: Efficient neural net- work robustness certification with general activation functions. In: 31st Annual Conference on Neural Information Processing Systems 2018, (NeurIPS 2018).

Chunk 195

pp. 4944–4953 (2018) A Hyperparameter and Grid Search We performed an extensive grid search to determine the best configuration for the core components of our approach: the sampling strategy, the impurity scheduler, and the regression tree depth.

Chunk 196

A.1 ACAS Xu Benchmark For the ACAS Xu benchmark, the search space was defined as follows. – Sampling Weights (wuniform, wdistribution): Five configurations were tested.

Chunk 197

(0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), and (1.0, 0.0). – Regression Tree Depth: Two values were considered.

Chunk 198

5 and 10. – Impurity Splitting (β): Five values were tested.

Chunk 199

0, 0.25, 0.5, 0.75, and 1.0. --- Page 18 --- 18 J.

Chunk 200

Li et al. – Impurity Coefficient (α): Six values were evaluated.

Chunk 201

0.05, 0.1, 0.3, 0.5, 1.0, and 2.0. This parameter space resulted in 25 impurity configurations and a total of 250 parameter combinations.

Chunk 202

The evaluation metrics for each combination were the remaining unknown region probability (Us −Ls) and the corresponding runtime. The grid search was performed on ACAS network 1-6 for specification 2.

Chunk 203

Figure 4 presents the results, mapping the trade-off between Us −Ls (vertical axis) and runtime (horizontal axis). Each point represents a distinct configura- tion.

Chunk 204

The red star highlights the Pareto optimal configuration, which outper- forms all other combinations in both metrics. Gray points indicate configurations that are dominated by the Pareto solution in at least one metric.

Chunk 205

The search identified the optimal configuration for the ACAS benchmark as pure distributional sampling (0.0, 1.0), a regression tree depth of 5, β = 0.75, and α = 0.05. These settings were subsequently adopted for the main experiments on the ACAS Xu benchmark.

Chunk 206

0 200 400 600 800 1000 1200 Runtime (seconds) 0.05 0.10 0.15 0.20 Us-Ls Pareto Optimal ((0.00, 1.00), 5, (0.05, 0.75)) Other Configurations Pareto Optimal (Best & Fastest) Fig. 4.

Chunk 207

Scatter plot showing the relationship between Us-Ls and runtime for 250 dif- ferent parameter configurations tested in the grid search. A.2 Rocket Lander Benchmark The same systematic grid search methodology was applied to the Rocket Lander controller.

Chunk 208

The search space for this benchmark was defined as follows. – Sampling Weights (wuniform, wdistribution).

Chunk 209

The same five configurations were tested. (0.0, 1.0), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), and (1.0, 0.0).

Chunk 210

--- Page 19 --- Probabilistic Verification of Neural Networks 19 20 40 60 80 100 Runtime (seconds) 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Us-Ls Pareto Optimal ((0.25, 0.75), 5, (0.05, 0.00)) Other Configurations Pareto Optimal (Best & Fastest) Fig. 5.

Chunk 211

Scatter plot showing the relationship between Us-Ls and runtime for 300 dif- ferent parameter configurations tested in the grid search. – Regression Tree Depth.

Chunk 212

Three values were considered. 5, 10, and 20.

Chunk 213

– Impurity Splitting (β). Five values were tested.

Chunk 214

0, 0.25, 0.5, 0.75, and 1.0. – Impurity Coefficient (α).

Chunk 215

Five values were evaluated. 0.05, 0.1, 0.3, 0.5, and 1.0.

Chunk 216

This yielded a total of 300 parameter combinations. The grid search was per- formed on Rocket Net 1 for specification 1.

Chunk 217

Figure 5 presents the corresponding Pareto front analysis. Consistent with the ACAS results, the Us −Ls is plotted against the runtime, and the red star marks the Pareto optimal configuration.

Chunk 218

The optimal configuration determined for the Rocket Lander benchmark was a *mixed sampling strategy (0.25, 0.75), a regression tree depth of 5, β = 0.0, and α = 0.05**. These settings were adopted for the main experiments on the Rocket Lander benchmark.

Chunk 219

B Ablation Studies We conduct a comprehensive ablation study to analyze the independent con- tribution and combinatorial effects of the key components in our proposed ap- proach. The components investigated include the weighted mixed sampling and the impurity scheduler defined by α and β in our hyperparameter settings.

Chunk 220

The ablation studies are conducted on both the high-dimensional Rocket Lander and the moderate-dimensional ACAS Xu benchmarks. The following sections detail the findings, quantifying the impact of each mechanism on verification tightness (Us −Ls) and time efficiency across these diverse benchmarks.

Chunk 221

--- Page 20 --- 20 J. Li et al.

Chunk 222

Table 5. Ablation Study on Rocket Lander and ACAS Xu (Average Results).

Chunk 223

The table compares the average unknown probability (Us −Ls) and average runtime for different component configurations on two benchmarks. The weights are wu (uniform) and wd (distributional).

Chunk 224

Configuration (wu, wd) Rocket Lander ACAS Xu Avg. Us −Ls Avg.

Chunk 225

Time (s) Avg. Us −Ls Avg.

Chunk 226

Time (s) Baseline Configurations (Impurity Scheduler Disabled) Baseline (1.00, 0.00) 0.9194 30.26 0.0650 279.46 (0.00, 1.00) 0.9242 24.86 0.0681 279.05 (0.75, 0.25) 0.9134 30.74 0.0652 277.95 (0.50, 0.50) 0.9124 29.22 0.0664 279.09 (0.25, 0.75) 0.9124 27.80 0.0662 278.14 Configurations with Impurity Scheduler Enabled + Impurity Scheduler (1.00, 0.00)∗ 0.2466 23.17 0.0341 211.61 (0.00, 1.00) 0.3294 25.23 0.0368 221.05 (0.75, 0.25) 0.2371 21.25 0.0340 210.33 (0.50, 0.50) 0.2705 22.02 0.0344 214.28 (0.25, 0.75) 0.2576 21.71 0.0350 218.96 ∗For ACAS Xu, the impurity scheduler uses β = 0.75 and α = 0.05 as the representative Impurity-only configuration for comparison. For Rocket Lander, the impurity scheduler uses β = 0.0 and α = 0.05 as the representative Impurity-only configuration for comparison.

Chunk 227

B.1 Contribution of Sampling and Impurity Scheduling The analysis of results presented in Table 5 focuses on the core question of whether sampling alone is effective and how the Impurity Scheduler contributes to the final performance. The data reveals that the Impurity Scheduler is the critical component for achieving high verification accuracy (Us −Ls).

Chunk 228

When the Impurity Scheduler is disabled, varying the sampling ratio only provides marginal improvements in Us −Ls. For the Rocket Lander, the best sampling ratio (wu = 0.25 or wu = 0.50) achieves 0.9124, barely better than the worst sampling ratio (0.9194).

Chunk 229

For ACAS Xu, the sampling variance shows negligible impact on accuracy. However, enabling the Impurity Scheduler changes the performance profile.

Chunk 230

Firstly, for the pure uniform sampling case (wu = 1.00), Us −Ls is reduced from 0.9194 to 0.2466 for Rocket Lander and from 0.0650 to 0.0341 for ACAS Xu. This confirms that the scheduler can significantly tighten the verification bounds.

Chunk 231

Furthermore, when the Impurity Scheduler is active, optimizing the mixed sampling ratio becomes relevant, providing a measurable gain. For both bench- marks, the (0.75, 0.25) mixed ratio achieves the tightest overall bound (0.2371 for Rocket Lander and 0.0340 for ACAS Xu).

Chunk 232

This indicates that while sampling is not the main driver of accuracy, an optimal mix of uniform and distributional sampling is necessary for the final refinement. --- Page 21 --- Probabilistic Verification of Neural Networks 21 Table 6.

Chunk 233

Ablation Study on Impurity Scheduler Threshold (β). The table compares the performance (Us −Ls and Runtime) of the Impurity Scheduler by varying the threshold β, which controls the transition from longest-dimension splitting (β = 1.0) to soft-coded impurity splitting (β < 1.0).

Chunk 234

Only pure uniform sampling (wu = 1.00) is used. Threshold β Splitting Mode Rocket Lander ACAS Xu Avg.

Chunk 235

Us −Ls Avg. Time (s) Avg.

Chunk 236

Us −Ls Avg. Time (s) Baseline (β Implied) Longest-Dimension 0.9194 30.26 0.0650 279.46 0.75 Late Activation 0.9194 30.24 0.0341 211.61 0.50 Mid Activation 0.9194 30.53 0.0342 195.23 0.25 Early Activation 0.9194 30.29 0.0378 210.71 0.00 Immediate Soft Impurity 0.2466 23.17 0.0610 349.11 Regarding runtime efficiency, the runtime for nearly all configurations, re- mains within a similar order of magnitude on both benchmarks.

Chunk 237

For instance, the Rocket Lander configurations generally fall between 21s and 31s, and the ACAS Xu configurations are mostly between 210s and 280s. This suggests that the major performance contribution of the Impurity Scheduler lies in the massive improvement of accuracy (Us −Ls) convergence, while overall execution time remains relatively constant across effective configurations.

Chunk 238

B.2 Ablation on Impurity Scheduler Threshold (β) Given the critical role of the Impurity Scheduler, we conduct a focused ablation study by varying the threshold β, which controls the strategy for transitioning from standard longest-dimension splitting to the proposed soft-coded impurity measure (α-weighted splitting). The results, using pure uniform sampling (wu = 1.00), are summarized in Table 6.

Chunk 239

The optimal transition strategy is found to be benchmark-dependent. For Rocket Lander benchmark, the critical finding is that the Immediate Soft Impurity (β = 0.00) mode is optimal.

Chunk 240

This configuration achieves the tightest bound of 0.2466 and the best runtime of 23.17s within this set. This suggests that the longest-dimension splitting provides minimal structural benefit, and immediate application of the boundary-aware measure is key.

Chunk 241

As for ACAS Xu, a hybrid splitting strategy is essential. The Immediate Soft Impurity (β = 0.00) yields a poor Us −Ls of 0.0610 and the longest runtime of 349.11s.

Chunk 242

In contrast, the best accuracy (0.0341) is achieved when β = 0.75 (Late Activation), allowing the longest-dimension splitting to run for the initial portion of the verification. The best runtime (195.23s) is achieved at β = 0.50 (Mid Activation).

Chunk 243

This confirms that for ACAS Xu, initial global refinement via standard splitting significantly accelerates the process before the fine-grained, boundary-aware splits are activated. The distinct optimal β values can be explained by the fundamental differences in input dimensionality and the subsequent geometric challenges presented to the verification method (which relies on convex hull approximations): --- Page 22 --- 22 J.

Chunk 244

Li et al. 1.

Chunk 245

Geometric Control: In methods utilizing convex hull approximations, the Longest-Dimension splitting is employed primarily to maintain a low inter- dimension ratio for the divided regions. Preventing regions from becoming highly elongated is vital, as this mitigates the over-approximation error that occurs when hull methods are applied to non-square regions.

Chunk 246

2. High Dimensionality (Rocket Lander): The 9-dimensional space of Rocket Lander limits the effectiveness of pure Longest-Dimension splitting.

Chunk 247

The soft impurity measure from the beginning is necessary because its ge- ometric penalty term (Lα) simultaneously guides the split toward reducing boundary uncertainty (via MSE) and acts to regularize the region’s aspect ratio by penalizing splits across large dimensions (L). This combined effi- ciency makes immediate activation optimal, as the structural splitting stage provides little benefit.

Chunk 248

3. Initial Refinement Need (ACAS Xu): In the lower 5-dimensional ACAS Xu space, the Longest-Dimension splitting remains an efficient method for achieving an initial, coarse, global subdivision.

Chunk 249

This initial phase (enabled by β = 0.75 or β = 0.50) establishes regions with more uniform aspect ratios. This preprocessing ensures that when the fine-grained soft impurity measure is activated later, it operates on geometrically well-behaved regions, thereby improving both accuracy and overall convergence speed.

Chunk 250

This analysis confirms that our impurity scheduler is a highly effective compo- nent, but its optimal configuration requires careful tuning based on the specific dimensional and structural characteristics of the benchmark.

Back to search