Quantum Gate Pattern Recognition and Circuit Optimization for Scientific Applications

There is no unique way to encode a quantum algorithm into a quantum circuit. With limited qubit counts, connectivities, and coherence times, circuit optimization is essential to make the best use of near-term quantum devices. We introduce two separate ideas for circuit optimization and combine them in a multi-tiered quantum circuit optimization protocol called AQCEL. The first ingredient is a technique to recognize repeated patterns of quantum gates, opening up the possibility of future hardware co-optimization. The second ingredient is an approach to reduce circuit complexity by identifying zero- or low-amplitude computational basis states and redundant gates. As a demonstration, AQCEL is deployed on an iterative and efficient quantum algorithm designed to model final state radiation in high energy physics. For this algorithm, our optimization scheme brings a significant reduction in the gate count without losing any accuracy compared to the original circuit. Additionally, we have investigated whether this can be demonstrated on a quantum computer using polynomial resources. Our technique is generic and can be useful for a wide variety of quantum algorithms.


Introduction
Recent technology advances have resulted in a variety of universal quantum computers that are being used to implement quantum algorithms.
Koji Terashi: koji.terashi@cern.ch However, these noisy-intermediate-scale quantum (NISQ) devices [53] may not have sufficient qubit counts or qubit connectivity and may not have the capability to stay coherent for entirety of the operations in a particular algorithm implementation. Despite these challenges, a variety of applications have emerged across science and industry. For example, there are many promising studies in experimental and theoretical high energy physics (HEP) for exploiting quantum computers. These studies include event classification [5,7,24,42,64,72], reconstructions of charged particle trajectories [2,55,65,71] and physics objects [13,69], unfolding measured distributions [12] as well as simulation of multiparticle emission processes [47,54]. A common feature of all of these algorithms is that only simplified versions can be run on existing hardware due to the limitations mentioned above.
There are generically two strategies for improving the performance of NISQ computers to execute existing quantum algorithms. One strategy is to mitigate errors through active or passive modifications to the quantum state preparation and measurement protocols. For example, readout errors can be mitigated through postprocessing steps [4,6,8,9,14,15,19,20,22,26,29,31,35,39,40,46,49,60,66,70] and gate errors can be mitigated by systematically enlarging errors before extrapolating to zero error [16,18,30,34,51,63]. A complementary strategy to error mitigation, that is the focus of this paper, is circuit optimization, also known as circuit compilation. In particular, there is no unique way to encode a quantum algorithm into a set of gates, and certain realizations of an algorithm may be better-suited for a given quan-tum device. One widely used tool is t|ket [57], which contains a variety of architecture-agnostic and architecture-specific routines. For example, Clifford identities such as CNOT 2 = Identity are automatically recognized. There are also a variety of other toolkits for circuit optimization, including hardware-specific packages for quantum circuits [1, 10, 21, 23, 28, 32, 36-38, 41, 43-45, 48, 52, 58, 59, 61, 62, 68]. Since t|ket is a generic framework that contains many algorithms that have already been benchmarked against other procedures, it will serve as our baseline.
We introduce two techniques that can be used to optimize circuits and that are complementary to existing methods. The first focuses on the identification of recurring sets of quantum gates in a circuit. Identifying such recurring sets of gates (RSG) can be very important, since any optimization of these RSGs has an enhanced effect on the overall circuit. Furthermore, identifying recurring gate sets can be useful for future hardware optimizations where the fidelity of certain common operations can be enhanced at the expense of other, less frequent operations. Depending on the operation(s), this optimization could be at the level of microwave pulse controls or it may require custom hardware architectures.
The second technique optimizes a generic circuit by eliminating unnecessary gates or unused qubits such that the circuit depth becomes as short as possible. One example where such an optimization can lead to simplifications is a case where a quantum circuit has been designed with complete generality in mind. In this case, for a certain initial state the circuit only reaches a selected set of intermediate states such that some operations become trivial and can be eliminated. The elimination of unnecessary gate operations introduced here focuses on controlled operations such as a Toffoli or a CNOT gate in a quantum circuit. The heart of the elimination technique resides in the identification of zero-or lowamplitude computational basis states, that allows us to determine whether the entire gate or (part of) qubit controls can be removed. Ref. [38] proposed a similar technique to remove control gates with a quantum state analysis that identifies X-, Y -or Z-basis states. In addition, Ref. [38] accounts for the basis states on target qubits to further simplify the circuit. Our approach fo-  Figure 1: Flowchart of the proposed optimization protocol. The first part is the RSG pattern recognition, in which the circuit is converted into a directed acyclic graph (DAG) to identify recurring quantum gates. In the second part, we eliminate unnecessary gates and unused qubits through a heuristic approach. Finally, the resulting circuit can be encoded into particular gates for specific hardware.
cuses only on Z-basis states on control qubits, but features a unique capability to perform the state determination using polynomial resources with a quantum hardware.
These two techniques are combined in an optimization protocol, called Aqcel (and pronounced "excel") for Advancing Quantum Circuit by icEpp and Lbnl, and are presented in this paper. To demonstrate the effectiveness of the Aqcel protocol, we will use a quantum algorithm from HEP to perform a calculation in Quantum Field Theory. The particular algorithm that we study models a parton shower, which is the collinear final state radiation from energetic charged (under any force) particles [47]. This algorithm is a useful benchmark because it provides an exponential speedup over the most efficient known classical algorithm and the circuit depth can be tuned for precision. While we show results for this specific circuit, the proposed protocol has a wide range of applicability for quantum computing applications across science and industry. This paper is organized as follows. Section 2 provides an overview of the Aqcel protocol. The application of this protocol to the HEP example is presented in Sec. 3. Following a brief discussion in Sec. 4, the paper concludes in Sec. 5.

Aqcel optimization protocol
As already mentioned, the Aqcel protocol comprises two components: identification of recur-  ring quantum gates (Sec. 2.1) and elimination of unnecessary gates and unused qubits (Sec. 2.2). This approach focuses on circuit optimization at the algorithmic level, instead of at the level of a specific implementation using native gates for a particular quantum device. A high-level flowchart for our protocol is presented in Fig. 1.
The individual optimization steps are described below.

Gate set pattern recognition
First, the Aqcel attempts to identify gate set patterns in an arbitrary quantum circuit and extract RSGs from the circuit.

Representation in directed acyclic graph
In a quantum circuit, individual qubits are manipulated sequentially by gate operations, meaning that the quantum state represented at a certain point of the circuit should not be affected by gate operations applied afterward (at a later point in the circuit). Such a structure can be described by a directed acyclic graph (DAG). A DAG allows us to easily check dependencies between qubits and extract a subset of the circuit that functions for certain tasks. First, we convert a quantum circuit to the form of a DAG using the DAGCircuit class in Qiskit Terra API, where a node represents an operation by a quantum gate and an edge that connects the nodes represents a qubit. In the case of a Toffoli gate, the node corresponding to the Toffoli gate has three incoming edges (qubits before the gate operation) and three outgoing edges (qubits after the gate operation). Figure 2 shows an example circuit containing a Toffoli gate and its corresponding DAG.
The gate set pattern recognition can be resolved through the DAG representation. The identity of the RSG functionality can be ensured by checking the identity of DAGs of two circuits, as a graph isomorphism problem. The algorithm of gate set pattern recognition consists of two steps: (1) finding RSG candidates with DAG representation using depth-first search with heuristic pruning, and (2) checking the DAG isomorphism by graph hashing with Weisfeiler Lehman graph hash [56], as implemented in the NetworkX library [25]. The details of the gate set pattern recognition including computational complexity are given in Appendix A, with the pseudocode of the algorithm.

Tiered extraction of recurring gate sets
The appearance pattern of RSGs in a quantum circuit may depend on specific encoding of the quantum algorithm. To account for different patterns, we consider three different levels of matching criteria to define the recurrence of quantum gates: Level 1 : Only matching in gate types, Level 2 : Matching in gate types and the roles of qubits that the gates act on, Level 3 : Matching in gate types and both roles and indices of qubits that the gates act on.
The matching criterion in Level 1 is the least stringent: it just identifies the same sets of quantum gates appearing in the circuit, irrespective of which qubits they act on. The Level 2 is more strict and ensures that the qubits the RSGs act on have the same roles. In other words, the qubit connections between the gates inside a single RSG are maintained but the qubit indices might vary between the RSGs. The Level 3 applies the most stringent condition, where the qubits that the RSGs act on must have the same roles and qubit indices, that is, the RSGs must appear on the identical set of qubits in the circuit. The appearance patterns of the RSGs are illustrated in Fig. 3 for the three matching criteria.
The identified RSGs are ranked in terms of the product of the number of gates constituting the set and the number of occurrence of the set in the circuit. A specified number of top-ranked RSGs are extracted from the circuit in this step.

Heuristic circuit optimization
After attempting to identify RSGs in the circuit, a heuristic optimization procedure takes place to make the circuit depth as short as possible by eliminating redundant gates or unused qubits. In this step, we consider two levels of optimization: where |· ctl denotes the state of the control qubits, while the unlabeled ket corresponds to the rest of the system. We write the states as integers with 0 < j < 2 m − 1 and 0 < k < 2 n−m − 1.
We assume that the controlled operation for the gate is applied when all control qubits are in the |1 state, which corresponds to the state |j ctl = |11 . . . 1 = |2 m − 1 ctl . This allows one to classify the state of the system into three general classes using the amplitudes c j,k : Triggering : c j,k = 0 if and only if j = 2 m − 1.
The controlled operation of the gate in question is applied for all computational bases in the superposition.
Non-triggering : c 2 m −1,k = 0 for all k. The controlled operation is never applied.

Undetermined :
The state is neither triggering nor non-triggering.
A circuit containing triggering or nontriggering controlled gates can be simplified by removing all controls (triggering case) or by eliminating the gates entirely (non-triggering case). While an undetermined single-qubit controlled gate cannot be simplified under the current scheme, an undetermined multi-qubit controlled gate can be by removing the controls on some of the qubits, if the state of the system satisfies the condition described in Appendix B.
As an example of this concept, consider the following simple circuit: If the second qubit is in the initial state |0 , the first CNOT gate has no effect and can be removed from the circuit as the |0 is the non-triggering state of CNOT. The second qubit before the second CNOT gate is in the state |1 , which is the triggering state. Therefore, the qubit control can be removed from the second CNOT gate. The first two qubits before the Toffoli gate are in the superposition of |01 and |11 , which is an undetermined state for the Toffoli gate. Since the Toffoli gate has a triggering bitstring {11}, and the second qubit is always in the |1 state, this second qubit control can be removed from the Toffoli gate, replacing it with a CNOT gate controlled only on the first qubit. The heuristic circuit optimization therefore requires, for each controlled gate, the identification of possible states the control qubits can take, and the removal of unnecessary parts of the controlled operations. These two steps are discussed in detail in the following.
It is well known that an arbitrary multi-qubit controlled-U gate with m control qubits can be decomposed into O(m) Toffoli and controlled-U gates [3]. Therefore, in the remainder of this paper, we assume that all controlled gates are reduced to Toffoli gates denoted as C 2 [X], and singly-controlled unitary operation denoted as C [U ]. This implies that the only triggering bitstrings we need to consider are either {1} or {11}. For a n-qubit circuit composed of N multi-qubit controlled-U gates, each having at most n control qubits, this decomposition results in at most N = nN controlled gates.

Identification of computational basis states
In general, a circuit consisting of n qubits creates a quantum state described by a superposition of all of the 2 n computational basis states. However, it is rather common that a specific circuit produces a quantum state where only a subset of the computational basis states has nonzero amplitudes. Moreover, the number of finite-amplitude basis states depends on the initial state. This is why the three classes of the states of the system arise.
The state classification at each controlled gate can be determined either through a classical simulation or by measuring the control qubits repeatedly. In the case of a classical simulation, one can either perform the full calculation of the amplitudes, or simply track all the computational basis states whose amplitudes may be nonzero at each point of the circuit without the calculation of the amplitudes. Aqcel adopts the latter method in the interest of the lowering the computational re-source requirement. When instead the quantum measurements are used, the circuit is truncated right before the controlled gate in question, and the control qubits are measured repeatedly at the truncation point. Finiteness of the relevant amplitudes can be inferred from the distribution of the obtained bitstrings, albeit within the statistical uncertainty of the measurements.
A few notes should be taken on the computational costs of the two methods. Consider an n-qubit circuit with N controlled gates. As discussed before, reducing this to either More details on the estimates of the computational resource necessary for the identification of computational basis states, as well as other optimization steps, are described in Appendix C.

Elimination of redundant controlled operations
Once the nonzero-amplitude computational basis states are identified at each controlled gate, we remove the gate or its controls if possible. When using classical simulation, the entire circuit is analyzed first before the control elimination step. When quantum measurements are instead used, circuit execution, measurements, and circuit optimization are performed separately at each controlled gate.
The control elimination step for each controlled gate proceeds as follows. For a C[U ] gate, compute the probability of observing |1 of the control qubit. If that probability is 1, eliminate the control and only keep the single unitary gate U . If the probability is 0, remove the controlled gate from the circuit. In all other cases, keep the controlled gate. For a C 2 [X] (Toffoli) gate, compute the probabilities of the four possible states |00 , |01 , |10 , and |11 . If the probability of |11 is 1, remove the two controls and only keep the X gate. If the probability of |11 is 0, remove the entire Toffoli gate. If neither of those two conditions are true (the undetermined class), it is still possible to eliminate one of the two controls. This is true if the probability of the state |01 (|10 ) is zero, in which case one can eliminate the first (second) control. The following pseudocode is the full algorithm for redundant controlled operations removal. Note that for noisy quantum circuits the measurements of the states will not be exact, and one expects errors in the probabilities to observe certain bitstrings. This means that one has to impose thresholds when deciding whether we call the state triggering, non-triggering or undetermined. Once such a threshold has been decided, the number of measurements required has to be large enough for the statistical uncertainty to be smaller than this threshold. This will be discussed in more detail in Sec. 3 when we give explicit examples.
The computational cost of determining whether we can eliminate controls or the entire controlled operation is easily determined. Given the measured bitstrings, which as discussed in the previous section can be determined with O(Ñ 2 M ) operations, one can compute the probabilities for each possible bitstring, and therefore decide whether to simplify a controlled operation using O(Ñ ) operations. Some more details about the resource scaling are given in Appendix C.
Note that superfluous controlled operations can also be found and eliminated using the ZXcalculus [11,17]. In fact, the ZX-calculus is complete in the formal logic sense of the word, such that one can always prove that an unnecessary gate can be removed using the ZX-calculus. However, in general this scheme requires exponential resources, and therefore has no scaling advantage with respect to simply computing the state vectors. Nevertheless, the ZX-calculus is still incredibly powerful and underlies many of the optimization techniques of quantum transpilers, such as the t|ket compiler we compare to later.

Elimination of adjacent gate pairs
Note that if a unitary operator A and its Hermitian conjugate A † act on the same set of qubits adjacently, resulting in an identity operation, the gates implementing these operators can be removed from the circuit. While this is an obvious simplification, the removal of gates through the optimization steps described above can result in a circuit with such canceling gate pairs. For this reason, this step of gate reduction is applied before and after eliminating redundant controlled operations.

Elimination of unused qubits
After taking the above steps, the circuit is examined for qubits where no gate is applied at all. If found, such qubits can be safely removed from the circuit. Such a situation occurs e.g., when a quantum circuit designed to work universally with different initial states is executed using a specific initial state. An example of such a circuit is the sequential algorithm we consider in the next section.

Application to quantum algorithm
The circuit optimization protocol described in Sec. 2 has been deployed to a quantum algorithm designed for HEP [47]. The heuristic optimization (Sec. 2.2) is performed at Level 1 for the optimization on existing quantum hardware. In our results, we present how many gates are removed in three steps of the heuristic optimization, namely: •

Quantum parton shower algorithm
Simulating quantum field theories is a flagship scientific application of quantum computing. It has been shown that a generic scattering process can be efficiently simulated on a quantum computer with polynomial resources [33]. However, such circuits require prohibitive resources in the context of near-term devices. A complementary approach is to simulate one component of the scattering process. In particular, Ref. [47] proposed an algorithm to simulate the collinear radiation from particles that carry a nonzero fundamental charge. Such radiation approximately factorizes from the rest of the scattering amplitude and can therefore be treated independently. This factorization is the basis for parton shower Monte Carlo generators in HEP. The quantum parton shower (QPS) algorithm provides an exponential speedup over known algorithms when the charge is not the same for all particles that can radiate. Figure 4: The m-th step of the quantum circuit for the algorithm proposed in Ref. [47]. There are three physical registers: |p containing the set of particles at this step; |h for the branching history; and |e which is a binary variable representing the presence or absence of an emission at this step. The three lower registers count the number of particles of type φ, a, and b and are uncomputed before the end of the circuit. The exact form of the rotation matrices R (m) and the unitary operations can be found in Ref. [47].
The particular example demonstrated in Ref. [47] starts with n fermions that can be either type f 1 or f 2 . These fermions can radiate a scalar particle φ, which itself can split into a fermion-antifermion pair (of the same or different type). The relevant parameters are the three couplings g 1 , g 2 , and g 12 between f 1 and φ, f 2 and φ, and f 1f2 (f 1 f 2 ) and φ, respectively, where antifermions are denoted by a bar above the fermion symbol f . The shower evolution is discretized into N evol steps and at each step, one of the particles could radiate/split or nothing happens. This produces a precise result when N evol is large. Figure 4 shows the quantum circuit block for the m-th step of the quantum circuit. First, the fermions are rotated into a new basis f a and f b where the effective mixing g ab between f afb (f a f b ) and φ is zero. Then, the number of particles of each type is counted and stored in registers n a , n b , and n φ . Next, a Sudakov factor is calculated to determine if an emission happens or not. This operation depends only on the total number of particles of each type. After the emission step, the particle and history registers are modified depending on the emission. Lastly, the fermions are rotated back into the f 1 and f 2 basis. Some of the steps in this algorithm are universal (independent of m) and some dependent on m due to the running of coupling constants with the energy scale.

Experimental setup
The QPS simulation is implemented into a quantum circuit using IBM Qiskit version 0.21.0 [1] with Terra 0.15.2, Aer 0.6.1 and Ignis 0.4.0 APIs in Python 3.8 [67]. First, we attempt to optimize the circuits running on a classical computer with a single 2.4 GHz Intel core i5 processor.
In order to evaluate the Aqcel performance, the same QPS circuit optimized using t|ket in pytket 0.6.1 before transpilation is used as a reference. The optimization using t|ket is done as follows. We consider the list of ten pre-defined passes 1 . The passes are tried one by one on the QPS circuit, and the one that reduces the number of gates the most is applied to the circuit. The same set of passes are tried again on the resulting circuit to identify and apply the pass that most effectively reduces the gate count. This iterative process is repeated until the gate count is no longer reduced by any of the passes. The selected sequence of passes is used for evaluating the t|ket performance in the remainder of the studies.
The QPS algorithm is executed on the 27qubit IBM's ibmq_sydney device, one of the IBM Quantum Falcon Processors, and the statevector simulator in Qiskit Aer with and without optimizing the circuit. For the results obtained solely from the statevector simulator, all the qubits are assumed to be connected to each other (referred to as the ideal topology). When executing the algorithm on ibmq_sydney, the gates in the circuit are transformed into machinenative single-and two-qubit gates, and the qubits are mapped to the hardware, accounting for the actual qubit connectivity. For all the circuits tested with ibmq_sydney below, the noiseadaptive mapping is performed according to the read-out and CNOT gate errors from the calibration data as well as the qubit connection constraints 2 . Gate cancellations also take place at 1 The following 10 pre-defined passes are considered for the t|ket optimization: EulerAngleReduction(OpType.Rz,OpType.Rx), RemoveRedundancies, GuidedPauliSimp, SquashHQS, FlattenRegisters, Opti-misePhaseGadgets, KAKDecomposition, USquashIBM, CliffordSimp, FullPeepholeOptimise. Two more passes, RebaseIBM, CommuteThroughMultis, are also used once before selecting the pass from the list, which can be found at https://cqcl.github.io/pytket/build/html/ passes.html. 2 This corresponds to the transpilation of level 3 pass this stage using the commutativity of native gates and unitary synthesis, as documented in Qiskit Terra API. This qubit mapping and gate cancellation process are repeated eleven times, and the circuit obtained with the smallest number of gates is finally tested with ibmq_sydney.

Results
3.3.1 Circuit optimization for N evol = 2 branching steps using classical simulation Circuit optimization performance of Aqcel is evaluated for a quantum circuit of the QPS simulation with N evol = 2 branching steps assuming an ideal topology. The simulation does not consider any effects from hardware noise. The initial state is chosen to be |f 1 , and the coupling constants are set to g 1 = 2 and g 2 = g 12 = 1. Both f → f φ and φ → ff processes are considered 3 . The original circuit constructed using Qiskit is shown in Fig. 5.
First, the RSG pattern recognition is performed against the circuit. When the Level 2 RSG pattern recognition is applied, two RSGs are identified, as also shown in Fig. 5, with the requirements on the number of nodes in each RSG being between 5 and 7 and the number of repetitions being 4 or more. If the matching level is raised from Level 2 to 3, candidate patterns with smaller numbers of nodes or repetitions are generally found.
Next, the heuristic optimization (Sec. 2.2) is performed over the entire circuit at Level 1. This step consists of identifying nonzero-amplitude computational basis states, removing redundant controlled operations, removing adjacent canceling gate pairs (performed twice), and removing unused qubits. Nonzero-amplitude computational basis states are identified through classical calculation.
After the algorithmic level circuit optimization, the quantum gates in the circuit are decomposed into single-qubit gates (U 1 , U 2 , U 3 ) and CNOT gates. Figure 6 shows the numbers of the singlequbit and CNOT gates, the sum of the two, and the depth of the circuit before and after the optimization. The circuit depth is defined as the manager, as implemented in Qiskit Terra. length of the longest path from the input to the measurement gates, with each gate counted as a unit, as implemented in Qiskit. The figure compares the values from the original circuit, the circuit optimized with t|ket only, that with Aqcel only, and that with the combination of the two. The Aqcel optimizer reduces the total number of gates by 52%, resulting in a 50% reduction of the circuit depth. In particular, the reduction of the number of CNOT gates is 47%. This compares to t|ket , which reduces the total number of gates by 23%, CNOT by 1%, and the circuit depth by 8%. This means that, for the QPS algorithm, Aqcel is 38% more efficient than t|ket in reducing the gate counts, and 46% more specifically for CNOT, and makes the circuit 45% shorter. Combination of the two optimizers is even more effective; a sequential application of Aqcel and t|ket reduces the gate count by 62% (50% for CNOT only) and the depth by 54% with respect to the original circuit. In other words, the combined optimizer is 51% more efficient than the t|ket alone for gate reduction (49% for CNOT only), producing a 50% shorter circuit.
For the Aqcel optimizer, the gate reduction occurs mostly at the stage where the redundant qubit controls are removed. Starting with 1241 gates (excluding barrier and measurement gates),  the first adjacent gate-pair elimination, the redundant qubit control reduction, and the second gate-pair elimination steps remove 132, 510 (41% of the 1241 gates), and 6 gates, respectively. In terms of the computational cost, the wall time is by far dominated by the two adjacent gate-pair elimination steps combined, accounting for 98% of the total time, followed by a sub-dominant contribution of 1% from the redundant qubit control reduction.
Finally, the number of qubits is reduced from 24 to 21 with the Aqcel optimizer, while it is unchanged by t|ket . One qubit is removed from each of the three registers n a , n b , and n φ because those qubits are used only for N evol ≥ 3 branching steps.
3.3.2 Circuit optimization for N evol = 1 branching step using classical simulation The quantum circuit for the two-branching step QPS simulation is still too deep to produce useful results on a real existing quantum computer, even after optimizing the circuit. Therefore, we consider the circuit with only one branching step using the ibmq_sydney and the statevector simulator. The initial state, coupling constants, and considered processes are the same as those used for the N evol = 2 branching steps simulation.
First, we examine the gate and qubit counts for the one-branching step QPS simulation assuming an ideal topology. Starting with 472 gates, the Aqcel optimizer removes 10, 346 (73% of 472 gates), and 2 gates in the three steps of the heuristic optimization, in the order given above. The adjacent gate-pair elimination step still dominates the wall time (97%). However, the redundant qubit control reduction now takes about 3 times less time than that for the two-branching step simulation, which is consistent with the exponential behavior of the computing cost of the step, as discussed in Sec. 2. The number of qubits is reduced from 15 to 13 with the Aqcel optimizer. One of four ancilla qubits is removed because three ancillas are sufficient for decomposing all the multi-controlled gates in the N evol = 1 step. The register n φ , composed of only one qubit, is also removed because it is used only for the case where the initial state is |φ .
Next, the optimized circuits are transpiled considering the qubit connectivity of ibmq_sydney. Figure 7 shows the same set of distributions as in Fig. 6, but for the one-branching step QPS simulation with ibmq_sydney-specific transpilation. The Aqcel optimizer achieves a significant reduction of native gates for the one branching step as well. The relative reduction is more drastic for the one branching step than the two branching steps, mainly because the former (shallow) cir- cuit has relatively more zero-amplitude computational basis states than the latter (deep) circuit.

Circuit optimization for N evol = 1 branching step using quantum measurements
Now we evaluate the performance of the optimizers using a quantum hardware. A particular challenge when employing Aqcel with a real quantum computer is in the determination of the bitstring probabilities of the control qubits at each controlled gate using quantum measurements. Due to hardware noise, the list of observed bitstrings would contain contributions from errors on the preceding gates and the measurement itself.
To mitigate the measurement errors, we obtain the correction by measuring the calibration matrix for the control qubits (with 8192 shots per measurement) using Qiskit Ignis API. The correction is then applied to the observed distribution with a least-squares fitting approach.
The errors incurred by gate imperfection accumulate throughout the circuit execution and degrade the performance. In particular, the CNOT gate error is the most significant source of the degradation. To mitigate the effects from CNOT errors due to depolarizing noise, we employed a zero-noise extrapolation (ZNE) technique with identity insertions, first proposed in Ref. [16] and generalized in Ref. [30]. The Fixed Identity In-sertion Method of Ref. [30] amplifies the CNOT errors by replacing the i-th CNOT gate in a circuit with 2n i + 1 CNOT gates and extrapolating the measurements to the limit of zero error. In the Aqcel protocol with the QPS simulation circuit, each CNOT gate is replaced with 3 CNOT gates (n i = 1).
To account for remaining contributions to the measurements from gate errors, we opt to ignore the observed bitstrings with occurrence below certain thresholds (called cutoff thresholds). This is justified under the assumption that the residual gate errors act as a perturbation, inserting spurious computational basis states with small amplitudes into the superposition of the system.
In order to choose the cutoff thresholds, we consider errors in the single-qubit gates (U 1,2,3 ) and CNOT gates separately for all the hardware qubits. The reported error rates at the time of the experiment, measured during the preceding calibration run of the hardware, are used for the calculations. Let the U 1,2,3 and CNOT error rates be (i) U and (i,j) CX , respectively, with i and j indicating qubits that the gates act on. We can approximate the probabilities, p U and p CX , of measuring the states without any U 1,2,3 or CNOT gate errors occurring anywhere in the circuit by performing qubit-wise (index-dependent) multiplications of the error rates: where n (i) U and n (i,j) CX are the numbers of U 1,2,3 and CNOT gates acting on the corresponding qubits, respectively. The probability p of measuring the states with at least one gate error occurring anywhere in the circuit is In the last approximation, we have assumed that all CNOT errors are equal, much larger than single gate errors but still much smaller than one: Applying the ZNE to mitigate the depolarizing CNOT errors, the p is reduced to p zne : by ignoring the contributions from single-qubit gate errors. The first cutoff threshold is chosen to be This corresponds to making an extreme assumption that any gate error during circuit execution would result in a specific bitstring observed at the measurement, and attempting to discard that bitstring. The second threshold: where m is the number of the measured control qubits, corresponds to another extreme assumption that the gate errors would result in a uniform distribution of all possible bitstrings. The third and final threshold is the average of the above two: s med := (s low + s high )/2.
It should be noted that p zne increases as the circuit execution proceeds, because p zne accounts for the ZNE-mitigated error rates of all the preceding gates in the circuit. As an alternative strategy to these dynamic cutoff thresholds, we also examine the static thresholds, s f , that are kept constant throughout the circuit, with the values between 0.05 and 0.3. We also consider capping the dynamic thresholds of s low , s med and s high at 0.2, with the reason explained later.
Discarding all bitstrings with occurrence under certain thresholds obviously introduces errors of its own. For example, we observe that discarding bitstrings using the unbounded s high as the threshold for the one-branching step QPS simulation circuit results in an elimination of most of the controlled gates in the later part of the circuit, rendering the circuit practically meaningless. Therefore, the actual cutoff threshold of Aqcel should be selected by considering the tradeoff between the efficiency of the circuit optimization and the accuracy of the optimized circuit 4 .  Figure 8 shows the gate counts obtained from Aqcel optimizations using actual measurements on ibmq_sydney under the dynamic cutoff thresholds. The gate counts decrease as the threshold is raised from s low to s high , as expected. Figure 9 shows the same distributions obtained with the static thresholds. Almost no gate survives under the threshold of 0.3, likely implying a significant loss of accuracy for the computation result.
The number of qubits is reduced from 15 to 13 under all the dynamic thresholds. Under the static thresholds, the number of qubits is reduced from 15 to 13 for 0.05 ≤ s f ≤ 0.2, but a significant reduction to 8 is seen for s f = 0.3.
To evaluate the accuracy of the optimized circuit, we consider a classical fidelity of the final state of the circuit, which is defined in terms of the probability distribution of the bitstrings observed in the measurement at the end of the circuit. This quantity, denoted as F and referred to as just "fidelity" hereafter, is given by where the index k runs over the bitstrings. The quantities p orig k and p opt k are the probabilities of observing k in the original and optimized circuits, respectively.
In fact, we compute two fidelity values for each optimization method. The first, denoted F sim , aims to quantify the amount of modifications to the original circuit introduced by the optimization procedure at the algorithmic level. To calculate F sim , both p orig and p opt are computed using the statevector simulation. The value of F sim = 1 indicates that the optimized circuit is identical to the original circuit (up to a possible phase difference on each of the qubits), while a deviation from unity gives a measure of how much the optimization has modified the circuit. The second fidelity value, F meas , is computed using measurements from actual quantum computer for p opt . The p opt is estimated from the rate at which a bitstring occurs in a large number of repeated measurements. The p orig is computed using simulation as for the F sim . Even if the optimized circuit is identical to the original circuit, the presence of noise will mean F meas < 1, with the difference from unity getting larger when more gates (particularly CNOT gates) are present in the circuit. Removing CNOT gates to obtain the optimized circuit will lower the overall effect of noise and raise the F meas value. However, in some cases the CNOT gate removal would affect low-amplitude computational basis states, making the optimized circuit different from the original circuit, hence suppress the F meas value. Thus, the F meas is a measure that reflects the tradeoff of making the circuit shorter and changing the circuit through optimization. Figure 10 shows the fidelity F meas versus the number of CNOT gates before and after optimization, where the optimization is performed using the classical simulation 5 . One can see that shortening the circuit with less CNOT gates increases the F meas as expected. The F sim values stay at unity for all the optimized circuits (not shown), validating that the optimization does not affect the computational accuracy with respect to the original circuit. The measurements are performed 81,920 times for each of the circuit to obtain the F meas values, and measurement error mitigation is not used in these and the following F meas measurements.
When the elimination of redundant qubit controls is performed based on measurements using a quantum computer with the static thresholds s f , the F meas versus CNOT gate counts become those shown in Fig. 11. Also shown in the figure is the correlation between F sim and F meas . We observe that the F meas increases with increasing s f value up to s f = 0.3. However, the F sim stays close to unity up to s f = 0.2 then decreases significantly, 5 The Fmeas value as a function of the number of all gates including U1,2,3 shows the same trend as that in the Fmeas versus the CNOT gate counts. This confirms that the Fmeas value is predominantly determined by CNOT error contributions to the bitstring probabilities of p opt .  Figure 11: Fidelity F meas versus the number of CNOT gates (top) and fidelities F meas versus F sim (bottom) for the one-branching step QPS circuit transpiled considering ibmq_sydney topology before and after optimization. The probabilities of observing various bitstrings in the control qubits are measured using ibmq_sydney in the heuristic optimization step, and the static thresholds of s f are applied. These transpiled circuits are executed on ibmq_sydney to obtain the F meas and a statevector simulator to obtain the F sim .
signaling that the optimized circuit becomes too far from the original circuit with s f > 0.25. For the circuit considered here, the optimization performance therefore appears to be the best with s f ∼ 0.2. The relations between F meas and gate counts have been compared with and without applying the ZNE for the static thresholds. It shows that the F meas improves with ZNE at low s f thresholds below ∼ 0.15, indicating that the accuracy of the optimized circuit improves by discarding spurious low-amplitude basis states with the suppression of CNOT errors. In Fig. 12 we show the results of the optimization with the dynamic thresholds of s high , s med and s low . The results for the capped variants, where the threshold is capped at 0.2, are also shown. The F meas generally improves with higher thresholds, but the F sim gets significantly worse for all the three thresholds without capping. The capped variants leave more gates in the circuit and have lower F meas than the unbounded cases. However, they can restore the computational accuracy, making the F sim values much closer to unity. An exception is the case of s low where the F meas value is unchanged or slightly better with capping.
The results obtained from different approaches for finding nonzero-amplitude basis states and different choices of cutoff thresholds are summarized in Figs. 13 and 14 for comparison. It is worth noting that most of the Aqcel-based optimization shown in the figure improve the F meas value over the t|ket -only optimization. Another interesting finding is that the determination of bitstring probabilities with quantum measurements brings a better gate reduction than the identification of nonzero amplitudes with classical calculation, if the cutoff threshold is set properly (0.2 for this case). A qualitative explanation for this would be that the quantum measurements and the cutoff serve to remove qubit controls over low-amplitude basis states, where such states contribute little to the final probability distributions. An exact identification of nonzeroamplitude computational basis states with classical simulation does not lead to the removal of such qubit controls. In addition, the determina-tion with quantum measurements can suppress the contributions from spurious low-amplitude states due to the existence of hardware noise, making the F meas value comparable to the one from the determination using classical calculation. Figure 14 shows that, with the proper choice of the thresholds, e.g., s f of 0.2 or s low capped at 0.2, one can make F meas comparable to the case with the optimization performed using classical calculation while keeping F sim at unity.

Applicability of proposed heuristic optimization
The core component of the proposed heuristic circuit optimization is the identification of computational basis states with nonzero amplitudes and the subsequent elimination of redundant controlled operations. Therefore, Aqcel is expected to work more efficiently for quantum algorithms in which the quantum state has a small number of high-amplitude computational basis states. In other words, if all the computational basis states have non-negligible amplitudes, Aqcel would not be effective. An example of when Aqcel is not effective is a quantum algorithm where an equal superposition state is first created by applying H ⊗n to the initial |0 ⊗n state of the n-qubit system, such as Quantum Phase Estimation [50] and Grover's Algorithm.

Possibility of further simplifications
For certain quantum circuits, there is a case where there are successive multi-qubit controlled gates acting with the same control qubits. One example is in the QPS simulation circuit (Fig. 4). The circuit determines if an emission happens and which particle radiates or splits, depending on the total counts of particles of each type. These steps (corresponding to the blocks with controlled unitary operations denoted by U (m) e and U h in Fig. 4) require a lot of successive multi-ple controlled operations that share the same control qubits. In this case, if the circuit is expanded by adding an ancilla qubit and the triggering decision of the control qubits is stored into the ancilla qubit, the remaining multi-qubit controlled gates can be controlled by the ancilla. A potential caveat is that adding ancilla qubits might introduce additional SWAP gates when implementing the circuit to hardware. However, since this approach does not depend on the amplitudes of computational basis states of a given circuit state, it is complementary to the Aqcel optimization scheme and will open the possibility of reducing the overall gate counts further.
Another interesting possibility is that if a circuit turns out to contain only a small number of basis states, the circuit state can be represented using fewer qubits than the original ones. Given that this might require a completely new computational basis, this is left for future work.

Implication to hardware implementations of quantum circuits
The techniques introduced in the Aqcel protocol, i.e., identification of most-frequentlyappearing sets of quantum gates as RSGs and the removal of redundant qubit control operations, have implications to hardware implementation of quantum circuits.
First, the RSGs would be a prioritized target for better mapping to quantum hardware. For the QPS algorithm, the RSGs contain multi-qubit controlled gates like the Toffoli gate, as shown in Fig. 5. In this case, these RSGs are further decomposed into collections of native single-and two-qubit gates. Therefore, the depth of the transpiled circuit depends significantly on which hardware qubits the decomposed RSG gates are mapped on to. If the tranpilation algorithm accounts for the frequency of the occurrence of the RSGs, an improved qubit mapping can be created such that frequently-used controlled gates are applied on neighboring qubits with better connectivities on the quantum hardware.
In comparison between the Aqcel and t|ket optimizers (e.g., Figs. 6 and 7), the t|ket performance on the gate reduction turns out to be suboptimal for the QPS algorithm. This is largely due to the lack of ability in t|ket to remove redundant controlled operations through the identification of nonzero-amplitude computational basis states. However, in certain cases, the t|ketoptimized circuit ends up with even more gates than the original circuit, as seen in Fig. 7 (note that the original and t|ket -optimized circuits are both optimized using the noise-adaptive mapping and gate cancellation, see Sec. 3.2). The t|ket optimizes a circuit assuming that all the qubits are connected to each other. This indicates that the circuit optimized with this assumption could result in more SWAP gates once the hardware connectivity is taken into account 6 . This clearly indicates that it is beneficial for removing unnecessary controlled operations as much as possible without the assumption of full qubit connectivity. Moreover, if a circuit is mainly composed of Level 3 RSGs, as in the case of the QPS circuit used here, the hardware quality of control qubits of the RSGs will become crucial for the circuit simplification procedure in the Aqcel protocol.

Conclusion and outlook
We have proposed a new protocol, called Aqcel, for analyzing quantum circuits to identify recurring sets of gates and remove redundant controlled operations. The heart of the redundant controlled operations removal resides in the identification of zero-or low-amplitude computational basis states. In particular, this procedure 6 This transpilation is performed with tools in Qiskit Terra. The t|ket also allows its own circuit compilation to specific hardware, but it is not examined in this study.
can be performed through measurements using a quantum computer in polynomial time, instead of classical calculation that scales exponentially with the number of qubits. Although removing qubit controls triggered in low-amplitude states will produce a circuit that is functionally distinct from the original one, it is observed that this may be a desirable feature in some cases under the existence of hardware noise. If a quantum circuit contains recurring sets of quantum gates, those gates will be considered as candidates for further optimization in terms of both gate synthesis and hardware implementation. In the proposed protocol, the underlying technique to identify recurring gate sets is demonstrated, leading to the possibility of hardware-aware optimization of such gates including dedicated microwave pulse controls.
We have explored the Aqcel optimization scheme using the quantum parton shower simulation, a prototypical quantum algorithm for high-energy physics. For this algorithm, the proposed scheme shows a significant reduction in gate counts with respect to t|ket , which is one of the industry-standard optimization tools, while retaining the accuracy of the probability distributions of the final state.
This feature opens the possibilities to extend this optimization scheme further in future. We have considered several scenarios of the thresholds applied to the measured bitstrings to take into account the gate errors. The measurement error is accounted for using the calibration matrix approach, and this can be improved by adapting the unfolding technique developed in Ref. [4] and related approaches that use fewer resources [20,22,27,60,70] or further mitigate the errors [31]. A substantial contribution to the gate errors originates from CNOT gates. There are a variety of approaches to mitigate these errors, including the zero noise extrapolation mentioned in Sec. 1. The method based on the fixed identify insertion technique has been tested, showing that the circuit optimization improves with lower thresholds to determine the bitstring probabilities. The random identity insertion protocol introduced in Ref. [30] may further reduce the gate count and thus improve the fidelity of our approach. The threshold choice has a large impact to the accuracy of measuring the probability distributions, as in Fig. 14, therefore the precise con-trol of the measurement and gate errors is crucial for this approach.

Acknowledgements
We acknowledge the use of IBM Quantum Services for this work. The views expressed are those of the authors, and do not reflect the official policy or position of IBM or the IBM Quantum team.
CWB and BN are supported by the U.S. Department of Energy, Office of Science under contract DE-AC02-05CH11231. In particular, support comes from Quantum Information Science Enabled Discovery (QuantISED) for High Energy Physics (KA2401032).
We would like to thank Ross Duncan and Bert de Jong for useful discussions about the ZXcalculus.

A Algorithms of graph pattern recognition
The pattern recognition algorithm of recurring set of quantum gates (RSG) is described in Algorithm 2. This algorithm is based on depth-first search with heuristic pruning.
First, RSG candidates are built from seeding a quantum gate (node) by seeking possible combinations of RSGs that have descending connected quantum gates. A target node used as a seed, i.e., the beginning node, is selected with postorder traversal with a memorization technique to avoid a repeating calculation. The computational complexity of the algorithm is O(N nodes !) 7 . Due to a large number of combinations of recurring gates, the complexity is worse than the typical complexity of a classical computer, O(n qubits !) or O(2 n qubits ), because of N nodes = n gates ≥ n qubits in most cases, and therefore it loses the benefit of quantum computer. To reduce the computational complexity, we prune the RSG candidates by requiring the length of the longest path, the minimum number and the maximum number of elements in RSG. The requirement of the minimum number of elements rejects a trivial RSG (e.g. G = {X}). The computational 7 The i-th node has N nodes − i RSG candidates in the worst case. Therefore, the computational complexity of the combination of the number of child-node's RSG candidates is O(N nodes !). complexity reduces to O(N N thr nodes ) 8 where N thr is a threshold value for the pruning, and the classical computer can calculate this in polynomial time when N thr is fixed. However, this algorithm sometimes causes ill-defined RSGs, as shown in Fig. 15. The functionality of the quantum circuit from such an RSG depends on the intermediate gate that is not used in the RSG. These RSGs are rejected in this algorithm by requiring that there is no node, which is both a child and a parent nodes but not an element of the After building the RSG candidates, they are grouped by graph isomorphism using the Weisfeiler Lehman graph hash. The use of graph hash does not ensure that two graphs are isomorphic, but the accuracy is sufficient for our use case. For the Level 1 matching criteria which consider only gate types, we assign the gate type as a node feature and assign nothing for an edge feature. For the Level 2 matching criteria which consider both gate types and qubit roles, we assign the gate type as a node feature and assign the target or control label as an edge feature. For the Level 3 matching criteria which consider gate types, qubit roles and indices, we assign the gate type as a node feature and assign the absolute qubit index as an edge feature.
Finally, the top-k RSGs are selected based on the frequency times the graph size.
Algorithm 2: Gate set pattern recognition with DAG for all quantum gate (node) (g i ) in the circuit (G) do for all subset (G ) beginning with the target node (g i ) do 8 We take N thr RSG candidates from N nodes nodes. Therefore, the computational complexity is N nodes if the longest path is longer than the threshold then continue end if if number of elements in subset is out of thresholds then

B General conditions to eliminate qubit controls
Given a multi-qubit controlled gate C m [U ] and a system in the "undetermined" state |ψ following the classification in Section 2.2.1, we can derive the condition for removal of a part of the controls to be allowed in the following way.
Let x be the number of controls to be removed. Without loss of generality, the decomposition of |ψ can be rewritten as |ψ = i,l,kc i,l,k |i ctl ⊗ |l free ⊗ |k , where |· ctl and |· free are the states of the m − x remaining control qubits and the x qubits from which the controls are removed. From Eq. (1), and thereforec i,l,k = c 2 x i+l,k .
Applying the original controlled gate to |ψ yields  (13) where ket subscripts and the tensor product symbols are omitted for simplicity. In contrast, the new gate with fewer controls gives l,kc i,l,k |i |l |k + l,kc For the removal of x qubit controls to be allowed, the right hand sides of Eqs. (13) and (14) must be identical. This requires Since the cost of exactly computing the complex amplitudes of the quantum state is high, in Aqcel we only consider this second condition: c 2 m−x −1,l,k = 0 ∀l ∈ {0, 1, ..., 2 x − 2}, k. (19) When using quantum measurements to estimate the bitstring probabilities at the control qubits, this requirement corresponds to observing no bitstring with 1 in all control qubits, except when l = 2 x − 1. In other words, there should be no bitstring by which C m [U ] is not triggered but C m−x [U ] is.

C Computational resources for the proposed optimization scheme
The computational cost needed to perform the proposed optimization scheme is evaluated here. We consider a quantum circuit that contains n qubits and N multi-qubit controlled gates, each acting on m control qubits and one target qubit.
The elimination of adjacent gate pairs proceeds, for each gate, by checking a pair-wise matching to the next gate until the end of the gate sequence. Since the gate can act on at most n qubits, the computational cost is O(nN ).
The next step in the optimization scheme is the identification of computational basis states. If we use the classical calculation for simply tracking all the computational basis states whose amplitudes may be nonzero at each point of the circuit without the calculation of the amplitudes, it requires the computation of O(N 2 n ) states, so the resource requirement grows exponentially with n. This method requires less computational resource than a statevector simulation but it neglects certain rare cases where exact combinations of amplitudes lead to the elimination of redundant controlled operations. If we measure the control qubits at each controlled gate M times using a quantum computer, the total number of gate operations and measurements is given by M {m + (1 + m) + (2 + m) + · · · + (N − 1 + m)} = 1 2 M N (N − 1) + mM N . Therefore, the computational cost grows as O(M N 2 + mM N ), i.e., polynomially with n.
We next consider removing redundant qubit controls from a controlled gate with m control qubits. Using a quantum computer that measures the m control qubits M times, the measured number of bitstrings is M if M < 2 m , otherwise 2 m . For the classical calculation, the number of basis states is 2 m . Imagine that we choose an arbitrary combination among 2 m possible combinations of new qubit controls on the same controlled gate. If we want to know whether the chosen combination can act as the correct qubit control, we need to check, for a given measurement done previously with a quantum computer, if all measured bitstrings satisfy Eq. (19). This requires O(m2 m ) checks for one bitstring. Since this has to be checked for all the measurements, the cost is O(M m2 m ) if M < 2 m , otherwise O(m4 m ) for one chosen combination. Therefore, the overall com-putational cost for the determination of redundant qubit controls is O(M m4 m N ) or O(m8 m N ) for N multi-qubit controlled gates, each having 2 m combinations of new qubit controls. The classical calculation requires O(m8 m N ) as well.
It is known that an arbitrary multi-qubit controlled-U gate with m control qubits can be decomposed into O(m) Toffoli and two-qubit controlled-U gates [3]. Therefore, if a controlled gate in the circuit is decomposed in this way, then above computational cost for the redundant qubit controls would become O(mN ). With this decomposition, the total number of gate operations and measurement increases due to O(m) times more controlled gates. However, the computational cost for the identification of computational basis states becomes only 1 2 mM N (mN − 1) + 2mM N , so it still behaves polynomially as O(m 2 M N 2 ) when quantum computer is used. For the classical calculation, the cost becomes O(mN 2 n ).
The final step of the optimization scheme is the elimination of unused qubits. This is performed by simply checking qubits that all the gates in the circuit act on, corresponding to a computational cost of O(nN ).
Given that a controlled gate has at most n − 1 control qubits, the total computational cost for the entire optimization sequence is O(n 2 M N 2 ) or O(nN 2 n ), depending on whether the computational basis state measurement is performed using a quantum computer or a classical calculation.