Advanced Cryptanalysis Techniques for Reduced-Round SHA-256

Abstract

This document explores advanced cryptanalysis techniques applied to reduced-round variants of the SHA-256 hash function. We demonstrate the application of a suite of methods—including differential cryptanalysis with neutral bits, multi-block strategies, boomerang attacks, meet-in-the-middle approaches, higher-order and second-order differentials, GPU acceleration, and quantum-inspired concepts—to find collisions for SHA-256 reduced to 24 rounds. While these techniques significantly reduce the complexity compared to brute force for these reduced versions, they do not threaten the security of the full 64-round SHA-256. Additionally, we present a methodology for full hash reconstruction from internal states of late rounds (46–64), assuming a known input block structure (e.g., all zeros), illustrating the invertibility properties of the SHA-256 compression function under specific conditions. Drawing inspiration from prior work on SHA-0 and SHA-1 [Ref 5, 6], this study adapts established cryptanalytic techniques to SHA-256, providing detailed explanations, optimized Python implementations, and an Appendix defining differential characteristics and constraint equations with illustrative examples. The goal is to offer a practical toolkit and educational resource for understanding the structural properties and potential vulnerabilities of SHA-256 when its round count is reduced, bridging theoretical concepts with hands-on examples.

Keywords: SHA-256, collision attack, cryptanalysis, hash reconstruction, reduced rounds, differential cryptanalysis, multi-block, neutral bits, boomerang attack, meet-in-the-middle, GPU acceleration, quantum cryptography, second-order differential, SHA-2 security, cryptography research, Python cryptography, differential characteristic, constraint equations, example characteristic table

2. Methodology

This section details the methodologies used for the 24-round collision attack demonstrations and the reconstruction from rounds 46–64. The collision attacks employ a range of techniques, aiming to illustrate different approaches to finding colliding pairs for the reduced-round compression function.

Diagram Placeholder: Detailed SHA-256 Round Function. Show inputs (Ai..Hi, Wi, Ki), internal calculations (Σ0, Σ1, Ch, Maj, T1, T2, additions), and outputs (Ai+1..Hi+1). Clearly label the data flow.

2.1 Advanced Collision Attack Techniques for 24-Round SHA-256

The goal is to find two distinct 16-word (512-bit) message blocks, M and M*, such that when processed with a given Initial Vector (IV) through 24 rounds of the SHA-256 compression function, they yield the same output state. `Compress(IV, M) = Compress(IV, M*)`. We explore several techniques:

2.1.1 Differential Cryptanalysis with Neutral Bits

This technique relies on finding a high-probability differential characteristic – a specific input difference (ΔM = M ⊕ M*) that propagates through the rounds to a zero output difference (ΔState₂₄ = 0) with a predictable (though typically low) probability. The core steps are: (See Appendix A for a formal definition of characteristics Ω, constraint equations G, and illustrative examples).

Differential Path: A specific sequence of state differences (ΔA, ..., ΔH) and message word differences (ΔW) is chosen. Table 1 shows an example path segment. Finding a path with the highest possible probability over 24 rounds is key. The overall probability `p` is the product of probabilities at each round where non-linear functions (Ch, Maj, addition modulo 2³²) are involved.
Neutral Bits: Certain bits in the input message words (W[0]–W[15]) can be flipped (individually or together) without affecting the differential path's propagation for a certain number of initial rounds. These are "neutral bits" [Ref 10]. By searching over these bits, we can satisfy the conditions imposed by the differential path (the set G of constraint equations) in later, probabilistic rounds more efficiently.
Adaptive Neutral Bits & Search: Instead of fixing all neutral bits initially, some can be adjusted based on intermediate state differences observed during the computation (e.g., after round 14). This adaptive approach aims to correct deviations from the desired path. The search involves trying different combinations of neutral bits until a pair (M, M*) conforms to the entire 24-round characteristic.
Complexity: Time complexity is roughly O(1/p), where p is the characteristic's probability. Neutral bits (k bits) allow testing 2^k message pairs potentially satisfying the early rounds, aiming to find one that satisfies the full path. Effective complexity might be closer to O(2^c / p), where 'c' represents the number of constraints not covered by neutral bits. Memory complexity is typically low, O(1).
Early Abort: If the state differences deviate too much from the target path early on (e.g., at round 10), the computation for that candidate pair is aborted, saving computational effort.
Multi-Pass Correction: Fine-tuning message words (e.g., W[17]–W[19], derived via the message schedule) based on state differences observed near the end (rounds 21–23) to precisely cancel out remaining differences.

2.1.2 Multi-Block Strategy

Inspired by attacks on SHA-1 [Ref 5, Ch. 7], this approach uses two (or more) message blocks to construct a collision:

First Block (Near-Collision): Find a pair (M1, M1*) such that `Compress(IV, M1)` and `Compress(IV, M1*)` result in intermediate hash values (H1, H1*) that differ by a specific, carefully chosen ΔH1 (a near-collision). Let the probability of the differential path for this block be p1.
Second Block (Correction): Find a second message block M2 (or a pair M2, M2*) such that `Compress(H1, M2)` and `Compress(H1*, M2)` produce the same final hash value. The differential path used in this second block (probability p2) is specifically designed to cancel out the difference ΔH1 introduced by the first block.
Complexity: The overall time complexity depends on the complexities of finding the near-collision and the correction, roughly related to O(1/p1 + 1/p2) or O(max(1/p1, 1/p2)) depending on the attack structure. Memory is typically O(1). Can be advantageous if finding two shorter, higher-probability paths is easier than one long low-probability path.

2.1.3 Boomerang Attack

The boomerang attack [Ref 3] splits the 24 rounds into two sub-ciphers (e.g., rounds 0–11 (E1) and 12–23 (E2)). It uses two independent differential characteristics:

Upper Path (E1): Rounds 0–11, with input difference α leading to output difference β with probability p.
Lower Path (E2): Rounds 12–23, with input difference γ leading to output difference δ with probability q.
Connection (Boomerang Quartet): By constructing four specific inputs (M, M', M*, M'*) related by differences α and γ at appropriate points, one can achieve a collision (output difference 0) after 24 rounds with probability related to p²q². This requires finding quartets satisfying conditions at the midpoint.
Complexity: Time complexity is roughly proportional to O(1/(p*q)) assuming the intermediate conditions for the boomerang connection hold well. Memory is typically O(1). Can be more efficient than a direct differential attack if p and q are significantly higher than the probability of a single 24-round path.

Diagram Placeholder: Boomerang Attack structure. Show four computations (P -> P', P* -> P'*) starting with differences α and γ. Show E1 (α -> β, prob p) and E2 (γ -> δ, prob q) paths. Indicate the intermediate differences β and δ, and how they ideally cancel out at the end.

2.1.4 Higher-Order Differentials

Instead of tracking the propagation of a first-order difference (ΔX = X ⊕ X*), this technique [Ref 4] analyzes the propagation of second-order (or higher) differences (Δ²X = X ⊕ X* ⊕ X'' ⊕ X''' ).

Exploiting Non-linearity: Higher-order differentials can sometimes have predictable propagation properties (e.g., summing to zero) even when first-order differences behave complexly, potentially bypassing difficulties caused by the algorithm's non-linear components over certain rounds.
Complexity: Often involves finding specific input patterns satisfying the higher-order differential conditions, which can be complex. The practical efficiency depends heavily on the specific properties of the target function and the order of the differential. Complexity is hard to generalize but often involves solving systems of equations. Memory usage varies.

2.1.5 Meet-in-the-Middle (MitM)

This classic technique splits the computation into forward and backward parts:

Forward Computation: Compute states from round 0 up to an intermediate round (e.g., round 12) for many different starting messages or message modifications. Store the resulting intermediate states (State₁₂) and the corresponding inputs/modifications in a table.
Backward Computation: Starting from a target final state (e.g., aiming for a collision, so the target difference is zero), compute backward from round 23 to the intermediate round (round 12) for many message modifications. This requires inverting the round function, which is possible but potentially complex.
Matching: Look for matches between the states computed in the forward and backward passes at the intermediate round (round 12). A match indicates a potential full path.
Complexity: For N rounds split at N/2, the time complexity is roughly O(2^N/2) computations (performing 2^N/2 forward and 2^N/2 backward steps). Requires O(2^N/2) memory to store intermediate states from the forward pass. This represents a significant time-memory trade-off. For 24 rounds, this is O(2¹²) time and memory.

Diagram Placeholder: Meet-in-the-Middle Attack structure. Show computation starting from IV (forward) to Round N/2, storing states in a table. Show computation starting from target output (backward, inverting rounds) to Round N/2. Indicate the matching process at Round N/2.

2.1.6 GPU Acceleration

Many cryptanalytic searches are highly parallelizable. GPUs, with thousands of cores, can significantly accelerate the search process:

Kernel Design: A GPU kernel (a function running on the GPU) can be designed to perform one or a few SHA-256 compression attempts (e.g., testing one set of neutral bits).
Parallel Execution: Launch thousands or millions of these kernels concurrently, exploring a vast portion of the search space much faster than a CPU.
Optimization: Requires careful memory management (global vs. shared memory) and workload balancing to achieve optimal speedup.
Complexity Impact: Does not change the theoretical complexity (e.g., O(1/p) or O(2^k)) but drastically reduces the wall-clock time by a large constant factor (e.g., 10x-1000x depending on the algorithm, implementation, and hardware), making previously infeasible searches practical. Memory usage depends on the kernel design.

2.1.7 Quantum-Inspired Optimization (Conceptual)

While practical large-scale quantum computers are not yet available, Grover's algorithm offers a theoretical quadratic speedup for unstructured search problems.

Grover's Algorithm: If checking a potential solution takes T time, Grover's algorithm could theoretically find a valid solution in O(√N * T) time on a quantum computer, where N is the size of the search space (e.g., N = 2^k for a k-bit search).
Oracle Function: Requires defining a quantum "oracle" that can recognize a correct collision (e.g., checks if Hash(M) == Hash(M*)).
Classical Simulation: The Python code simulates the idea by reducing the iteration count, but this is purely illustrative and does not achieve a real quantum speedup on classical hardware. It serves as a placeholder for potential future quantum cryptanalysis.
Theoretical Complexity: For a search space requiring N classical checks (e.g., N ≈ 1/p for differential), the theoretical quantum time complexity would be O(√N). Memory requirements depend on the specific quantum algorithm implementation.

2.1.8 Second-Order Differential Technique (Refinement)

This refines differential attacks by considering not just the first-order difference (ΔM) but also second-order properties related to how differences in the message affect differences in internal state variables [Ref 5, Ch. 9].

Enhanced Message Selection: By imposing conditions based on second-order differentials, one can potentially increase the probability that a chosen message pair will conform to the desired first-order differential characteristic, especially through the non-linear functions.
Equations Set: These second-order conditions can sometimes be expressed as a system of linear or non-linear equations over GF(2), allowing for more structured ways to find suitable message modifications. (Related to the set G described in Appendix A).
Complexity: Can potentially reduce the search complexity compared to basic differential attacks if the second-order conditions effectively prune the search space, but the analysis and setup are more complex.
Integration: Often combined with neutral bits; neutral bits control early rounds, while second-order techniques help satisfy conditions in later, more complex rounds.

2.1.9 Common Optimizations Across Techniques

These software optimizations improve the performance of the underlying SHA-256 computations:

Bit-Slicing: Performing operations on multiple states simultaneously by packing corresponding bits from different states into single machine words (e.g., 32 or 64 states processed in parallel using bitwise operations), simulating SIMD.
Cache Optimization: Arranging data structures (like state variables and message schedules) in memory to maximize cache locality, reducing time spent waiting for data from main memory.
Precomputed Tables: Storing the results of SHA-256's non-linear functions (Ch, Maj, Sigma0, Sigma1) for common inputs in lookup tables. This trades memory usage for faster computation per round, but table sizes can become large (e.g., 2³² entries for full 32-bit inputs is infeasible, so often used for smaller input ranges or byte-wise components).
Multi-Threading: Dividing the search space (e.g., different neutral bit combinations) across multiple CPU cores to achieve parallel speedup on modern processors.

Table 1: Example Differential Path Segment for 24-Round Collision Attempt (Illustrative)
Round (i)	Input ΔW[i] (XOR)	Output State Δ (XOR) after Round i (Conceptual)	Probability Factor	Notes
0–11	0 (mostly)	(ΔA≈0, ..., ΔH≈0)	≈1	Controlled using message freedom / neutral bits. Small differences might exist.
12	e.g., 00008000	(ΔA=..., ΔB=..., ..., ΔE=..., ...)	P₁₂ < 1	Difference introduced via ΔW[12]. Propagation depends on Ch, Maj, Add values.
...	...	...	...	Difference propagates, probability decreases with each probabilistic step.
17	e.g., ΔW[17] derived from ΔW[1], ΔW[2], ΔW[10] via schedule	(ΔA=..., ..., ΔH=...)	P₁₇ < 1	Message schedule propagation interacts with state diff. Conditions apply.
...	...	...	...	Corrections might be needed via W modifications (neutral bits).
23	ΔW[23]	(ΔA=0, ΔB=0, ..., ΔH=0)	P₂₃ < 1	Target: Zero difference in state before final DM add. Conditions apply.
Total (0-23)	ΔM (Implied by ΔW[0..15])	(ΔA=0, ..., ΔH=0)	P_total = Π P_i ≪ 1	*Goal: Find M, M s.t. M⊕M=ΔM and path is followed by satisfying constraints G.*

2.2 Reconstruction from Rounds 46–64

This methodology demonstrates reversing the SHA-256 compression function from known internal states of the final rounds back to the initial state (IV) and the input message block, under specific assumptions.

Assumption: We assume the internal states (A, B, C, D, E, F, G, H) are known for rounds 46 through 64 (specifically, the state *after* each of these rounds is completed), and crucially, that these states resulted from processing a known block structure, such as an all-zero block (W[0]..W[15]=0) with standard padding applied, using the standard SHA-256 IV.
Reversal Logic: The core idea is to invert the round update equations. Knowing the state (A_i+1..H_i+1) after round `i`, we can calculate the state (A_i..H_i) before round `i`.
- A_i+1 = T1 + T2
- E_i+1 = D_i + T1
- Where T1 = H_i + Σ1(E_i) + Ch(E_i, F_i, G_i) + K_i + W_i
- And T2 = Σ0(A_i) + Maj(A_i, B_i, C_i)
- And (A_i+1..H_i+1) relates to (A_i..H_i) via shifts: B_i+1=A_i, C_i+1=B_i, D_i+1=C_i, F_i+1=E_i, G_i+1=F_i, H_i+1=G_i.
By rearranging these, knowing the state at round `i+1`, K_i, and W_i (which we also determine), we can solve for the state at round `i`.
Message Schedule Inversion: To find W_i for i < 16, we need to reverse the message schedule expansion: W[i] = W[i+16] - σ1(W[i+14]) - W[i+9] - σ0(W[i+1]). This requires knowing W values from later rounds. The reconstruction starts from round 63 backwards. W[i] for i=0..63 can be derived during the backward state computation by isolating W[i] in the T1 calculation. For i < 16, the derived W[i] must match the assumed input block structure (e.g., zero).
Full State Propagation: Using the known states from rounds 46–64 provides sufficient overlap to uniquely determine the earlier states and message words back to round 0 under the stated assumptions.
Consistency Checks: The reconstructed W[0]–W[15] must match the assumed input block (e.g., all zeros). The reconstructed state at round 0 (A₀..H₀) must match the IV used. This validates the process.
Limitation: Reconstructing an unknown IV and unknown arbitrary input block from only late-round states is significantly harder due to the non-linearities and the mixing in the message schedule. It becomes an underdetermined system of complex equations, potentially requiring techniques like SAT solvers or algebraic cryptanalysis, with high computational cost.

Table 2: Example States from Rounds 46–64 (Resulting from Standard IV and All-Zero Input Block) - *Placeholder Data*
Round	A	B	C	D	E	F	G	H
46 (State after)	C366490E	xxxxxxxx	xxxxxxxx	xxxxxxxx	8F8AD208	xxxxxxxx	xxxxxxxx	xxxxxxxx
...	...	...	...	...	...	...	...	...
63 (State after)	xxxxxxxx	xxxxxxxx	xxxxxxxx	xxxxxxxx	xxxxxxxx	xxxxxxxx	xxxxxxxx	xxxxxxxx
64 (State before final add)	A₆₄	B₆₄	C₆₄	D₆₄	E₆₄	F₆₄	G₆₄	H₆₄

The reconstruction demonstrates the deterministic nature of SHA-256's backward computation given sufficient known state and input structure, providing a tool for internal analysis.

3. Implementation

This section provides Python code examples illustrating the concepts discussed. These implementations prioritize clarity and demonstrate the core logic of each technique. They are optimized for readability but include notes on further performance enhancements.

Note: For brevity, some helper functions or full SHA-256 implementations might be assumed or simplified. Error handling is minimal.

3.1 Core SHA-256 Functions and Precomputation

Fundamental SHA-256 operations. Precomputation tables can speed up calculations but require memory.


import sys
from threading import Thread
from queue import Queue
import random # Used for random generation where needed
import math # Used for sqrt in quantum sim
import traceback # For printing exceptions

# --- Core SHA-256 Bitwise Operations ---
# (Ensure these are correct as per FIPS 180-4)
def Sigma0(x): return ((x >> 2) | (x << 30)) ^ ((x >> 13) | (x << 19)) ^ ((x >> 22) | (x << 10)) & 0xFFFFFFFF
def Sigma1(x): return ((x >> 6) | (x << 26)) ^ ((x >> 11) | (x << 21)) ^ ((x >> 25) | (x << 7)) & 0xFFFFFFFF
def sigma0(x): return ((x >> 7) | (x << 25)) ^ ((x >> 18) | (x << 14)) ^ (x >> 3) & 0xFFFFFFFF
def sigma1(x): return ((x >> 17) | (x << 15)) ^ ((x >> 19) | (x << 13)) ^ (x >> 10) & 0xFFFFFFFF
def Ch(x, y, z): return ((x & y) ^ (~x & z)) & 0xFFFFFFFF
def Maj(x, y, z): return ((x & y) ^ (x & z) ^ (y & z)) & 0xFFFFFFFF
def add32(*args): # Ensure 32-bit addition
    res = 0
    for x in args:
        res = (res + x) & 0xFFFFFFFF
    return res

# --- SHA-256 Constants ---
K = [0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
     0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
     0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
     0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
     0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
     0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
     0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
     0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2]

IV_STD = [0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A, 0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19]

# --- Precomputation (Example - Limited Scope for Practicality) ---
# Precomputing for full 32-bit inputs is infeasible (4GB per table).
# Often done for byte-wise components or smaller fixed ranges if applicable.
# For this example, we'll skip large tables, but note their potential.
# SIGMA0_TABLE = {x: Sigma0(x) for x in range(2**8)} # Example: Only for 8-bit inputs
# CH_TABLE = ... # Similarly limited or byte-wise

# --- SHA-256 Compression Function (Helper) ---
def sha256_compress(state_in, W_block, num_rounds=64):
    """Computes SHA-256 compression for num_rounds."""
    if len(W_block) < 16:
        raise ValueError("W_block must have at least 16 words")

    W = list(W_block[:16]) # Take the first 16 words
    # Expand message schedule only up to num_rounds needed
    max_w_idx = max(16, num_rounds) # Determine max index needed for W
    for i in range(16, max_w_idx):
        # Check if index exists before accessing
        # Indices required: i-16, i-15, i-7, i-2
        if (i-16) < 0 or (i-15) < 0 or (i-7) < 0 or (i-2) < 0:
             # Should not happen for i >= 16, but good practice
             raise IndexError(f"Invalid index access during W expansion at i={i}")

        s0 = sigma0(W[i-15])
        s1 = sigma1(W[i-2])
        w7 = W[i-7]
        w16 = W[i-16]
        W.append(add32(w16, s0, w7, s1))

    a, b, c, d, e, f, g, h = state_in
    round_states = [(a, b, c, d, e, f, g, h)] # Store initial state (state BEFORE round 0)

    for i in range(num_rounds):
        # Ensure W[i] exists; needed up to num_rounds-1
        if i >= len(W):
             # This indicates W expansion didn't go far enough
             raise IndexError(f"W[{i}] accessed but not generated (max W index {len(W)-1}, needed {num_rounds-1})")

        T1 = add32(h, Sigma1(e), Ch(e, f, g), K[i], W[i])
        T2 = add32(Sigma0(a), Maj(a, b, c))
        h = g
        g = f
        f = e
        e = add32(d, T1)
        d = c
        c = b
        b = a
        a = add32(T1, T2)
        round_states.append((a, b, c, d, e, f, g, h)) # Store state AFTER round i computation

    # The DAVIES-MEYER step: Add input state to the final state
    state_out = [
        add32(state_in[0], a), add32(state_in[1], b), add32(state_in[2], c), add32(state_in[3], d),
        add32(state_in[4], e), add32(state_in[5], f), add32(state_in[6], g), add32(state_in[7], h)
    ]
    # Return the final hash output AND the list of all intermediate states (including state before final add)
    # round_states[0] = state before round 0 (IV)
    # round_states[1] = state after round 0
    # ...
    # round_states[num_rounds] = state after round num_rounds-1 (state before final DM add)
    return state_out, round_states

These functions form the basis for the attack implementations.

3.2 Optimized Differential Collision Attack with Neutral Bits (24 Rounds)

This code demonstrates finding a collision for 24 rounds using a differential path and neutral bits. The specific path and neutral bit positions are crucial and derived from detailed analysis (not fully shown here). (See Appendix A for background).


# --- Differential Attack Configuration ---
NUM_ROUNDS_DIFFERENTIAL = 24
# Increased attempts significantly, but still likely insufficient for guaranteed collision
# without a known high-probability characteristic. Real attacks might need >> 2^40 attempts.
MAX_ATTEMPTS_DIFFERENTIAL = 2**20 # Example: ~1 Million attempts
NUM_THREADS_DIFFERENTIAL = 8 # Adjust based on CPU cores
EARLY_ABORT_ROUND = 10
# EARLY_ABORT_THRESHOLD = 0x3FF # Example threshold for state difference deviation (not implemented in check_differential_path below)

# Example Neutral Bit configuration (positions: (word_index, bit_index))
# These would be carefully chosen based on the differential path analysis.
# A larger set of neutral bits allows exploring more message modifications.
NEUTRAL_BITS_CONFIG = [(0, 31), (1, 30), (2, 29), (3, 31), (4, 30), (5, 29), (6, 28), (7, 27),
                       (8, 31), (9, 30), (10, 29), (11, 28), (12, 27), (2, 30), (3, 29), (4, 28),
                       (0, 15), (1, 14), (2, 13), (3, 12)] # Example: 20 neutral bits

def apply_neutral_bits_differential(W_base, attempt_num):
    """Applies neutral bit flips based on the attempt number."""
    W_modified = list(W_base)
    num_neutral_bits = len(NEUTRAL_BITS_CONFIG)
    for i in range(num_neutral_bits):
        if (attempt_num >> i) & 1: # Check the i-th bit of the attempt number
            word_idx, bit_pos = NEUTRAL_BITS_CONFIG[i]
            if word_idx < len(W_modified):
                W_modified[word_idx] ^= (1 << bit_pos)
    return W_modified

def check_differential_path(states_M, states_M_star, target_diff_func):
    """Checks if the state differences roughly follow a target path (simplified placeholder)."""
    # A real implementation would compare XOR differences (state_M[r] ^ state_M_star[r])
    # against the expected differences from the chosen characteristic Ω at each round r.
    # It would return False if deviations exceed thresholds, especially at check rounds.
    # Example check for early abort (using state AFTER round EARLY_ABORT_ROUND-1):
    if EARLY_ABORT_ROUND < len(states_M) and EARLY_ABORT_ROUND < len(states_M_star):
         state_m_abort = states_M[EARLY_ABORT_ROUND]
         state_m_star_abort = states_M_star[EARLY_ABORT_ROUND]
         # Example: Check if difference in A register deviates too much (simplified check)
         # diff_A = state_m_abort[0] ^ state_m_star_abort[0]
         # expected_diff_A = get_expected_diff_from_characteristic(EARLY_ABORT_ROUND, 'A') # Hypothetical function
         # if diff_A != expected_diff_A: return False
         pass # Placeholder for actual path conformance and early abort logic

    # Placeholder: Assume path is followed if no early abort triggered
    return True

def differential_worker(start_attempt, end_attempt, IV, M_base, M_star_base, result_queue):
    """Worker thread for differential collision search."""
    # print(f"Thread searching attempts {start_attempt} to {end_attempt-1}") # Verbose
    for attempt in range(start_attempt, end_attempt):
        # Apply neutral bits based on the current attempt number
        W = apply_neutral_bits_differential(M_base, attempt)
        W_star = apply_neutral_bits_differential(M_star_base, attempt)

        # Compute 24 rounds for both messages
        # We compare the state *before* the final Davies-Meyer addition for collisions
        _, states_M = sha256_compress(IV, W, NUM_ROUNDS_DIFFERENTIAL)
        _, states_M_star = sha256_compress(IV, W_star, NUM_ROUNDS_DIFFERENTIAL)

        # Get the state *after* the last round (index NUM_ROUNDS_DIFFERENTIAL)
        # This is the state before the final DM add
        H_M_final_state = states_M[NUM_ROUNDS_DIFFERENTIAL]
        H_M_star_final_state = states_M_star[NUM_ROUNDS_DIFFERENTIAL]

        # Optional: Early abort based on path deviation
        # Pass intermediate states list for this check
        if not check_differential_path(states_M, states_M_star, None): # Pass states list
           continue # Skip if path deviates too much early on

        # Check for collision in the state *before* the final Davies-Meyer addition
        if H_M_final_state == H_M_star_final_state:
            # print(f"\nCollision found by thread at attempt {attempt}!") # Less verbose
            # Also calculate the actual hash output after DM add for reporting
            final_hash_M, _ = sha256_compress(IV, W, NUM_ROUNDS_DIFFERENTIAL)
            result_queue.put((attempt, W, W_star, final_hash_M)) # Return final hash output
            return # Found a collision, exit thread

    # print(f"Thread finished attempts {start_attempt} to {end_attempt-1}") # Verbose


def find_differential_collision_24():
    """Sets up and runs the differential collision search."""
    print(f"Starting {NUM_ROUNDS_DIFFERENTIAL}-round differential collision search...")
    IV = IV_STD

    # Base messages M and M* differing by the input difference of the characteristic.
    # These values depend heavily on the chosen differential path (Ω).
    # Example: Assume difference is introduced in W[12] based on Table 1 example.
    M_base = [0] * 16 # Example base message (could be non-zero)
    M_star_base = list(M_base)
    diff_word_idx = 12
    input_difference = 0x00008000 # Example difference from Table 1
    if diff_word_idx < 16:
        M_star_base[diff_word_idx] = M_base[diff_word_idx] ^ input_difference
    else:
         print("Warning: Differential input difference index out of bounds for base message.")
         # Need a more complex setup if difference is in expanded words

    print(f"Base M (first 16 words): {[f'0x{w:08x}' for w in M_base]}")
    print(f"Base M* (first 16 words): {[f'0x{w:08x}' for w in M_star_base]}")
    print(f"Using {len(NEUTRAL_BITS_CONFIG)} neutral bits.")
    print(f"Searching up to {MAX_ATTEMPTS_DIFFERENTIAL} attempts using {NUM_THREADS_DIFFERENTIAL} threads.")
    print(f"(Note: Finding collisions reliably requires specific high-probability paths and potentially many more attempts)")

    result_queue = Queue()
    threads = []
    # Calculate attempts per thread, ensuring coverage
    attempts_per_thread = (MAX_ATTEMPTS_DIFFERENTIAL + NUM_THREADS_DIFFERENTIAL - 1) // NUM_THREADS_DIFFERENTIAL

    for i in range(NUM_THREADS_DIFFERENTIAL):
        start = i * attempts_per_thread
        end = min(start + attempts_per_thread, MAX_ATTEMPTS_DIFFERENTIAL)
        if start >= MAX_ATTEMPTS_DIFFERENTIAL: break # Don't start threads beyond the limit

        thread = Thread(target=differential_worker, args=(start, end, IV, M_base, M_star_base, result_queue))
        threads.append(thread)
        thread.start()

    collision_found = False
    results = []
    for t in threads:
        t.join() # Wait for all threads to complete

    while not result_queue.empty():
        results.append(result_queue.get())
        collision_found = True

    if collision_found:
        attempt, W, W_star, H_M = results[0] # Show the first collision found
        print("\n--- Differential Collision Found ---")
        print(f"Found after approximately {attempt + 1} attempts (specific attempt in its thread's range).")
        print(f"M (first 16 words): {[f'0x{w:08x}' for w in W[:16]]}")
        print(f"M* (first 16 words): {[f'0x{w:08x}' for w in W_star[:16]]}")
        # The collision occurs in the state *before* the final DM addition.
        # H_M reported here is the final hash *output* after the DM addition.
        print(f"Resulting Hash Output (after {NUM_ROUNDS_DIFFERENTIAL} rounds + DM add): {[f'0x{h:08x}' for h in H_M]}")
        # Verification (optional but good)
        H_M_star, _ = sha256_compress(IV, W_star, NUM_ROUNDS_DIFFERENTIAL)
        print(f"Verification Hash M*: {[f'0x{h:08x}' for h in H_M_star]}")
        print(f"Verification Match: {H_M == H_M_star}")

    else:
        print(f"\nNo differential collision found within {MAX_ATTEMPTS_DIFFERENTIAL} attempts.")
        print("Consider increasing attempts, using more threads, or finding a better differential path/neutral bits.")

# Example Usage:
# find_differential_collision_24()

This code structure allows parallel searching over neutral bit combinations. The actual success and speed depend critically on the chosen differential path's probability and the effectiveness of the neutral bits.

3.3 Multi-Block Collision Attack (Conceptual Code)


# Conceptual: Requires finding suitable near-collision difference and correction path
NUM_ROUNDS_MULTIBLOCK = 24 # Per block

def find_near_collision_block1(IV, target_difference_H1_state):
    """Searches for M1, M1* such that state_diff(Compress(IV, M1), Compress(IV, M1*)) == target_difference_H1_state.
       Comparison is on the state *before* Davies-Meyer final add."""
    print("Searching for near-collision in Block 1...")
    # This requires a dedicated search, likely using differential techniques over NUM_ROUNDS_MULTIBLOCK
    # Placeholder: Simulate finding a pair after some effort
    attempts = 0
    max_attempts = 10000 # Increased placeholder attempts
    while attempts < max_attempts:
        # In reality, M1, M1* would be constructed based on a differential path
        M1 = [random.randint(0, 0xFFFFFFFF) for _ in range(16)]
        M1_star = list(M1)
        # Apply a difference based on the path leading to target_difference_H1_state
        M1_star[0] ^= 0x1 # Minimal difference example for placeholder

        _, states_M1 = sha256_compress(IV, M1, NUM_ROUNDS_MULTIBLOCK)
        _, states_M1_star = sha256_compress(IV, M1_star, NUM_ROUNDS_MULTIBLOCK)

        H1_state = states_M1[NUM_ROUNDS_MULTIBLOCK]
        H1_star_state = states_M1_star[NUM_ROUNDS_MULTIBLOCK]

        current_diff = [(h1 ^ h1s) for h1, h1s in zip(H1_state, H1_star_state)]
        if current_diff == target_difference_H1_state:
             print(f"Block 1 near-collision found after {attempts+1} attempts (simulated).")
             # Calculate the actual hash outputs H1_out = IV + H1_state, H1_star_out = IV + H1_star_state
             H1_out = [add32(iv, h) for iv, h in zip(IV, H1_state)]
             H1_star_out = [add32(iv, h) for iv, h in zip(IV, H1_star_state)]
             return M1, M1_star, H1_out, H1_star_out # Return messages and hash *outputs*
        attempts += 1

    print(f"Block 1 near-collision search failed after {max_attempts} attempts.")
    return None, None, None, None


def find_correction_block2(H1_output, H1_star_output):
    """Searches for M2 such that Compress(H1_output, M2) == Compress(H1_star_output, M2)."""
    # H1_output and H1_star_output are the IVs for the second block compression
    print("Searching for correction in Block 2...")
    # This uses a differential path designed to cancel the difference between H1_output and H1_star_output
    # Placeholder implementation: find single M2 by random search
    attempts = 0
    max_attempts = 10000 # Increased placeholder attempts
    while attempts < max_attempts:
        M2 = [random.randint(0, 0xFFFFFFFF) for _ in range(16)]
        # Compute results using the *different* initial vectors H1_output, H1_star_output
        final_hash_1, _ = sha256_compress(H1_output, M2, NUM_ROUNDS_MULTIBLOCK)
        final_hash_2, _ = sha256_compress(H1_star_output, M2, NUM_ROUNDS_MULTIBLOCK)

        # Check if the final hash *outputs* are the same
        if final_hash_1 == final_hash_2:
             print(f"Block 2 correction found after {attempts+1} attempts (simulated).")
             # In a real attack, might find M2, M2* s.t. Compress(H1_output,M2)=Compress(H1_star_output,M2*)
             # Here we simplify and assume M2*=M2
             return M2, M2, final_hash_1
        attempts += 1

    print(f"Block 2 correction search failed after {max_attempts} attempts.")
    return None, None, None


def find_multi_block_collision():
    """Demonstrates the multi-block collision concept."""
    print(f"Starting {NUM_ROUNDS_MULTIBLOCK}-round multi-block collision search (conceptual)...")
    IV = IV_STD
    # Target difference for the near-collision *state* (must be chosen carefully)
    # Example: Target a difference of 1 in the LSB of A register state before DM add
    target_H1_state_diff = [0] * 8
    target_H1_state_diff[0] = 0x00000001 # Example difference in A state

    M1, M1_star, H1_output, H1_star_output = find_near_collision_block1(IV, target_H1_state_diff)

    if M1 is None:
        print("Failed to find suitable near-collision for Block 1.")
        return

    print(f"Block 1 Output H1: {[f'0x{h:08x}' for h in H1_output]}")
    print(f"Block 1 Output H1*: {[f'0x{h:08x}' for h in H1_star_output]}")
    print(f"Block 1 Output Difference:   {[f'0x{h1 ^ h1s:08x}' for h1, h1s in zip(H1_output, H1_star_output)]}")

    M2, M2_star, H2_final = find_correction_block2(H1_output, H1_star_output)

    if H2_final is not None:
        print("\n--- Multi-Block Collision Found (Conceptual) ---")
        print(f"Message Pair 1: M1 || M2")
        print(f"  M1: {[f'0x{w:08x}' for w in M1]}")
        print(f"  M2: {[f'0x{w:08x}' for w in M2]}")
        print(f"Message Pair 2: M1* || M2*")
        print(f"  M1*: {[f'0x{w:08x}' for w in M1_star]}")
        print(f"  M2*: {[f'0x{w:08x}' for w in M2_star]}") # M2* might be different from M2 in a real attack
        print(f"Final Hash Output (after {NUM_ROUNDS_MULTIBLOCK} + {NUM_ROUNDS_MULTIBLOCK} rounds): {[f'0x{h:08x}' for h in H2_final]}")

        # Verification (Optional but recommended)
        # Recompute full hash for M1||M2 and M1*||M2*
        iv1_check, _ = sha256_compress(IV, M1, NUM_ROUNDS_MULTIBLOCK)
        final1_check, _ = sha256_compress(iv1_check, M2, NUM_ROUNDS_MULTIBLOCK)

        iv2_check, _ = sha256_compress(IV, M1_star, NUM_ROUNDS_MULTIBLOCK)
        final2_check, _ = sha256_compress(iv2_check, M2_star, NUM_ROUNDS_MULTIBLOCK)

        print(f"Verification: Hash(M1||M2) = {[f'0x{h:08x}' for h in final1_check]}")
        print(f"Verification: Hash(M1*||M2*) = {[f'0x{h:08x}' for h in final2_check]}")
        print(f"Verification Match: {final1_check == final2_check}")
        if not (final1_check == final2_check == H2_final):
             print("ERROR: Verification failed!")

    else:
        print("\nFailed to find correction block M2.")

# find_multi_block_collision() # Example call

3.7 GPU-Accelerated Collision (Pseudo-Code)

Illustrates the concept using CUDA C/C++ syntax. Requires NVCC compiler and CUDA toolkit. This is **pseudo-code** and requires implementation of GPU-specific functions and memory management.


#include <stdint.h> // For uint32_t
#include <stdio.h>  // For printf
#include <stdlib.h> // For malloc/free
#include <cuda_runtime.h> // For CUDA API calls

// --- Assumed GPU Helper Functions (Must be implemented in CUDA C) ---
// These functions operate on 32-bit unsigned integers (uint32_t)
// __device__ uint32_t Sigma0_gpu(uint32_t x);
// __device__ uint32_t Sigma1_gpu(uint32_t x);
// __device__ uint32_t sigma0_gpu(uint32_t x);
// __device__ uint32_t sigma1_gpu(uint32_t x);
// __device__ uint32_t Ch_gpu(uint32_t x, uint32_t y, uint32_t z);
// __device__ uint32_t Maj_gpu(uint32_t x, uint32_t y, uint32_t z);
// __device__ uint32_t add32_gpu(uint32_t a, uint32_t b, ...); // Variadic or fixed args
// __device__ void apply_neutral_bits_gpu(const uint32_t* M_base, uint32_t* W_thread, int attempt_idx, const int* neutral_config, int num_neutral);
// __constant__ uint32_t K_gpu[64]; // SHA-256 round constants in constant memory

// CUDA Kernel for SHA-256 24-round computation (Simplified Pseudo-Code)
__global__ void sha256_24_round_kernel(
    const uint32_t* initial_IV, // Device pointer to IV
    const uint32_t* M_base,     // Device pointer to Base message W[0..15]
    const uint32_t* M_star_base,// Device pointer to Base message W*[0..15]
    uint32_t* output_states,    // Device pointer to output array [State_M_0..7, State_M*_0..7, ...]
    int total_attempts_in_launch, // Total attempts across all blocks in this launch
    const int* d_neutral_config, // Device pointer to neutral bit config array (pairs of word/bit index)
    int num_neutral_bits)       // Number of neutral bits
{
    int global_attempt_idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Bounds check for threads exceeding total attempts
    if (global_attempt_idx >= total_attempts_in_launch) return;

    // Use registers for W schedule - more efficient than shared memory if fits
    uint32_t W[24];
    uint32_t W_star[24];

    // State registers for this thread
    uint32_t state[8];
    uint32_t state_star[8];

    // Apply neutral bits for this thread's attempt (GPU version)
    // This function would derive W[0..15] and W*[0..15] from bases and global_attempt_idx
    // Requires passing neutral bit configuration (d_neutral_config, num_neutral_bits)
    // apply_neutral_bits_gpu(M_base, W, global_attempt_idx, d_neutral_config, num_neutral_bits);
    // apply_neutral_bits_gpu(M_star_base, W_star, global_attempt_idx, d_neutral_config, num_neutral_bits);
    // ** Placeholder: Assume W and W_star are correctly initialized based on attempt_idx **
    // Example: Copy base and apply difference for simplicity here
    for(int i=0; i<16; ++i) { W[i] = M_base[i]; W_star[i] = M_star_base[i]; }
    // Simple neutral bit application simulation (replace with actual apply_neutral_bits_gpu call)
    for (int i = 0; i < num_neutral_bits; ++i) {
        if ((global_attempt_idx >> i) & 1) {
             // int word_idx = d_neutral_config[i*2 + 0];
             // int bit_pos = d_neutral_config[i*2 + 1];
             // if (word_idx < 16) { W[word_idx] ^= (1 << bit_pos); W_star[word_idx] ^= (1 << bit_pos); }
        }
    }


    // Expand W and W* from 16 to 24 words directly in registers (assuming sigma0/1_gpu exist)
    for(int i=16; i<24; ++i) {
        // uint32_t s0 = sigma0_gpu(W[i-15]);
        // uint32_t s1 = sigma1_gpu(W[i-2]);
        // W[i] = add32_gpu(W[i-16], s0, W[i-7], s1); // Assumes add32_gpu handles multiple args

        // uint32_t s0_star = sigma0_gpu(W_star[i-15]);
        // uint32_t s1_star = sigma1_gpu(W_star[i-2]);
        // W_star[i] = add32_gpu(W_star[i-16], s0_star, W_star[i-7], s1_star);
         // ** Placeholder: Assume W/W* expansion is done correctly **
         W[i] = 0; W_star[i] = 0; // Needs real expansion logic
    }

    // Initialize states from IV (use global memory directly)
    for(int i=0; i<8; ++i) {
        state[i] = initial_IV[i];
        state_star[i] = initial_IV[i];
    }

    // --- Compute 24 rounds for M (assuming GPU functions exist) ---
    uint32_t a=state[0], b=state[1], c=state[2], d=state[3], e=state[4], f=state[5], g=state[6], h=state[7];
    for(int i=0; i<24; ++i) {
        // uint32_t T1 = add32_gpu(h, Sigma1_gpu(e), Ch_gpu(e, f, g), K_gpu[i], W[i]);
        // uint32_t T2 = add32_gpu(Sigma0_gpu(a), Maj_gpu(a, b, c));
        // h = g; g = f; f = e;
        // e = add32_gpu(d, T1);
        // d = c; c = b; b = a;
        // a = add32_gpu(T1, T2);
         // ** Placeholder: Assume round computation is done correctly **
    }
    // Save final state before DM add
    state[0]=a; state[1]=b; state[2]=c; state[3]=d; state[4]=e; state[5]=f; state[6]=g; state[7]=h;


    // --- Compute 24 rounds for M* (assuming GPU functions exist) ---
    a=state_star[0]; b=state_star[1]; c=state_star[2]; d=state_star[3]; e=state_star[4]; f=state_star[5]; g=state_star[6]; h=state_star[7];
    for(int i=0; i<24; ++i) {
        // uint32_t T1 = add32_gpu(h, Sigma1_gpu(e), Ch_gpu(e, f, g), K_gpu[i], W_star[i]);
        // uint32_t T2 = add32_gpu(Sigma0_gpu(a), Maj_gpu(a, b, c));
        // h = g; g = f; f = e;
        // e = add32_gpu(d, T1);
        // d = c; c = b; b = a;
        // a = add32_gpu(T1, T2);
         // ** Placeholder: Assume round computation is done correctly **
    }
    // Save final state before DM add
    state_star[0]=a; state_star[1]=b; state_star[2]=c; state_star[3]=d; state_star[4]=e; state_star[5]=f; state_star[6]=g; state_star[7]=h;

    // Store final states (before DM add) for host check
    // Use coalesced writes if possible: write all 8 words for M, then all 8 for M*
    int output_base_idx = global_attempt_idx * 16; // 8 for State_M, 8 for State_M*
    for(int i=0; i<8; ++i) {
        output_states[output_base_idx + i] = state[i];
        output_states[output_base_idx + 8 + i] = state_star[i];
    }
}

// Host code (C++) to launch the kernel and check results (Simplified)
void find_gpu_collision() {
    int total_num_attempts = 1 << 20; // e.g., 2^20 attempts
    int threads_per_block = 256;
    int num_blocks = (total_num_attempts + threads_per_block - 1) / threads_per_block;
    int actual_total_threads = num_blocks * threads_per_block; // Total threads launched

    // --- Host Data Setup ---
    uint32_t h_iv[8] = {0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A, 0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19};
    uint32_t h_m_base[16] = {0}; // Example base message
    uint32_t h_m_star_base[16] = {0};
    h_m_star_base[12] = h_m_base[12] ^ 0x00008000; // Example difference

    // Example neutral bit config (host side) - must match kernel access logic
    int h_neutral_config[] = {0, 31, 1, 30, 2, 29, 3, 31, 4, 30, 5, 29, 6, 28, 7, 27,
                              8, 31, 9, 30, 10, 29, 11, 28, 12, 27, 2, 30, 3, 29, 4, 28,
                              0, 15, 1, 14, 2, 13, 3, 12}; // 20 bits * 2 entries/bit = 40 ints
    int num_neutral_bits = 20;

    // --- Device Memory Allocation ---
    uint32_t *d_iv, *d_m_base, *d_m_star_base, *d_output_states;
    int *d_neutral_config;
    cudaMalloc(&d_iv, 8 * sizeof(uint32_t));
    cudaMalloc(&d_m_base, 16 * sizeof(uint32_t));
    cudaMalloc(&d_m_star_base, 16 * sizeof(uint32_t));
    cudaMalloc(&d_neutral_config, num_neutral_bits * 2 * sizeof(int));
    size_t output_size = actual_total_threads * 16 * sizeof(uint32_t); // Allocate for actual threads
    cudaMalloc(&d_output_states, output_size);

    // --- Memory Copy Host -> Device ---
    cudaMemcpy(d_iv, h_iv, 8 * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_m_base, h_m_base, 16 * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_m_star_base, h_m_star_base, 16 * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_neutral_config, h_neutral_config, num_neutral_bits * 2 * sizeof(int), cudaMemcpyHostToDevice);
    // Also need to copy K constants to __constant__ memory (not shown here)

    // --- Allocate Host Memory for Results ---
    uint32_t* h_output_states = (uint32_t*)malloc(output_size);
    if (!h_output_states) { fprintf(stderr, "Failed to allocate host memory for results\n"); return; }

    printf("Launching GPU Kernel (%d blocks, %d threads/block, %d total threads)...\n",
           num_blocks, threads_per_block, actual_total_threads);

    // --- Launch Kernel ---
    // Pass total number of actual threads for bounds checking inside kernel
    sha256_24_round_kernel<<>>(
        d_iv, d_m_base, d_m_star_base, d_output_states, actual_total_threads,
        d_neutral_config, num_neutral_bits);

    // Check for kernel launch errors
    cudaError_t kernelError = cudaGetLastError();
    if (kernelError != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(kernelError));
        // Cleanup and return...
        free(h_output_states); cudaFree(d_iv); /* ... free other d_ pointers ... */ return;
    }

    // Wait for kernel to finish and check for execution errors
    cudaError_t syncError = cudaDeviceSynchronize();
     if (syncError != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize failed after kernel launch: %s\n", cudaGetErrorString(syncError));
        // Cleanup and return...
        free(h_output_states); cudaFree(d_iv); /* ... free other d_ pointers ... */ return;
    }
    printf("GPU Kernel finished.\n");

    // --- Copy results back to host ---
    cudaError_t memcpyError = cudaMemcpy(h_output_states, d_output_states, output_size, cudaMemcpyDeviceToHost);
    if (memcpyError != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy D->H failed: %s\n", cudaGetErrorString(memcpyError));
        // Cleanup and return...
        free(h_output_states); cudaFree(d_iv); /* ... free other d_ pointers ... */ return;
    }

    // --- Check for collisions on the host ---
    printf("Checking results on host...\n");
    bool collision_found_host = false;
    for (int i = 0; i < actual_total_threads; ++i) { // Iterate up to actual threads launched
        bool collision = true;
        int offset = i * 16; // Base index for attempt i's results
        for (int j = 0; j < 8; ++j) { // Compare state words A-H
            if (h_output_states[offset + j] != h_output_states[offset + 8 + j]) {
                collision = false;
                break;
            }
        }
        if (collision) {
            printf("Collision found at attempt index %d!\n", i);
            // To show M/M*, you would need to reconstruct them on the host
            // using the same apply_neutral_bits logic used in the kernel for attempt 'i'.
            // Example: reconstruct_M_on_host(h_m_base, i, h_neutral_config, num_neutral_bits);
            collision_found_host = true;
            break; // Found first collision
        }
    }
    if (!collision_found_host) {
        printf("No collision found on host within %d attempts.\n", actual_total_threads);
    }
    printf("Host check complete.\n");

    // --- Cleanup ---
    free(h_output_states);
    cudaFree(d_iv);
    cudaFree(d_m_base);
    cudaFree(d_m_star_base);
    cudaFree(d_output_states);
    cudaFree(d_neutral_config);
}

// int main() { find_gpu_collision(); return 0; } // Example main

This pseudo-code outlines the structure. A real implementation requires careful handling of the message schedule expansion within the kernel (potentially parallelizing it), efficient neutral bit application on the GPU, correct memory management (especially coalescing), and implementation of all SHA-256 functions for the GPU.

3.8 Quantum-Inspired Collision (Classical Simulation)

This code simulates the idea of Grover's search classically. It does not provide a quantum speedup.


import math # Ensure math is imported

def quantum_inspired_collision_simulation():
    """Classically simulates the structure of a Grover-like search."""
    print("Starting Quantum-Inspired Collision Simulation (Classical Hardware)...")
    print("** This does NOT provide a real quantum speedup! **")

    NUM_ROUNDS_QI = 24
    # Define the size of the search space being simulated (e.g., based on neutral bits)
    TOTAL_SEARCH_SPACE_EXPONENT = len(NEUTRAL_BITS_CONFIG) # e.g., 20 bits from config above
    CLASSICAL_ATTEMPTS = 2**TOTAL_SEARCH_SPACE_EXPONENT
    # Theoretical quantum attempts = sqrt(CLASSICAL_ATTEMPTS)
    SIMULATED_ATTEMPTS = int(math.sqrt(CLASSICAL_ATTEMPTS)) # ~2^(20/2) = 2^10 = 1024

    print(f"Classically simulating ~{SIMULATED_ATTEMPTS} 'quantum' iterations (sqrt of 2^{TOTAL_SEARCH_SPACE_EXPONENT}).")

    IV = IV_STD
    # Base messages depend on the specific differential being targeted
    M_base = [0] * 16
    M_star_base = list(M_base)
    M_star_base[12] ^= 0x00008000 # Example difference

    found = False
    # Simulate probing the search space sqrt(N) times
    for i in range(SIMULATED_ATTEMPTS):
        # Use a pseudo-random index based on iteration to simulate probing the space
        # This doesn't truly mimic Grover's amplitude amplification but shows reduced checks.
        attempt_num = random.randint(0, CLASSICAL_ATTEMPTS - 1) # Pick a random candidate

        W = apply_neutral_bits_differential(M_base, attempt_num) # Reuse neutral bit logic
        W_star = apply_neutral_bits_differential(M_star_base, attempt_num)

        # Check collision based on state before DM add
        _, states_M = sha256_compress(IV, W, NUM_ROUNDS_QI)
        _, states_M_star = sha256_compress(IV, W_star, NUM_ROUNDS_QI)
        H_M_state = states_M[NUM_ROUNDS_QI]
        H_M_star_state = states_M_star[NUM_ROUNDS_QI]


        # Oracle check: Did we find a collision in the state?
        if H_M_state == H_M_star_state:
            print(f"\n--- Quantum-Inspired Simulation Found Collision ---")
            print(f"Found at simulated iteration {i+1} (probed classical index {attempt_num}).")
            print(f"M: {[f'0x{w:08x}' for w in W[:16]]}")
            print(f"M*: {[f'0x{w:08x}' for w in W_star[:16]]}")
            # Report the final hash output after DM add
            final_hash, _ = sha256_compress(IV, W, NUM_ROUNDS_QI)
            print(f"Hash Output: {[f'0x{h:08x}' for h in final_hash]}")
            found = True
            break # Exit simulation

    if not found:
        print(f"\nNo collision found within {SIMULATED_ATTEMPTS} simulated quantum iterations.")

# quantum_inspired_collision_simulation() # Example call

This simulation highlights the potential reduction in queries Grover's algorithm offers, but the actual work per iteration remains classical.

3.9 Reconstruction from Rounds 46–64

Reverses SHA-256 from known late-round states, assuming standard IV and a known input block (e.g., all zeros).


def rebuild_sha256_from_late_rounds(known_states_47_to_64):
    """
    Reconstructs IV and W[0..15] from known states AFTER rounds 46-63
    (which are states indexed 47 to 64 in the round_states list returned by sha256_compress).
    Assumes standard IV and an all-zero input block W[0..15]=0 fed into the compression function.
    """
    print("Starting reconstruction from rounds 46-64...")
    # Input should be states[47]...states[64] (18 states total)
    # state[i] is the state *after* round i-1 computation.
    # So known_states_47_to_64[0] is state[47] (after round 46)
    # known_states_47_to_64[17] is state[64] (after round 63, before final DM add)
    if len(known_states_47_to_64) != 18: # 64 - 47 + 1 = 18 states needed
        raise ValueError(f"Need exactly 18 states (output of round 46 to output of round 63), got {len(known_states_47_to_64)}")

    # Initialize state arrays (A-H) and message schedule array (W)
    # state[i] holds the tuple (A,B,C,D,E,F,G,H) *after* round i-1 computation.
    # So state[0] is IV, state[64] is the state *before* final addition.
    state = [(0,) * 8] * 65 # Indices 0..64
    W = [0] * 64            # Indices 0..63

    # Populate known states (output of rounds 46..63 -> stored in state[47]..state[64])
    for i in range(18):
        state_index = 47 + i # Store in state[47]..state[64]
        state[state_index] = known_states_47_to_64[i]

    print("Known states populated from state[47] to state[64].")

    # Reconstruct W[i] and state[i] backwards from round 63 down to 0
    for i in range(63, -1, -1):
        # State *after* round i (A_ip1, ..., H_ip1) is state[i+1]
        A_ip1, B_ip1, C_ip1, D_ip1, E_ip1, F_ip1, G_ip1, H_ip1 = state[i+1]

        # Derive state variables *before* round i (input to round i) using simple shifts/assignments first
        # These are the state variables at the *start* of round i+1 computation, which are derived
        # from the state variables at the *start* of round i computation after the shifts.
        # A_i = state[i][0], B_i = state[i][1], etc.
        # We know: B_ip1 = A_i, C_ip1 = B_i, D_ip1 = C_i, F_ip1 = E_i, G_ip1 = F_i, H_ip1 = G_i
        A_i = B_ip1
        B_i = C_ip1
        C_i = D_ip1
        # D_i needs T1
        E_i = F_ip1
        F_i = G_ip1
        G_i = H_ip1
        # H_i needs T1

        # Calculate T2 using A_i, B_i, C_i (which are now known)
        T2 = add32(Sigma0(A_i), Maj(A_i, B_i, C_i))

        # Calculate T1 using A_ip1 and T2: A_ip1 = T1 + T2 => T1 = A_ip1 - T2
        T1 = add32(A_ip1, -T2) # add32 handles modulo 2^32 arithmetic

        # Calculate D_i using E_ip1 and T1: E_ip1 = D_i + T1 => D_i = E_ip1 - T1
        D_i = add32(E_ip1, -T1)

        # Now we need to find W[i] and H_i from the T1 equation:
        # T1 = H_i + Sigma1(E_i) + Ch(E_i, F_i, G_i) + K[i] + W[i]

        # Determine W[i] by isolating it:
        # W[i] = T1 - H_i - Sigma1(E_i) - Ch(E_i, F_i, G_i) - K[i]
        # However, we don't know H_i yet. We need to use the message schedule.

        # If i >= 16, W[i] is determined by the forward schedule using W values we should have already reconstructed
        # (since we are going backwards, W[i-2], W[i-7], W[i-15], W[i-16] correspond to later rounds' inputs).
        # If i < 16, W[i] is part of the original message block.

        # Let's try to find H_i first, then W[i]
        # H_i = G_ip1 (This was wrong, H_ip1 = G_i) -> G_i = H_ip1
        # We already found G_i = H_ip1 above.

        # Re-evaluate T1 equation: T1 = H_i + Sigma1(E_i) + Ch(E_i, F_i, G_i) + K[i] + W[i]
        # We know T1, E_i, F_i, G_i, K[i]. We need H_i and W[i].

        # Let's use the message schedule logic first.
        # If i >= 16, calculate W[i] based on previously reconstructed W's
        if i >= 16:
            # Ensure indices are valid (we are going backward, indices i-16 etc are *later* rounds)
            # Check if the needed W values have been computed in previous iterations of this loop
            # (Since we iterate i=63 down to 16, W[i-2], W[i-7], W[i-15], W[i-16] should be available if W array is filled correctly)
            s0_rec = sigma0(W[i-15]) # W[i-15] was computed when loop variable was i-15
            s1_rec = sigma1(W[i-2])  # W[i-2] was computed when loop variable was i-2
            w_i_calc = add32(W[i-16], s0_rec, W[i-7], s1_rec)
            W[i] = w_i_calc # Store reconstructed W[i]
        else: # i < 16
            # For rounds 0-15, W[i] is unknown initially. We *assume* it's 0 for this specific reconstruction scenario.
            # A more general reconstruction wouldn't make this assumption.
            # We will derive W[i] from the T1 equation instead, and later check if it matches 0.
            pass # Defer calculation of W[i] for i < 16

        # Now calculate H_i using T1 and W[i] (if i >= 16)
        # H_i = T1 - Sigma1(E_i) - Ch(E_i, F_i, G_i) - K[i] - W[i]
        # If i < 16, we can't calculate H_i yet as W[i] is unknown.
        # There must be a simpler way... Ah, H_i is simply G_i shifted! H_i = G_i_prev_round = state[i][7]
        # Let's use the state assignments directly:
        # state[i] = (A_i, B_i, C_i, D_i, E_i, F_i, G_i, H_i)
        # We need H_i. We know H_ip1 = G_i. This doesn't help find H_i.

        # Let's rethink the T1 equation and W[i] derivation.
        # T1 = H_i + Sigma1(E_i) + Ch(E_i, F_i, G_i) + K[i] + W[i]
        # We know T1, E_i, F_i, G_i, K[i].
        # We need to find H_i and W[i].

        # Try isolating W[i] first, assuming we can find H_i.
        # H_i is the value of H *before* round i. It's state[i][7].
        # But state[i] is what we are trying to compute!

        # Let's use the known state relations more carefully.
        # State at end of round i: (A_ip1, B_ip1, C_ip1, D_ip1, E_ip1, F_ip1, G_ip1, H_ip1) = state[i+1]
        # State at start of round i: (A_i, B_i, C_i, D_i, E_i, F_i, G_i, H_i) = state[i]
        # Relations:
        # A_ip1 = (H_i + S1(E_i) + Ch(E_i,F_i,G_i) + K_i + W_i) + (S0(A_i) + Maj(A_i,B_i,C_i))
        # B_ip1 = A_i
        # C_ip1 = B_i
        # D_ip1 = C_i
        # E_ip1 = D_i + (H_i + S1(E_i) + Ch(E_i,F_i,G_i) + K_i + W_i)
        # F_ip1 = E_i
        # G_ip1 = F_i
        # H_ip1 = G_i

        # From these, we can directly find A_i, B_i, C_i, E_i, F_i, G_i from state[i+1]:
        A_i = B_ip1
        B_i = C_ip1
        C_i = D_ip1
        E_i = F_ip1
        F_i = G_ip1
        G_i = H_ip1

        # Now calculate T2 = Sigma0(A_i) + Maj(A_i, B_i, C_i)
        T2 = add32(Sigma0(A_i), Maj(A_i, B_i, C_i))
        # Calculate T1 = A_ip1 - T2
        T1 = add32(A_ip1, -T2)

        # Now find D_i from E_ip1 = D_i + T1 => D_i = E_ip1 - T1
        D_i = add32(E_ip1, -T1)

        # Finally, we need H_i and W[i]. Use the T1 definition:
        # T1 = H_i + Sigma1(E_i) + Ch(E_i, F_i, G_i) + K[i] + W[i]
        # Rearrange: H_i + W[i] = T1 - Sigma1(E_i) - Ch(E_i, F_i, G_i) - K[i]
        Sum_H_W = add32(T1, -Sigma1(E_i), -Ch(E_i, F_i, G_i), -K[i])

        # Now we need either H_i or W[i] to find the other.
        # If i >= 16, we can calculate W[i] from the message schedule using already found W values.
        if i >= 16:
            s0_rec = sigma0(W[i-15])
            s1_rec = sigma1(W[i-2])
            W[i] = add32(W[i-16], s0_rec, W[i-7], s1_rec)
            # Now find H_i = Sum_H_W - W[i]
            H_i = add32(Sum_H_W, -W[i])
        else: # i < 16
            # We assume W[i] = 0 for the reconstruction of the zero block.
            # Let's calculate H_i based on this assumption.
            assumed_W_i = 0
            H_i = add32(Sum_H_W, -assumed_W_i)
            # Store the assumed W[i] for verification later.
            W[i] = assumed_W_i

        # Store the fully reconstructed state *before* round i (i.e., the input state for round i)
        state[i] = (A_i, B_i, C_i, D_i, E_i, F_i, G_i, H_i)
        # print(f"Reconstructed state[{i}]: {[f'0x{x:08x}' for x in state[i]]}, W[{i}]=0x{W[i]:08x}") # Verbose


    # Extract the reconstructed IV (state before round 0, which is state[0])
    reconstructed_IV = state[0]
    reconstructed_W0_15 = W[:16]

    print("\n--- Reconstruction Complete ---")
    print(f"Reconstructed IV: {[f'0x{x:08x}' for x in reconstructed_IV]}")
    print(f"Standard IV:      {[f'0x{x:08x}' for x in IV_STD]}")
    print(f"IV Match: {reconstructed_IV == tuple(IV_STD)}")

    print(f"\nReconstructed Input Block W[0..15]:")
    print(' '.join([f"0x{w:08x}" for w in reconstructed_W0_15]))
    assumed_W0_15 = [0] * 16
    print(f"Assumed Input Block W[0..15] (for this scenario):")
    print(' '.join([f"0x{w:08x}" for w in assumed_W0_15]))
    print(f"Input Block Match: {reconstructed_W0_15 == assumed_W0_15}")

    # Optional: Print full reconstructed W[0..63]
    # print(f"\nFull Reconstructed W[0..63]:")
    # print(' '.join([f'0x{w:08x}' for w in W]))

    # Final validation: Re-compute forward and check states
    print("\nRunning forward validation...")
    validation_hash_output, validation_states = sha256_compress(reconstructed_IV, reconstructed_W0_15, 64)
    valid = True
    # Check if reconstructed states match the known input states
    for r in range(47, 65): # Check states state[47]..state[64]
        known_state_idx = r - 47
        if state[r] != known_states_47_to_64[known_state_idx]:
             print(f"State mismatch at index {r} (Input vs Reconstructed)!")
             print(f"  Input Known:   {[f'0x{x:08x}' for x in known_states_47_to_64[known_state_idx]]}")
             print(f"  Reconstructed: {[f'0x{x:08x}' for x in state[r]]}")
             valid = False
             # break # Stop at first error

    # Check if forward computation using reconstructed IV/W matches known input states
    for r in range(47, 65): # Check states state[47]..state[64]
        known_state_idx = r - 47
        if validation_states[r] != known_states_47_to_64[known_state_idx]:
             print(f"State mismatch at index {r} (Forward Validation vs Input)!")
             print(f"  Input Known:    {[f'0x{x:08x}' for x in known_states_47_to_64[known_state_idx]]}")
             print(f"  Forward Valid:  {[f'0x{x:08x}' for x in validation_states[r]]}")
             valid = False
             # break # Stop at first error

    print(f"Forward validation match: {valid}")


# --- Function to get actual late states for testing ---
def get_late_states_for_zero_block():
    """ Computes SHA-256 for an all-zero block and returns states AFTER rounds 46-63 """
    zero_block = [0] * 16
    # Standard padding for a single 512-bit block: append '1' bit (0x80...), then zeros, then 64-bit length (512 = 0x200)
    # For a block that is exactly 512 bits, padding requires a *second* block.
    # Let's assume the input is just the 16 zero words, without padding, for simplicity of reconstruction demo.
    # If padding was included, the W block would be different.
    # Hashing exactly 512 bits (16 words) means W[0..15] are the message, padding starts in the *next* block.
    # Let's hash the 16 zero words directly.
    print("Computing forward SHA-256 for all-zero block (W[0..15]=0) with standard IV...")
    final_hash, all_states = sha256_compress(IV_STD, zero_block, 64)
    # all_states[0] = state before round 0 (IV)
    # all_states[1] = state after round 0
    # ...
    # all_states[47] = state after round 46
    # ...
    # all_states[64] = state after round 63 (state before final DM add)
    if len(all_states) != 65:
         raise ValueError(f"sha256_compress returned {len(all_states)} states, expected 65")
    # Return states AFTER round 46 up to AFTER round 63 (indices 47 through 64)
    print("Forward computation complete.")
    return all_states[47:65] # 18 states needed for rebuild function

# --- Example Usage ---
try:
    # Get the actual late states from processing a zero block
    actual_late_states = get_late_states_for_zero_block()
    print(f"Obtained {len(actual_late_states)} late states (output of rounds 46-63) for reconstruction.")
    # Optional: Print the states being used for reconstruction
    # print("States used for reconstruction (state[47] to state[64]):")
    # for i, st in enumerate(actual_late_states):
    #     print(f" State[{47+i}]: {[f'0x{x:08x}' for x in st]}")

    if len(actual_late_states) == 18:
         rebuild_sha256_from_late_rounds(actual_late_states)
    else:
         print(f"Error: Failed to obtain the correct number of late states (needed 18, got {len(actual_late_states)}).")

except Exception as e:
     print(f"\nError during reconstruction or state generation: {e}")
     traceback.print_exc()
     print("Ensure correct state data and logic.")

This reconstruction code demonstrates the backward calculation. Its success relies heavily on the accuracy of the provided late-round states and the correctness of the inversion logic, especially concerning the message schedule and the assumption about the input block.

Table 3: Example State Differences (`Δ = State ⊕ State*`) at Key Rounds (Illustrative)
Technique	Round (i)	ΔAi+1	ΔEi+1	ΔW[i]	Notes
Differential	11 (after)	≈0	≈0	ΔW[0..11]≈0	End of controlled phase (neutral bits)
Differential	12 (after)	e.g., 80000000	e.g., 80000000	ΔW[12]=8000	Difference introduced & propagates (simplified example)
Differential	17 (after)	e.g., 01ABCDEF	e.g., FEDCBA10	ΔW[17]≠0	Difference evolves via state/schedule
Differential	23 (after)	00000000	00000000	ΔW[23]≠0	Target: Zero difference before final add
Boomerang	11 (after)	β_A (non-zero)	β_E (non-zero)	...	Difference after E1 path (state before DM add)
Boomerang	12 (Input)	Input Diff for E2	Input Diff for E2	γ (non-zero)	Difference for E2 path introduced

Appendix A: Differential Characteristics (Ω) and Constraint Equations (G) for SHA-256

This appendix provides a conceptual overview and illustrative examples of differential characteristics (Ω) and the associated set of constraint equations (G) for SHA-256, as referenced in Section 2.1.1 and relevant to differential cryptanalysis techniques.

A.1 The Differential Characteristic Ω

A differential characteristic, denoted Ω, specifies the desired propagation of differences through the rounds of the SHA-256 compression function for a specific attack (e.g., a collision attack over N rounds). It is defined by a tuple:

Ω = (ΔM, {ΔState_i}_i=0..N, {ΔW_i}_i=0..N-1)

Where:

ΔM: The initial difference between the two input message blocks (M and M*), typically defined as the XOR difference ΔM = M ⊕ M*. This difference is usually localized to a few words in the first 16 words (W[0]..W[15]).
{ΔState_i}_i=0..N: The sequence of expected XOR differences in the internal state variables (A, B, C, D, E, F, G, H) over N rounds.
- ΔState_i = (ΔA_i, ΔB_i, ..., ΔH_i) represents the expected XOR difference in the state variables before the computation of round i (i.e., the output state of round i-1). Let S_i[j] denote the j-th bit of state variable S before round i. Then ΔS_i[j] = S_i[j] ⊕ S*_i[j].
- ΔState₀ is typically zero if both computations start with the same IV.
- For a collision attack, the target is often ΔState_N = (0, 0, ..., 0) before the final Davies-Meyer feed-forward addition.
{ΔW_i}_i=0..N-1: The sequence of expected XOR differences in the expanded message schedule words.
- ΔW_i = W_i ⊕ W*_i. Let W_i[j] denote the j-th bit.
- For i < 16, ΔW_i is determined directly by ΔM.
- For i ≥ 16, ΔW_i depends on the differences in the previous message words (ΔW_i-16, ΔW_i-15, ΔW_i-7, ΔW_i-2) and the properties of the SHA-256 message schedule functions (σ0, σ1) and modular addition. Controlling ΔW_i according to the characteristic is crucial.

Finding a characteristic Ω with a high probability of occurrence (i.e., minimizing the number of conditions that must be met in probabilistic steps) is the first step in a differential attack.

A.2 The Set of Constraint Equations G

The set G represents the collection of conditions (equations) that must be satisfied by the actual values of the message words (M, M*) and the intermediate state variables during computation for the specific differential characteristic Ω to hold. If these conditions are not met at any round, the propagation of differences will likely deviate from the path defined by Ω.

These equations arise primarily from the non-linear operations within the SHA-256 round function and message schedule:

Boolean Functions (Ch, Maj, Σ0, Σ1, σ0, σ1): For a given input difference (e.g., ΔX) to a function f, the output difference Δf = f(X) ⊕ f(X*) must match the difference specified in Ω. This imposes conditions on the specific bit values of the inputs X and X*. For example, for Ch(E, F, G), the condition Ch(Ei, Fi, Gi) ⊕ Ch(Ei⊕ΔEi, Fi⊕ΔFi, Gi⊕ΔGi) = ΔChi must hold, where ΔCh_i is derived from Ω. These often translate to linear equations over GF(2) involving the input bits.
Modular Addition (+ mod 2³²): This is often the most significant source of constraints. The XOR difference propagation through addition is complex and depends on carry bits. Let Z = X + Y and Z* = X* + Y*. We need Z ⊕ Z* = ΔZ_target. Whether this holds depends not only on ΔX = X⊕X* and ΔY = Y⊕Y*, but also on the actual values of X and Y because they determine the carry propagation. These conditions translate into many bit-level equations involving the input bits (X[j], Y[j]) and carry bits (c[j]) at each position j. These are non-linear over GF(2).
Message Schedule Expansion: For i ≥ 16, the equation is W[i] = W[i-16] + σ0(W[i-15]) + W[i-7] + σ1(W[i-2]). The difference ΔW[i] must conform to the characteristic Ω. This imposes conditions derived from the propagation of differences through σ0, σ1, and the three modular additions, relating ΔW[i] to ΔW_i-16, ΔW_i-15, ΔW_i-7, ΔW_i-2 and the actual values of the words involved.

Satisfying the set of equations G is the core challenge in the message modification phase of a differential attack. Techniques used include:

Choosing a "good" characteristic Ω: One where G contains fewer or easier-to-satisfy equations (higher probability `p`).
Message Modification: Iteratively adjusting bits of the message words (M, M*) to fulfill the conditions in G round by round.
Neutral Bits: Using bits in the initial message M that can be flipped without affecting the characteristic Ω for several early rounds. These bits provide degrees of freedom to help satisfy equations in later, more difficult rounds.
Automated Tools: Using SAT/SMT solvers or specialized search algorithms to find message pairs (M, M*) that satisfy G for a given Ω.

In summary, Ω defines the path, and G defines the conditions required to stay on that path. Successful differential cryptanalysis hinges on finding a viable path Ω and an efficient method to find messages M, M* that satisfy the corresponding constraints G.

A.3 Example Constraint Equations (Set G - Illustrative)

The following table shows examples of the type of bit-level conditions that might appear in the set G for a specific SHA-256 characteristic Ω, focusing on a hypothetical round i. These equations ensure that the actual XOR difference propagation matches the target difference specified in Ω. (Notation: S_i[j] is the j-th bit of state variable S before round i, ΔS_i[j] is the corresponding XOR difference bit, c_i[j] is the j-th carry bit in an addition during round i).

Table A.1: Example Bit-Level Constraint Equations for Round i (Illustrative)
Source Operation	Example Constraint Equation	Notes
Maj(Ai, Bi, Ci)	(Ai[5] ⊕ Bi[5])·ΔCi[5] ⊕ (Ai[5] ⊕ Ci[5])·ΔBi[5] ⊕ (Bi[5] ⊕ Ci[5])·ΔAi[5] = ΔMaji[5]	Ensures XOR difference ΔMaji[5] (from Ω) holds. This is a linear equation in Ai[5], Bi[5], Ci[5] if ΔA, ΔB, ΔC are fixed by Ω.
Ch(Ei, Fi, Gi)	(Fi[12] ⊕ Gi[12])·ΔEi[12] ⊕ Ei[12]·ΔFi[12] ⊕ (1 ⊕ Ei[12])·ΔGi[12] = ΔChi[12]	Ensures XOR difference ΔChi[12] (from Ω) holds. Linear in Fi[12], Gi[12] if Ei[12] and Δ's are fixed.
Addition (e.g., Z = X + Y)	X[j] ⊕ Y[j] ⊕ c[j] = Z[j] ⊕ c[j+1]	Bit-level relation involving carry c[j] (carry into position j) and c[j+1] (carry out of position j).
Addition (XOR difference)	ΔX[j] ⊕ ΔY[j] ⊕ (c[j] ⊕ c[j]) = ΔZ[j] ⊕ (c[j+1] ⊕ c[j+1])	Condition on the difference propagation through addition. Requires c[j] ⊕ c*[j] (difference in input carry) to be controlled/predicted to ensure target ΔZ[j]. This often leads to conditions on X[k], Y[k] for k < j.
Message Schedule (σ0)	ΔWi-15[7] ⊕ ΔWi-15[18] ⊕ ΔWi-15[3] = Δσ0i-15[...]	Condition arising from difference propagation through σ0 (linear over GF(2)).

A.4 Example Differential Characteristic Segment (Ω - 39-Round Example)

Instead of a purely illustrative example, the following table shows the actual XOR difference propagation derived from a known 39-round collision for the SHA-256 compression function. This collision uses the specific initial vector H0_collision = [0x02b19d5a, ..., 0x651b92e7] and the message block pair W0_collision / W1_collision (differing primarily in words 8-12) as defined in the demonstration code found in SHA-256 Tools or similar sources detailing this specific collision.

This table concretely illustrates the state difference sequence {ΔState_i} and message word difference sequence {ΔW_i} components of the underlying differential characteristic Ω for this specific collision pair over 39 rounds. It shows how differences introduced in the message words propagate through the state variables (A-H). Note that the final state difference (after round 38, input to round 39's hypothetical DM add) is zero, confirming the collision. This table shows the *path*, but does not explicitly list the set G of constraint equations that needed to be satisfied for this path to hold.

Table A.2: Difference Propagation for a Known 39-Round SHA-256 Collision
Round (r)	ΔWr (W'r)	ΔAr+1 (A'r+1)	ΔBr+1 (B'r+1)	ΔCr+1 (C'r+1)	ΔDr+1 (D'r+1)	ΔEr+1 (E'r+1)	ΔFr+1 (F'r+1)	ΔGr+1 (G'r+1)	ΔHr+1 (H'r+1)
0	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000
1	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000
2	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000
3	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000
4	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000
5	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000
6	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000
7	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000
8	10000000	00000000	00000000	00000000	00000000	10000000	00000000	00000000	00000000
9	10000400	00000400	00000000	00000000	00000000	10000000	10000000	00000000	00000000
10	10000000	00000400	00000400	00000000	00000000	10000400	10000000	10000000	00000000
11	018088EC	00000000	00000400	00000400	00000000	01808CEC	10000400	10000000	10000000
12	00040600	00000000	00000000	00000400	00000400	000402EC	01808CEC	10000400	10000000
13	00000000	00000000	00000000	00000000	00000400	000002EC	000402EC	01808CEC	10000400
14	00000000	00000000	00000000	00000000	00000000	00000000	000002EC	000402EC	01808CEC
15	00000000	00000000	00000000	00000000	00000000	00000000	00000000	000002EC	000402EC
16	10000000	00000000	00000000	00000000	00000000	10000000	00000000	00000000	000002EC
17	10000400	00000400	00000000	00000000	00000000	10000000	10000000	00000000	00000000
18	10000000	00000400	00000400	00000000	00000000	10000400	10000000	10000000	00000000
19	018088EC	00000000	00000400	00000400	00000000	01808CEC	10000400	10000000	10000000
20	00040600	00000000	00000000	00000400	00000400	000402EC	01808CEC	10000400	10000000
21	00000000	00000000	00000000	00000000	00000400	000002EC	000402EC	01808CEC	10000400
22	00000000	00000000	00000000	00000000	00000000	00000000	000002EC	000402EC	01808CEC
23	00000000	00000000	00000000	00000000	00000000	00000000	00000000	000002EC	000402EC
24	10000000	00000000	00000000	00000000	00000000	10000000	00000000	00000000	000002EC
25	10000400	00000400	00000000	00000000	00000000	10000000	10000000	00000000	00000000
26	10000000	00000400	00000400	00000000	00000000	10000400	10000000	10000000	00000000
27	018088EC	00000000	00000400	00000400	00000000	01808CEC	10000400	10000000	10000000
28	00040600	00000000	00000000	00000400	00000400	000402EC	01808CEC	10000400	10000000
29	00000000	00000000	00000000	00000000	00000400	000002EC	000402EC	01808CEC	10000400
30	00000000	00000000	00000000	00000000	00000000	00000000	000002EC	000402EC	01808CEC
31	00000000	00000000	00000000	00000000	00000000	00000000	00000000	000002EC	000402EC
32	10000000	00000000	00000000	00000000	00000000	10000000	00000000	00000000	000002EC
33	10000400	00000400	00000000	00000000	00000000	10000000	10000000	00000000	00000000
34	10000000	00000400	00000400	00000000	00000000	10000400	10000000	10000000	00000000
35	018088EC	00000000	00000400	00000400	00000000	01808CEC	10000400	10000000	10000000
36	00040600	00000000	00000000	00000400	00000400	000402EC	01808CEC	10000400	10000000
37	00000000	00000000	00000000	00000000	00000400	000002EC	000402EC	01808CEC	10000400
38	00000000	00000000	00000000	00000000	00000000	00000000	000002EC	000402EC	01808CEC
(39)	N/A	00000000	00000000	00000000	00000000	00000000	00000000	00000000	00000000