Concept

The Allure of Causal Inference in Strategy Validation

In the quantitative evaluation of trading strategies, the pursuit of causal inference represents a foundational challenge. The core question is whether a specific strategy (the “treatment”) causes an observable outcome, such as superior risk-adjusted returns, or if the observed performance is merely a product of selection bias. A strategy might be deployed more frequently on assets that were already poised to perform well, creating a misleading correlation between the strategy’s application and its apparent success.

Propensity Score Matching (PSM) entered the financial econometrician’s toolkit as a proposed method to untangle this knot, offering a statistical technique to approximate the conditions of a randomized experiment within the confines of observational market data. The objective of PSM is to create a synthetic control group by matching “treated” instances (e.g. trades executed with a specific algorithmic strategy) with “untreated” instances (e.g. trades executed with a different strategy or a benchmark) that share a similar probability, or propensity, of having been treated.

This probability is encapsulated in a single scalar value, the propensity score, which is typically derived from a logistic regression model based on a set of observable pre-treatment characteristics, known as covariates. In the context of trading, these covariates could include variables like market volatility, order book depth, an asset’s momentum, or the time of day. By matching on this single score, PSM aims to balance the distribution of these observed covariates between the treated and control groups, thereby reducing the bias that would otherwise confound the performance evaluation. The promise is a cleaner, more reliable estimate of the strategy’s true effect, isolated from the market conditions that might have prompted its use in the first place.
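As a concrete sketch of this step, the propensity score is simply a fitted logistic model evaluated on pre-treatment covariates. The coefficients below are hypothetical stand-ins for values that would, in practice, be estimated from the strategy-assignment data:

```python
import math

def propensity_score(volatility, book_depth, momentum,
                     b0=-2.0, b_vol=80.0, b_depth=-0.5, b_mom=1.5):
    """P(strategy deployed | covariates) under a logistic model.

    All coefficients are illustrative placeholders, not fitted values.
    """
    z = b0 + b_vol * volatility + b_depth * book_depth + b_mom * momentum
    return 1.0 / (1.0 + math.exp(-z))

# Example: a moderately volatile, liquid market with mild momentum.
score = propensity_score(volatility=0.02, book_depth=1.0, momentum=0.1)
```

With b_vol positive, higher pre-trade volatility raises the estimated probability that the strategy is deployed; matching then pairs trades with similar values of this single number.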

This approach is intellectually appealing because it offers a structured, quantitative framework for addressing a problem that has long plagued performance analysis. It suggests a pathway to move from simple correlation to a more robust, causal understanding of a strategy’s value.

Propensity Score Matching seeks to mitigate selection bias in non-experimental data by creating a balanced comparison group based on the estimated probability of receiving a treatment.

The Theoretical Underpinnings of Propensity Score Matching

The theoretical justification for PSM rests on two critical assumptions. The first is the “unconfoundedness” or “selection on observables” assumption, which posits that, conditional on the observed covariates, the assignment of the treatment is effectively random. In trading terms, this means that all the factors influencing both the decision to use a particular strategy and its potential outcome are captured in the data.

The second is the “common support” or “overlap” assumption, which requires that for any given set of covariate values, there is a non-zero probability of being both treated and untreated. This ensures that for any treated trade, there exists a pool of potential control trades with similar characteristics from which to draw a match.

When these assumptions hold, the propensity score has a powerful balancing property: conditioning on the propensity score is sufficient to balance the distribution of all observed covariates between the treated and control groups. This dimensional reduction is the primary appeal of PSM. Instead of attempting to match on a potentially large and complex set of individual covariates, the analyst can match on a single, continuous variable. This simplifies the matching process considerably and appears to offer an elegant solution to a multidimensional problem.
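Mechanically, the matching step reduces to nearest-neighbor search on that single number. A minimal sketch of greedy one-to-one matching with a caliper (the caliper of 0.05 is an illustrative convention, not a prescribed value):

```python
def match_on_score(treated_scores, control_scores, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on the propensity score.

    Returns (treated_index, control_index) pairs. Each control is used
    at most once, and a pair is kept only if the score gap is within
    the caliper; unmatched treated units are pruned.
    """
    available = dict(enumerate(control_scores))  # index -> score
    pairs = []
    for t_idx, t_score in enumerate(treated_scores):
        if not available:
            break
        c_idx = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_idx] - t_score) <= caliper:
            pairs.append((t_idx, c_idx))
            del available[c_idx]
    return pairs
```

The simplicity is the point: nothing in this procedure looks at the covariates themselves, only at their one-dimensional summary.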

The result, in theory, is a matched dataset where the only systematic difference between the two groups is the application of the trading strategy, allowing for a more direct and unbiased comparison of their outcomes. However, the practical application of this theory in the complex and dynamic environment of financial markets reveals significant limitations that can undermine the validity of its conclusions.


Strategy

The Flawed Experimental Ideal

A primary strategic limitation of Propensity Score Matching stems from the type of experiment it seeks to approximate. PSM attempts to replicate a completely randomized experiment, where treatment is assigned based solely on a known probability. In contrast, other matching methods, such as Mahalanobis Distance Matching (MDM), aim to approximate a more powerful and efficient design: the fully blocked randomized experiment. In a fully blocked design, units are first grouped into blocks with identical covariate values, and then randomization is performed within each block.

This process eliminates all in-sample bias related to the observed covariates by design. A completely randomized experiment, on the other hand, only controls for this bias on average across many hypothetical repetitions of the experiment. For the single dataset that a trading analyst possesses, significant imbalances can and do remain.

This distinction is far from academic. By reducing a multidimensional set of covariates to a single propensity score, PSM becomes blind to the underlying information contained within those covariates. Two trading scenarios with identical propensity scores can have vastly different underlying market conditions. For example, a high-volatility, low-liquidity environment might produce the same propensity score for deploying a mean-reversion strategy as a low-volatility, high-liquidity environment, if the two covariates enter the fitted model with offsetting effects.

Matching these two scenarios obscures critical information and fails to eliminate the very imbalances the method is meant to address. Other matching methods that operate on the full covariate space are better able to preserve this information and achieve a more precise balance, approximating the superior experimental ideal of a fully blocked design.
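For intuition, a distance that does retain the full covariate space can be computed directly. A self-contained sketch of the Mahalanobis distance for the two-covariate case (the 2x2 matrix inverse is written out by hand to avoid any library dependence):

```python
def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between two 2-vectors x and y, given the
    2x2 covariance matrix of the covariates. Unlike a propensity
    score, both covariate dimensions enter the comparison."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))  # inverse of cov
    dx = (x[0] - y[0], x[1] - y[1])
    d2 = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
          + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return d2 ** 0.5
```

Two trades that a propensity model collapses onto the same score can still sit far apart under this metric, which is precisely the information PSM discards.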

The Propensity Score Paradox and Model Dependence

A more damaging critique of PSM is what has been termed the “Propensity Score Paradox.” This paradox, identified by King and Nielsen (2019), reveals that in datasets that are already reasonably well-balanced, or as they become more balanced through the process of matching and pruning, PSM can actually increase imbalance, inefficiency, and model dependence. When the data is balanced, the propensity scores for the treated and control groups will naturally overlap significantly. In this situation, PSM begins to approximate random matching, as the small differences in propensity scores provide little information for creating meaningful pairs. This random pruning of observations can degrade the quality of the matches and reintroduce bias.

This paradox is deeply intertwined with the problem of model dependence. The calculation of the propensity score is entirely contingent on the specification of the underlying statistical model, typically a logistic regression. The choice of which covariates to include, and in what form (e.g. linear terms, interactions, polynomials), is left to the discretion of the analyst. This flexibility can lead to a wide range of different propensity scores from the same underlying data, each resulting in a different matched sample and, consequently, a different estimate of the trading strategy’s effect.

This creates a significant risk of “p-hacking” or specification searching, where an analyst, perhaps unconsciously, selects the model that produces the most favorable or statistically significant result. The very process designed to reduce bias becomes a potential source of it.
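The sensitivity is easy to demonstrate numerically. Below, two defensible specifications of the same assignment model, one linear in volatility and one with a quadratic term, assign materially different scores to the same trade and would therefore select different matched partners. All coefficients are hypothetical, chosen only to illustrate the mechanism:

```python
import math

def logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_linear(vol, liq):
    """Specification A: volatility enters linearly."""
    return logit(-1.0 + 60.0 * vol - 0.8 * liq)

def score_quadratic(vol, liq):
    """Specification B: adds a quadratic volatility term."""
    return logit(-0.5 + 20.0 * vol + 900.0 * vol ** 2 - 0.8 * liq)

# Same trade, two "reasonable" models, two different scores.
a = score_linear(0.04, 1.0)
b = score_quadratic(0.04, 1.0)
```

Neither specification is obviously wrong ex ante, yet they imply different matched samples and hence different effect estimates; that discretion is the opening for specification searching.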

The reliance on a single, model-dependent score can paradoxically increase imbalance and introduces a significant risk of researcher-induced bias.

The strategic implication for evaluating trading systems is profound. A backtest or performance analysis that relies on PSM may produce results that are not robust to minor changes in the model specification. This fragility undermines the confidence that can be placed in the evaluation.

An effective evaluation framework must seek to minimize, not amplify, the impact of subjective analytical choices. Methods that match directly on covariates, while not immune to researcher discretion, are less susceptible to the compounding errors introduced by an intermediate, and potentially misspecified, statistical model.

Comparative Analysis of Matching Methodologies

Propensity Score Matching (PSM)
  Core mechanism: Matches on a single scalar probability (the propensity score) derived from a model of the treatment assignment.
  Strengths in a trading context: Simple to apply; handles a large number of covariates through dimensional reduction.
  Limitations in a trading context: Highly sensitive to model specification; can increase imbalance (the PSM paradox); ignores information within the full covariate space.

Mahalanobis Distance Matching (MDM)
  Core mechanism: Matches on a multivariate distance metric that accounts for the correlations between covariates.
  Strengths in a trading context: Preserves the full information of the covariate space; generally achieves better balance than PSM.
  Limitations in a trading context: Can be computationally intensive with many covariates; may give undue weight to less important but highly variable covariates.

Coarsened Exact Matching (CEM)
  Core mechanism: Temporarily coarsens covariates into bins, performs exact matching on these bins, and then uses the original data for analysis.
  Strengths in a trading context: Guarantees balance on the coarsened covariates; reduces model dependence; does not rely on a propensity score model.
  Limitations in a trading context: The choice of coarsening levels is subjective; can lead to a significant reduction in sample size if covariates are not coarsened appropriately.

Inverse Probability Weighting (IPW)
  Core mechanism: Uses propensity scores to weight the control group to match the covariate distribution of the treated group, rather than discarding units.
  Strengths in a trading context: Retains all observations, increasing statistical efficiency; avoids the pruning paradox associated with PSM.
  Limitations in a trading context: Sensitive to propensity scores near 0 or 1, which produce extreme weights and inflate variance; still depends on correct model specification.
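To make the IPW entry concrete: for estimating the effect on the treated (ATT), treated units keep weight one while each control is reweighted by e/(1-e), where e is its propensity score. A minimal sketch:

```python
def ipw_att_weights(scores, is_treated):
    """Inverse-probability weights for the ATT.

    Treated units get weight 1; controls get e / (1 - e), which tilts
    the control sample toward the treated group's covariate profile.
    Scores near 1 produce extreme weights, which is the variance
    problem noted for IPW.
    """
    return [1.0 if treated else score / (1.0 - score)
            for score, treated in zip(scores, is_treated)]
```

Because no observations are discarded, IPW sidesteps the random-pruning behavior of PSM, but the weights inherit every defect of the propensity model that produced the scores.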


Execution

A Practical Demonstration of PSM Failure

To illustrate the practical consequences of PSM’s limitations, consider a simplified scenario for evaluating a high-frequency trading (HFT) strategy. The “treatment” is the application of the HFT algorithm, and the “control” is a standard volume-weighted average price (VWAP) execution. We have two key covariates: pre-trade market volatility (Vol) and order book liquidity (Liq).

A logistic regression is used to calculate the propensity score for using the HFT strategy. The table below shows the pre-matching data and the results after applying PSM.

Hypothetical PSM Results for HFT Strategy Evaluation

Pre-matching data:
  HFT (treated): mean propensity score 0.75; mean pre-trade volatility 0.025; mean pre-trade liquidity 500,000
  VWAP (control): mean propensity score 0.35; mean pre-trade volatility 0.015; mean pre-trade liquidity 1,500,000

Post-PSM data:
  HFT (treated): mean propensity score 0.75; mean pre-trade volatility 0.025; mean pre-trade liquidity 500,000
  VWAP (matched control): mean propensity score 0.74; mean pre-trade volatility 0.028; mean pre-trade liquidity 400,000

In this example, PSM successfully created a control group with a mean propensity score very close to the treated group. An analyst looking only at this metric might conclude that the matching was successful. However, a closer look at the individual covariates reveals a failure: the matched control group now has a higher average volatility and lower average liquidity than the treated group. Residual imbalance remains on both of these critical dimensions, and its direction has flipped, leaving the matched controls in a more adverse trading environment than the treated trades.

The evaluation would now be comparing the HFT strategy’s performance in its typical environment to the VWAP strategy’s performance in an even more challenging environment, likely leading to a biased and inflated estimate of the HFT strategy’s effectiveness. This illustrates how PSM can provide a false sense of security while actively degrading the quality of the comparison.
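This failure is exactly what a covariate-level standardized mean difference (SMD) check would catch. A sketch applied to the volatility figures from the hypothetical data, with standard deviations assumed purely for illustration (they are not given in the example):

```python
def standardized_mean_difference(mean_t, mean_c, sd_t, sd_c):
    """SMD = (mean_t - mean_c) / pooled SD; |SMD| < 0.1 is a common
    rule of thumb for acceptable balance."""
    pooled_sd = ((sd_t ** 2 + sd_c ** 2) / 2.0) ** 0.5
    return (mean_t - mean_c) / pooled_sd

# Volatility means from the hypothetical example; SDs of 0.008 assumed.
smd_pre = standardized_mean_difference(0.025, 0.015, 0.008, 0.008)
smd_post = standardized_mean_difference(0.025, 0.028, 0.008, 0.008)
```

Under these assumptions both values fail the 0.1 threshold, and the post-matching sign flip confirms that the matched controls now sit in the more volatile environment, despite the near-identical mean propensity scores.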

A More Robust Evaluation Protocol

A superior approach to strategy evaluation requires moving beyond a simplistic reliance on PSM. The following protocol outlines a more rigorous, multi-step process for causal inference in a trading context.

  1. Covariate Selection and Justification: The first step involves a careful selection of covariates based on market microstructure theory and domain expertise. Every variable included in the analysis should have a clear, justifiable link to both the strategy selection process and the potential outcome. This process should be documented before the analysis begins to reduce the risk of data-driven model selection.
  2. Assess Pre-Matching Imbalance: Before any matching is performed, the initial level of imbalance between the treated and control groups should be thoroughly assessed. This involves not just comparing means, but also examining higher moments of the distributions (e.g. variance, skewness) and visualizing the data through plots and histograms.
  3. Employ Multiple Matching Methods: Instead of relying solely on PSM, analysts should employ a variety of matching methods, including Mahalanobis Distance Matching and Coarsened Exact Matching. Comparing the results across different methods provides a crucial test of the robustness of the findings. If the estimated effect of the strategy varies significantly depending on the matching method used, the results should be treated with extreme caution.
  4. Post-Matching Balance Diagnostics: After matching, a comprehensive set of balance diagnostics must be performed on the full set of covariates. It is insufficient to simply check for balance on the propensity score. Standardized mean differences are a common metric, with an absolute value below 0.1 often considered a threshold for good balance. These checks must be performed for all covariates individually.
  5. Sensitivity Analysis: No matching method can account for unobserved confounders. A sensitivity analysis should be conducted to assess how strong the effect of an unobserved variable would need to be to alter the conclusions of the study. This provides a quantitative measure of the robustness of the results to potential hidden biases.

A rigorous evaluation framework must prioritize direct covariate balance and test the sensitivity of its conclusions to alternative methodological choices.

Essential Diagnostic Checks for Any Matching Application

Regardless of the specific matching method employed, a series of diagnostic checks is essential to ensure the validity of the results. These checks provide the necessary evidence that the matching process has created a dataset suitable for causal inference.

  • Common Support Assessment: This involves visualizing the distribution of propensity scores (or individual covariates for other methods) for both the treated and control groups. A lack of significant overlap indicates that the groups are too dissimilar to be meaningfully compared, and any conclusions drawn from the matched sample would rely heavily on extrapolation.
  • Standardized Mean Difference Plots: These plots, often called “Love plots,” provide a visual summary of the covariate balance before and after matching. For each covariate, the standardized mean difference between the treated and control groups is plotted. A successful matching procedure should result in post-match differences that are close to zero for all covariates.
  • Empirical Quantile-Quantile (Q-Q) Plots: Comparing the means of covariates is often not enough. Q-Q plots can be used to compare the entire distribution of each covariate between the treated and matched control groups, providing a more granular assessment of balance.
  • Post-Matching Outcome Model Specification: After achieving satisfactory balance, the treatment effect is estimated using the matched data. The choice of the outcome model itself should be justified. While a simple difference in means is common, a regression-adjusted approach that includes the covariates can improve precision and account for any residual imbalance.
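As a crude numeric companion to the common support visualization, the share of treated units whose score falls inside the control group's score range can be computed directly; a sketch:

```python
def common_support_fraction(treated_scores, control_scores):
    """Fraction of treated units whose propensity score lies inside
    the [min, max] range of the control scores. Values well below 1.0
    warn that many treated trades have no comparable control and any
    estimate for them rests on extrapolation."""
    lo, hi = min(control_scores), max(control_scores)
    inside = sum(1 for s in treated_scores if lo <= s <= hi)
    return inside / len(treated_scores)
```

A histogram or density plot remains the primary diagnostic; this number is only a quick summary of how severe the overlap problem is.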

References

  • King, Gary, and Richard Nielsen. “Why Propensity Scores Should Not Be Used for Matching.” Political Analysis, vol. 27, no. 4, 2019, pp. 435-454.
  • Abadie, Alberto, and Guido W. Imbens. “Matching on the Estimated Propensity Score.” Econometrica, vol. 84, no. 2, 2016, pp. 781-807.
  • Rosenbaum, Paul R., and Donald B. Rubin. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika, vol. 70, no. 1, 1983, pp. 41-55.
  • Stuart, Elizabeth A. “Matching Methods for Causal Inference: A Review and a Look Forward.” Statistical Science, vol. 25, no. 1, 2010, pp. 1-21.
  • Ho, Daniel E., et al. “Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference.” Political Analysis, vol. 15, no. 3, 2007, pp. 199-236.
  • Dehejia, Rajeev H., and Sadek Wahba. “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.” Journal of the American Statistical Association, vol. 94, no. 448, 1999, pp. 1053-1062.
  • Iacus, Stefano M., Gary King, and Giuseppe Porro. “Causal Inference without Balance Checking: Coarsened Exact Matching.” Political Analysis, vol. 20, no. 1, 2012, pp. 1-24.
  • Rubin, Donald B. “Matching Using Estimated Propensity Scores: Relating Theory to Practice.” Biometrics, vol. 52, no. 1, 1996, pp. 249-264.

Reflection

Beyond a Single Metric of Validity

The limitations of Propensity Score Matching serve as a critical reminder that no single statistical tool can be a panacea for the complex challenge of causal inference in financial markets. The allure of reducing a high-dimensional problem to a single, elegant score can obscure the nuances and potential pitfalls that lie beneath the surface. A robust system for evaluating trading strategies cannot be built upon a foundation of methodological shortcuts or a blind faith in any one technique. Instead, it requires a commitment to a multi-faceted validation process, one that embraces transparency, tests for robustness, and acknowledges the inherent uncertainties of observational data.

The true value of any analytical framework lies not in its ability to produce a single, definitive answer, but in its capacity to illuminate the conditions under which a conclusion holds and the degree of confidence that can be placed in it. Moving beyond PSM toward a more comprehensive suite of tools, including direct covariate matching, sensitivity analyses, and rigorous diagnostic checks, is a step toward building a more resilient and intellectually honest operational framework. The ultimate goal is to construct a system of intelligence where the methods of evaluation are as rigorously vetted as the strategies themselves, ensuring that decisions are based on a deep, systemic understanding of performance, rather than the artifacts of a flawed statistical model.

Glossary

Causal Inference

Meaning: Causal Inference represents the analytical discipline of establishing definitive cause-and-effect relationships between variables, moving beyond mere observed correlations to identify the true drivers of an outcome.

Selection Bias

Meaning: Selection bias represents a systemic distortion in data acquisition or observation processes, resulting in a dataset that does not accurately reflect the underlying population or phenomenon it purports to measure.

Propensity Score Matching

Meaning: Propensity Score Matching is a statistical methodology designed to reduce selection bias in observational studies by constructing a pseudo-randomized experimental design from non-randomized data.

Randomized Experiment

Meaning: A randomized experiment assigns treatment to units by chance, which balances observed and unobserved characteristics across groups; it is the benchmark that matching methods attempt to approximate with observational data.

Logistic Regression

Meaning: Logistic Regression is a statistical classification model designed to estimate the probability of a binary outcome by mapping input features through a sigmoid function.

Observed Covariates

Meaning: Observed covariates are the pre-treatment characteristics recorded in the data, such as volatility, liquidity, or momentum, on which matching or conditioning can be performed; factors outside this set remain potential confounders.

Unconfoundedness

Meaning: Unconfoundedness denotes a state where the assignment of a treatment or intervention, such as a specific execution algorithm or trading strategy, is independent of potential outcomes given a set of observed covariates.

Common Support

Meaning: Common support is the region of covariate or propensity score values in which both treated and untreated units occur with positive probability; outside this region, treated trades have no comparable controls and comparison relies on extrapolation.

Propensity Score

Meaning: The propensity score is the conditional probability of receiving the treatment given the observed pre-treatment covariates, typically estimated with a logistic regression of treatment assignment on those covariates.

Mahalanobis Distance Matching

Meaning: Mahalanobis Distance Matching pairs treated and control units by minimizing a multivariate distance that accounts for the scale and correlation structure of the covariates, preserving information that a scalar score discards.

Matching Methods

Meaning: Matching methods are preprocessing techniques that construct comparable treated and control samples from observational data, either by conditioning on a summary score or by matching directly on the covariates themselves.

Propensity Score Paradox

Meaning: The Propensity Score Paradox describes a counter-intuitive phenomenon in causal inference where conditioning on a variable, such as a propensity score, can inadvertently increase bias in the estimation of a treatment effect rather than reduce it.

Model Dependence

Meaning: Model Dependence describes the critical reliance of a financial outcome, such as a valuation, risk measure, or strategic decision, on the specific quantitative model employed for its derivation.

Model Specification

Meaning: Model specification is the analyst's choice of variables, functional forms, and interactions in a statistical model; in PSM, the specification of the treatment assignment model determines the propensity scores and, through them, the matched sample.

Control Group

Meaning: A Control Group represents a baseline configuration or a set of operational parameters that remain unchanged during an experiment or system evaluation, serving as the standard against which the performance or impact of a new variable, protocol, or algorithmic modification is rigorously measured.

Coarsened Exact Matching

Meaning: Coarsened Exact Matching is a robust non-parametric preprocessing methodology specifically engineered for causal inference within observational studies.

Covariate Balance

Meaning: Covariate balance refers to the statistical equivalence of the distribution of pre-intervention characteristics, or covariates, across different comparison groups within an analytical framework.