Better Think Thrice - Learning to Reason Causally with Double Counterfactual Consistency
LLMs are extremely good at statistics and quite bad at "true reasoning". No surprise.
There exists a gap in performance between tasks in which the LLM can rely on recall of statistical patterns—i.e., memorization—and tasks that require true reasoning ability.
This is most clearly seen with counterfactual questions — ones that challenge the LLM's priors.
When presented with counterfactual questions, for instance—questions in which some information is provided that potentially challenges the LLM’s prior world state—models exhibit degraded performance compared to similar questions that do not contain counterfactual information.
They propose an inference-time method to evaluate and guide a model's ability to reason causally, called double counterfactual consistency (DCC). I've tried this before independently.
DCC guides a model to (1) answer a standard reasoning question; (2) apply an intervention to produce a counterfactual version of that question; and (3) apply a second intervention that inverts the previous intervention, yielding a “double counterfactual” question that should have an answer equal to the original.
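The three steps can be sketched as a simple probing loop. This is a minimal illustration, not the paper's implementation: `ask_model` is a hypothetical stand-in for a real LLM call, wired with canned answers for the divisibility example so the sketch runs end-to-end.

```python
def ask_model(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned answers for this demo."""
    return {
        "Is 6 divisible by 2?": "Yes",
        "Suppose 6 is a prime number. Is 6 divisible by 2?": "No",
        "Suppose 6 is not a prime number after all. Is 6 divisible by 2?": "Yes",
    }.get(prompt, "Unknown")

def dcc_probe(question: str, intervention: str, inverse: str):
    """Run the original, counterfactual, and double-counterfactual prompts."""
    original = ask_model(question)                            # step (1)
    counterfactual = ask_model(f"{intervention} {question}")  # step (2): intervene
    double_cf = ask_model(f"{inverse} {question}")            # step (3): undo it
    consistent = original == double_cf                        # the DCC check
    return original, counterfactual, double_cf, consistent

print(dcc_probe(
    "Is 6 divisible by 2?",
    "Suppose 6 is a prime number.",
    "Suppose 6 is not a prime number after all.",
))
```

Note that only the first and third answers are compared; the intermediate counterfactual answer is free to disagree with the original.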
One advantage of this approach is that the model doesn't need to be right, just consistent.
An important advantage of DCC is that it does not rely on the inherent correctness of the model: that is, a model may be wrong on the original question but still display double counterfactual consistency if its answers to the original and double counterfactual questions agree (and vice versa)
This is critical in cases where strong base model performance may mask deficiencies in causal reasoning.
And they show that models with high accuracy are not necessarily good at this.
We find that DCC is not strictly correlated with a model’s base performance: models with higher accuracy on factual reasoning tasks do not necessarily achieve higher DCC. This suggests that DCC captures a distinct property of model behavior, complementary to standard accuracy metrics.
They operationalize this as both inference-time measure to filter out responses (rejection sampling criterion), and as a post-training reward.
DCC provides a natural criterion for inference-time rejection sampling. Given a question, if the model's answers to the original and double counterfactual prompts are inconsistent, the response can be rejected and the model resampled until double counterfactual consistency is achieved.
It turns out models quickly find agreement during inference if we keep probing, so using it as a filter does work.
Moreover, we find that in practice, agreement is reached fairly quickly: across all of our experiments, a mean of only 3.97 attempts is required to achieve agreement.
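The filter is just a bounded resampling loop. A hedged sketch, with `sample_answers` as a hypothetical stand-in for resampling the model at nonzero temperature (here it returns random answers so the loop demonstrably terminates):

```python
import random

def sample_answers(rng):
    """Stand-in for one resample of the original and double-CF answers."""
    return rng.choice(["Yes", "No"]), rng.choice(["Yes", "No"])

def dcc_rejection_sample(rng, max_attempts=20):
    """Resample until the original and double-counterfactual answers agree."""
    for attempt in range(1, max_attempts + 1):
        original, double_cf = sample_answers(rng)
        if original == double_cf:      # DCC holds: accept this response
            return original, attempt
    return None, max_attempts          # give up: no consistent sample found

answer, attempts = dcc_rejection_sample(random.Random(0))
```

With real models the paper's ~4 mean attempts suggests a small `max_attempts` budget suffices in practice.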
And use it for rewards:
By guiding the model through the DCC reasoning steps and rewarding instances where consistency holds, the model can be explicitly incentivized to learn both how to perform interventions and how to model counterfactual outcomes, reinforcing representations that capture underlying causal structure.
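As a reward, DCC reduces to a binary agreement check. A minimal sketch under the assumption that the original and double-counterfactual answers have already been parsed out of the rollout (the function name and normalization are illustrative, not from the paper):

```python
def dcc_reward(original: str, double_cf: str) -> float:
    """Reward 1.0 when the original and double-counterfactual answers agree,
    after trivial whitespace/case normalization."""
    return 1.0 if original.strip().lower() == double_cf.strip().lower() else 0.0

# A consistent pair earns the reward regardless of factual correctness:
assert dcc_reward("No", "no") == 1.0   # consistent, even if factually wrong
assert dcc_reward("Yes", "No") == 0.0  # inconsistent: no reward
```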
But it's susceptible to shortcuts when everything is done in one call, as expected:
However, we note that this implementation may also allow the model to learn certain shortcuts when DCC is used as a post-training reward. Because the model has simultaneous access to its predictions for both the original answer and the double counterfactual answer, it may learn simply that it is rewarded when the two answers are the same. We find that early stopping during training is key to preventing the model from overfitting and exploiting this shortcut.
Why do LLMs struggle with counterfactual questions?
For example, training data may contain factual questions such as: Q: Is 6 divisible by 2? A: Yes
If we instead pose a counterfactual question with an intervention that changes the premise, such as: Q: Suppose 6 is a prime number. Then is 6 divisible by 2?
Changing the world state (the intervention) is difficult, but then staying consistent with that change (the counterfactual) is even more difficult.
The model encounters two distinct challenges. First, it must correctly interpret the intervention itself—understanding that the question now operates in a modified world state. Second, it must infer the consequence of this intervention, i.e., the counterfactual outcome under the new assumptions
Creating counterfactual datasets is hard, so models trained on observational data are going to have this problem by default.
Counterfactual datasets are therefore useful diagnostic tools for exposing failures in causal reasoning—but constructing them at scale remains difficult. While limited counterfactual data can be generated programmatically, …
DCC diagnoses this ability by checking two things:
The model will find the two answers to be equivalent if it can
(i) correctly apply causal interventions, as the model must not only intervene but also act to undo that intervention, and
(ii) model the counterfactual outcome distribution, since the correctness of the double counterfactual depends on the correctness of the intermediate counterfactual