CUPID: Curating Data Your Robot Loves 💘
with Influence Functions

1Stanford University, 2University of Cambridge, 3NVIDIA Research, 4Toyota Research Institute
CUPID Walkthrough GIF

TL;DR

CUPID (CUrating Performance-Influencing Demonstrations) is a data curation method for robot imitation learning that uses influence functions to estimate the causal impact of each demonstration on a policy’s closed-loop performance. In other words, CUPID provides a principled way to (i) 🧹 filter existing training sets and (ii) 🎯 subselect helpful demos from new data.

Video

Abstract

In robot imitation learning, policy performance is tightly coupled with the quality and composition of the demonstration data. Yet, developing a precise understanding of how individual demonstrations contribute to downstream outcomes, such as closed-loop task success or failure, remains a persistent challenge. We propose CUPID, a robot data curation method based on a novel influence function-theoretic formulation for imitation learning policies. Given a set of evaluation rollouts, CUPID estimates the influence of each training demonstration on the policy’s expected return. This enables ranking and selection of demonstrations according to their impact on the policy’s closed-loop performance. We use CUPID to curate data by 1) filtering out training demonstrations that harm policy performance and 2) subselecting newly collected trajectories that will most improve the policy. Extensive simulated and hardware experiments show that our approach consistently identifies which data drives test-time performance. For example, training with less than 33% of curated data can yield state-of-the-art diffusion policies on the simulated RoboMimic benchmark, with similar gains observed in hardware. Furthermore, hardware experiments show that our method can identify robust strategies under distribution shift, isolate spurious correlations, and even enhance the post-training of generalist robot policies.

Approach

CUPID Method Overview

Curation Overview. After estimating a BC policy's expected return from a set of evaluation rollouts, CUPID ranks demonstrations by their estimated influence on this performance estimate and selects the top-k. Curating with CUPID thus yields a dataset of the demonstrations that most strongly influence closed-loop success.
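To make the selection step concrete, below is a minimal sketch of this rank-and-select loop in Python, assuming a per-demonstration array of influence scores (estimated as described in the FAQ) is already in hand; curate_top_k and its arguments are illustrative names, not the released CUPID API.

import numpy as np

def curate_top_k(demos, influences, k):
    """Keep the k demos with the highest estimated influence on expected return.

    demos:      list of training demonstrations
    influences: per-demo influence scores (higher = more helpful), shape (n,)
    k:          number of demonstrations to retain
    """
    order = np.argsort(influences)[::-1]  # sort descending by influence
    return [demos[i] for i in order[:k]]

Filtering an existing dataset and selecting from newly collected data are the same operation under this view: rank by influence and keep the top-k (e.g., retaining roughly a third of the demos matches the RoboMimic setting above).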

Robot Experiments

Improving Diffusion Policies by Debugging Training Data

Franka Diffusion Policy Performance

Franka Hardware Tasks

Franka real-world diffusion policy performance. CUPID, which curates demonstrations w.r.t. policy performance, improves success rates on mixed-quality datasets, identifies robust strategies, and disentangles spurious correlations that hinder performance. Although quality measures (e.g., DemInf, CUPID-Quality) help in mixed-quality settings (Figure-8), they degrade performance when higher-quality demonstrations induce brittle strategies at test time (TuckBox), or when quality is not the primary factor limiting policy success (Bookshelf). Overall, curating data based on performance (CUPID) maintains robustness across these settings. All data curation methods are limited to using 25 policy rollouts. Success rates are averaged over 25 rollouts.

Curated Post-Training for Vision-Language-Action (VLA) Models

Transferring Curated Datasets from Single-task Policies to Multi-task VLAs

Franka Performance

Transfer Performance

Franka real-world \( \mathbf{\pi_0} \) VLA performance. Data curated for single-task diffusion policies improves \( \mathbf{\pi_0} \) [1] post-training performance. This highlights a promising direction to alleviate the computational cost of CUPID in large-scale settings: use smaller, single-task policies to curate datasets for larger, multi-task VLAs. The results also suggest that scaling the pre-training of VLAs does not inherently enable them to leverage their generalist knowledge to, e.g., ignore low-quality behaviors or brittle strategies in demonstration data. That is, data curation still appears important for VLA post-training.


Diffusion Policy Rollouts

Figure-8: Improving Policy Performance in Mixed-Quality Regimes

Base Policy - All Demos

Demo-SCORE [2]

DemInf [3]

CUPID (Ours)

TuckBox: Identifying Robust Test-Time Strategies from Policy Failures

Base Policy - All Demos

Demo-SCORE [2]

DemInf [3]

CUPID (Ours)

Bookshelf: Disentangling Spurious Correlations in Demonstration Data

Base Policy - All Demos

Demo-SCORE [2]

DemInf [3]

CUPID (Ours)

\( \mathbf{\pi_0} \) VLA Rollouts

Figure-8: VLA Performance in Mixed-Quality Regimes

Fine-Tuned Policy - All Demos

CUPID (Ours)

TuckBox: VLA Strategies under Test-Time Distribution Shifts

Fine-Tuned Policy - All Demos

CUPID (Ours)


Frequently Asked Questions

Intuitively, CUPID uses influence functions to predictively answer counterfactual questions about the effect of each demonstration on downstream policy performance. For example, CUPID lets us answer questions like "How would my policy's expected return change if I upweighted (or, by negation, downweighted) the loss on a demonstration during training?", and it does so without retraining any policies, i.e., predictively. The figure below illustrates how answering such counterfactual questions helps us understand which demonstrations are helpful or harmful to the policy's performance; a minimal code sketch of the underlying computation follows it.

CUPID Teaser

CUPID Overview: Predictively answering counterfactual questions about downstream policy performance using influence functions.
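For concreteness, below is a minimal sketch of the classic first-order influence-function approximation behind this counterfactual, applied to a performance functional \( J(\theta) \) (the policy's expected return) rather than a validation loss. The explicit damped Hessian solve is for clarity only; at scale one would use Hessian-vector products. The names influence_on_return, grad_J, hess_loss, and grad_loss_i are illustrative, and the sketch is an approximation under these assumptions, not the paper's implementation.

import numpy as np

def influence_on_return(grad_J, hess_loss, grad_loss_i, damping=1e-3):
    """Predicted change in expected return J from upweighting demo i's loss.

    grad_J:      gradient of J w.r.t. the trained policy parameters, shape (d,)
    hess_loss:   Hessian of the mean training (BC) loss, shape (d, d)
    grad_loss_i: gradient of demo i's BC loss, shape (d,)

    Upweighting demo i by epsilon shifts the loss minimizer by roughly
    -epsilon * H^{-1} grad_loss_i; the chain rule then predicts the change
    in J without retraining: dJ/d(epsilon) = -grad_J^T H^{-1} grad_loss_i.
    """
    H = hess_loss + damping * np.eye(hess_loss.shape[0])  # damping for stability
    return -grad_J @ np.linalg.solve(H, grad_loss_i)      # > 0: demo helps J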

Franka diffusion policy curated dataset distributions for dataset filtering. CUPID filters out lower-quality demonstrations (Figure-8), brittle strategies (TuckBox), and spuriously correlated examples (Bookshelf), improving policy performance across tasks. While curation heuristics employed by baselines may be effective in some cases (e.g., DemInf and CUPID-Quality in Figure-8), they can lead to suboptimal pruning in others. We observe similar patterns in the data selection setting, where CUPID is used to select the most helpful demos from a new dataset (see paper for details).

Franka Diffusion Policy Figure-8 Distribution

(a) Figure-8: Distribution of curated demonstrations after filtering 66%. Higher-quality demos are better.

Franka Diffusion Policy TuckBox Distribution

(b) TuckBox: Distribution of curated demonstrations after filtering 66%. Pick-and-place demos are better.

Franka Diffusion Policy Bookshelf Distribution

(c) Bookshelf: Distribution of curated demonstrations after filtering 50%. Balanced data is better.

CUPID uses a REINFORCE-style estimator to compute the performance influence of each demonstration for curation. The accuracy of the estimated performance influences therefore depends on the number of policy rollouts. While REINFORCE often yields high-variance gradient estimates under limited rollout budgets (e.g., in RL contexts), our curation objective imposes a weaker fidelity requirement: since curation with CUPID involves top-k selection, it suffices to rank helpful demonstrations above harmful ones (requiring fewer rollouts) rather than to estimate performance influence precisely (requiring many rollouts). As shown below, the ranking of demonstrations stabilizes with approximately \( m \in [25, 50] \) rollouts on "Lift MH" and "Square MH," and \( m \in [50, 100] \) rollouts on "Transport MH." Accordingly, we use only \( m = 25 \) rollouts for our real-world Franka tasks. A minimal code sketch of the estimator follows the ablation figure below.

RoboMimic Ablation on Number of Rollouts

RoboMimic ablation: Data quality trends under a varying number of rollout trajectories, using image-based diffusion policies.
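To illustrate the estimator concretely, the sketch below computes a REINFORCE-style gradient of the expected return from \( m \) evaluation rollouts; this is the grad_J term that feeds the influence computation sketched in the previous answer. Here grad_logprob is a hypothetical helper returning \( \nabla_\theta \sum_t \log \pi_\theta(a_t \mid s_t) \) for one rollout, and the mean-return baseline is a standard variance-reduction choice rather than a detail taken from the paper.

import numpy as np

def reinforce_return_gradient(grad_logprob, rollouts, returns, use_baseline=True):
    """Monte Carlo estimate: grad_J ~ (1/m) * sum_j (R_j - b) * grad log pi(tau_j).

    grad_logprob: callable mapping a rollout to the gradient of its total
                  action log-probability under the current policy
    rollouts:     list of m evaluation rollouts tau_j
    returns:      array of m returns R_j (e.g., binary task success)
    """
    b = np.mean(returns) if use_baseline else 0.0  # baseline reduces variance
    grads = [(R - b) * grad_logprob(tau) for tau, R in zip(rollouts, returns)]
    return np.mean(grads, axis=0)  # average over the m rollouts

Because top-k curation only needs the ranking these estimates induce over demonstrations, a noisy gradient from a few dozen rollouts can already suffice, consistent with the \( m \in [25, 50] \) range above.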


BibTeX

@article{agia2025cupid,
  author       = {Agia, Christopher and Sinha, Rohan and Yang, Jingyun and Antonova, Rika and Pavone, Marco and Nishimura, Haruki and Itkina, Masha and Bohg, Jeannette},
  title        = {CUPID: Curating Data Your Robot Loves with Influence Functions},
  journal      = {arXiv preprint arXiv:2506.19121},
  year         = {2025}
}

References

[1] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., ... & Zhilinsky, U. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164.

[2] Chen, A. S., Lessing, A. M., Liu, Y., & Finn, C. (2025). Curating Demonstrations Using Online Experience. arXiv preprint arXiv:2503.03707.

[3] Hejna, J., Mirchandani, S., Balakrishna, A., Xie, A., Wahid, A., Tompson, J., ... & Sadigh, D. (2025). Robot Data Curation with Mutual Information Estimators. arXiv preprint arXiv:2502.08623.