Publications

Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Bootstrapping

Published in arXiv, 2025

Modern foundation models often undergo iterative "bootstrapping" in their post-training phase: a model generates synthetic data, an external verifier filters out low-quality samples, and the high-quality subset is used for further fine-tuning. Over multiple iterations the model's performance improves, raising a crucial question: how should the total budget for generation and training be allocated across iterations to maximize final performance? In this work, we develop a theoretical framework for analyzing budget allocation strategies. Specifically, we show that constant policies fail to converge with high probability, while increasing policies, particularly exponential growth policies, exhibit significant theoretical advantages. Experiments on image denoising with diffusion probabilistic models and math reasoning with large language models show that both exponential and polynomial growth policies consistently outperform constant approaches, with exponential policies often providing more stable performance.
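The policies under comparison are easy to state concretely. Below is a minimal, self-contained toy of the generate-filter-finetune loop under constant, polynomial, and exponential budget schedules; the model, verifier, and schedule constants are illustrative placeholders and not the paper's implementation.

```python
import random

def constant_schedule(i, base=100):
    # same generation budget at every iteration
    return base

def polynomial_schedule(i, base=100, degree=2):
    # budget grows polynomially with the iteration index
    return base * (i + 1) ** degree

def exponential_schedule(i, base=100, ratio=2.0):
    # budget multiplies by a fixed ratio every iteration
    return int(base * ratio ** i)

def verifier(sample):
    # stand-in for the external quality filter
    return sample > 0.5

def bootstrap(schedule, num_iters=5, seed=0):
    rng = random.Random(seed)
    quality = 0.0  # toy proxy for model capability
    for i in range(num_iters):
        budget = schedule(i)                                       # allocate this round's budget
        samples = [rng.random() + quality for _ in range(budget)]  # "generate" synthetic data
        kept = [s for s in samples if verifier(s)]                 # keep high-quality subset
        quality += 0.1 * len(kept) / budget                       # "fine-tune" on kept samples
    return quality

for name, sched in [("constant", constant_schedule),
                    ("polynomial", polynomial_schedule),
                    ("exponential", exponential_schedule)]:
    print(f"{name:11s} final quality: {bootstrap(sched):.3f}")
```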

Recommended citation: Yang, P., Feng, Y., Chen, Z., Wu, Y., & Li, Z. (2025). Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Bootstrapping. arXiv preprint arXiv:2501.18962 https://arxiv.org/pdf/2501.18962

State-observation augmented diffusion model for nonlinear assimilation

Published in arXiv, 2024

Data assimilation has become a crucial technique for combining physical models with observational data to estimate state variables. Traditional assimilation algorithms often struggle with the high nonlinearity of both the physical and observational models. In this work, we propose a novel data-driven assimilation algorithm based on generative models to address these concerns. Our State-Observation Augmented Diffusion (SOAD) model is designed to handle nonlinear physical and observational models more effectively. We derive the marginal posterior associated with SOAD and prove that it matches the true posterior under mild assumptions, demonstrating theoretical superiority over previous score-based assimilation works. Experimental results also indicate that our SOAD model may offer improved accuracy over existing data-driven methods.
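For readers unfamiliar with score-based assimilation, the following schematic shows the general flavor of the augmentation idea: stack state and observation channels into one variable, run reverse diffusion, and condition the observed channels on the data. The score function and update rule are toy placeholders; this is a sketch of the generic setup, not the actual SOAD algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_obs = 4, 2  # toy dimensions for state and observation

def score(z):
    # placeholder score network: pulls the augmented variable toward the origin
    return -z

def reverse_step(z, dt):
    # schematic reverse-diffusion update (not a calibrated SDE discretization)
    return z + score(z) * dt + np.sqrt(dt) * 0.1 * rng.standard_normal(z.shape)

def assimilate(y_obs, num_steps=100):
    # the augmented variable stacks state and observation channels;
    # the observed part is clamped to the data after every step
    z = rng.standard_normal(d_state + d_obs)
    dt = 1.0 / num_steps
    for _ in range(num_steps):
        z = reverse_step(z, dt)
        z[d_state:] = y_obs  # condition on the observation
    return z[:d_state]       # approximate posterior sample of the state

print(assimilate(np.zeros(d_obs)))
```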

Recommended citation: Li, Z., Dong, B., & Zhang, P. (2024). State-observation augmented diffusion model for nonlinear assimilation. arXiv preprint arXiv:2407.21314 https://arxiv.org/pdf/2407.21314.pdf

Latent assimilation with implicit neural representations for unknown dynamics

Published in Journal of Computational Physics, 2024

Data assimilation is crucial in a wide range of applications, but it often faces challenges such as high computational costs due to data dimensionality and incomplete understanding of underlying mechanisms. To address these challenges, this study presents a novel assimilation framework, termed Latent Assimilation with Implicit Neural Representations (LAINR). By introducing Spherical Implicit Neural Representations (SINR) along with a data-driven uncertainty estimator for the trained neural networks, LAINR improves the efficiency of the assimilation process. Experimental results indicate that LAINR holds certain advantages over existing AutoEncoder-based methods in terms of both accuracy and efficiency.
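A rough sketch of the latent-assimilation idea: a coordinate network decodes a latent code into field values, and fitting the latent to sparse observations plays the role of the assimilation update. The architecture and names below are illustrative assumptions, not the paper's SINR/LAINR code (which, among other things, operates on the sphere and includes an uncertainty estimator); in practice the decoder would be pre-trained on historical states rather than used untrained.

```python
import torch
import torch.nn as nn

class INRDecoder(nn.Module):
    """Toy implicit neural representation: (latent, coordinate) -> field value."""
    def __init__(self, latent_dim=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, latent, coords):
        # coords: (N, 2) observation locations; latent broadcast to every point
        z = latent.expand(coords.shape[0], -1)
        return self.net(torch.cat([z, coords], dim=-1)).squeeze(-1)

decoder = INRDecoder()                # stand-in for a pre-trained decoder
coords = torch.rand(100, 2)           # sparse observation locations
y_obs = torch.sin(coords.sum(-1))     # toy observed field values

# assimilation step: optimize the latent code to fit the observations
latent = torch.zeros(1, 16, requires_grad=True)
opt = torch.optim.Adam([latent], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = ((decoder(latent, coords) - y_obs) ** 2).mean()
    loss.backward()
    opt.step()
print(f"final misfit: {loss.item():.4f}")
```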

Recommended citation: Li, Z., Dong, B., & Zhang, P. (2024). Latent assimilation with implicit neural representations for unknown dynamics. Journal of Computational Physics, 112953 https://doi.org/10.1016/j.jcp.2024.112953

Learning to simulate partially known spatio-temporal dynamics with trainable difference operators

Published in arXiv, 2023

Recently, using neural networks to simulate spatio-temporal dynamics has received a lot of attention. However, most existing methods adopt pure data-driven black-box models, which have limited accuracy and interpretability. By combining trainable difference operators with black-box models, we propose PDE-Net++, a new hybrid architecture that explicitly embeds partial prior knowledge of the underlying PDEs. Furthermore, we introduce two distinct options for the difference operators: the trainable flipping difference layer (TFDL) and the trainable dynamic difference layer (TDDL). Numerical experiments demonstrate that PDE-Net++ achieves superior prediction accuracy and better extrapolation performance than black-box models.
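To make the hybrid idea concrete, here is a generic sketch of one time step that combines a trainable finite-difference stencil (a convolution initialized to a known operator) with a small black-box correction network. It illustrates the spirit of embedding difference operators into the architecture but is not the actual TFDL/TDDL construction.

```python
import torch
import torch.nn as nn

class HybridStep(nn.Module):
    """One explicit time step: trainable stencil for known PDE terms + black-box remainder."""
    def __init__(self, channels=1):
        super().__init__()
        # 3x3 trainable stencil, initialized to the 5-point Laplacian
        lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        self.stencil = nn.Parameter(lap.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.correction = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, channels, 3, padding=1),
        )
        self.channels = channels

    def forward(self, u, dt=0.01):
        # u_{t+1} = u_t + dt * (D u_t + N(u_t)): difference operator plus black box
        du = nn.functional.conv2d(u, self.stencil, padding=1, groups=self.channels)
        return u + dt * (du + self.correction(u))

u0 = torch.randn(1, 1, 32, 32)  # toy initial field
u1 = HybridStep()(u0)
print(u1.shape)
```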

Recommended citation: Huang, X., Li, Z., Liu, H., Wang, Z., Zhou, H., Dong, B., & Hua, B. (2023). Learning to simulate partially known spatio-temporal dynamics with trainable difference operators. arXiv preprint arXiv:2307.14395 https://arxiv.org/pdf/2307.14395.pdf