Abstract
The increasing complexity of machine learning models creates a growing demand for large and diverse datasets, while stricter data protection regulations restrict access to real data. Synthetic data, generated artificially by algorithms, offers a powerful solution to this problem and is becoming an essential tool in modern data science and artificial intelligence. Its use addresses the challenges of data scarcity, privacy protection, and sample imbalance. This paper provides an overview of modern synthetic data generation techniques and their applications in artificial intelligence and computer science, and discusses key challenges and directions for future research.