Poster
in
Workshop: Will Synthetic Data Finally Solve the Data Access Problem?

SyntheRela: A Benchmark For Synthetic Relational Database Generation

Martin Jurkovic ⋅ Valter Hudovernik ⋅ Erik Štrumbelj

Project Page [ Poster] [ OpenReview]

Abstract

Synthesizing relational databases has started to receive more attention from researchers, practitioners, and industry. The task is more difficult than synthesizing a single table due to the added complexity of relationships between tables. For the same reason, benchmarking methods for synthesizing relational databases introduces new challenges. Our work is motivated by a lack of an empirical evaluation of state-of-the-art methods and by gaps in the understanding of how such an evaluation should be done. We review related work on relational database synthesis, common benchmarking datasets, and approaches to measuring the fidelity and utility of synthetic data. We combine the best practices, a novel robust detection metric and relational deep learning utility, a novel approach to evaluating utility with graph neural networks, into a benchmarking tool. We use it to compare 6 open source methods over 8 real-world databases, with a total of 39 tables. The open-source SyntheRela benchmark is available on GitHub, alongside a public leaderboard.

Chat is not available.