

Poster in Workshop: Will Synthetic Data Finally Solve the Data Access Problem?

AN OPTIMAL CRITERION FOR STEERING DATA DISTRIBUTIONS TO ACHIEVE EXACT FAIRNESS

Mohit Sharma · Amit Jayant Deshpande · Chiranjib Bhattacharyya · Rajiv Ratn Shah


Abstract:

To fix the ‘bias in, bias out’ issue in fair machine learning, it is essential to obtain ideal training and validation data. Collecting ideal real-world data or generating ideal synthetic data requires a formal specification of an ideal distribution that would guarantee fair outcomes by downstream models. Previous work on fair pre-processing does not address this gap and could be significantly improved if it were resolved. We call a distribution ideal if the minimizer of any cost-sensitive risk on it is guaranteed to satisfy exact fairness (e.g., demographic parity, equal opportunity). Given any data distribution for fair classification, we formulate an optimization program to find its nearest ideal distribution in KL-divergence. This optimization is intractable as stated, but we show how it can be solved efficiently when the distributions come from well-known parametric families (e.g., normal, log-normal). We show empirically on synthetic datasets that our ideal distributions are close to the given distributions and that they can often suggest directions in which to steer the original distribution to improve accuracy and fairness simultaneously.
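
As a rough illustration of why parametric families help, the KL-divergence between two univariate normal distributions has a standard closed form, so a "nearest distribution" search can work on a few parameters instead of on arbitrary densities. The sketch below is not the authors' optimization program: the function name `kl_normal` and the example parameters are assumptions made here for illustration, and the abstract does not say in which direction the KL-divergence is taken.

```python
import math


def kl_normal(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) for P = N(mu_p, sigma_p^2) and Q = N(mu_q, sigma_q^2).

    Standard closed form:
        log(sigma_q / sigma_p)
        + (sigma_p^2 + (mu_p - mu_q)^2) / (2 * sigma_q^2)
        - 1/2
    """
    if sigma_p <= 0 or sigma_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
        - 0.5
    )


# Hypothetical example: a given feature distribution and a candidate
# distribution it might be steered toward (values chosen arbitrarily).
given = (0.0, 1.0)      # (mean, standard deviation)
candidate = (1.0, 1.0)  # (mean, standard deviation)

forward = kl_normal(*given, *candidate)   # KL(given || candidate)
reverse = kl_normal(*candidate, *given)   # KL(candidate || given)
```

KL-divergence is asymmetric in general, so the two directions usually give different values once the variances differ. Any search for a nearest distribution therefore has to fix one direction up front.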
