

Oral in Workshop: 3rd ICLR Workshop on Machine Learning for Remote Sensing

Balancing quantity and representativeness in constrained geospatial dataset design

Livia Betti · Esther Rolf


Abstract:

Effective geospatial machine learning (GeoML) relies on high-quality labeled datasets, but geospatial data collection is often costly and logistically challenging. Creating a new geospatial dataset frequently requires on-site labeling, for example through surveys or scientific instruments. The cost of these methods varies across regions, which makes it difficult to gather representative ground-referenced data under a budget constraint. Because GeoML models require large datasets to perform well, effective data collection must ensure both representativeness and size. We propose a sampling method that jointly maximizes dataset size and representative composition subject to cost constraints. We evaluate our method in simulation studies by training GeoML models on the optimized subsets, and find that it outperforms random-sampling baselines. Our findings underscore the competing priorities of representation and dataset size, and provide evidence of settings in which one of these factors matters more than the other. Looking forward, our results highlight the value of further research into how sampling strategies can enhance model performance.
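
The trade-off between dataset size and representativeness under region-dependent costs can be illustrated with a small sketch. This is not the authors' method, which the abstract does not specify. It is a hypothetical greedy allocation that assumes a weighted objective with weight `lam`, total variation distance as the measure of representativeness, and made-up region names and per-sample costs.

```python
from typing import Dict


def objective(counts: Dict[str, int], target: Dict[str, float],
              lam: float, size_scale: float) -> float:
    """Weighted sum of normalized size and representativeness (assumed form)."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    # Total variation distance between sample shares and target shares.
    tv = 0.5 * sum(abs(counts[g] / n - target[g]) for g in target)
    return lam * (n / size_scale) + (1.0 - lam) * (1.0 - tv)


def greedy_allocate(costs: Dict[str, float], target: Dict[str, float],
                    budget: float, lam: float = 0.5) -> Dict[str, int]:
    """Add one sample at a time from the region that most improves the objective."""
    # Largest dataset the budget could buy, used to normalize the size term.
    size_scale = budget / min(costs.values())
    counts = {g: 0 for g in costs}
    spent = 0.0
    while True:
        affordable = [g for g in costs if spent + costs[g] <= budget]
        if not affordable:
            break

        def gain(g: str) -> float:
            trial = dict(counts)
            trial[g] += 1
            return objective(trial, target, lam, size_scale)

        # Ties are broken in favor of the cheaper region.
        best = max(affordable, key=lambda g: (gain(g), -costs[g]))
        if gain(best) <= objective(counts, target, lam, size_scale):
            break  # no single additional sample improves the objective
        counts[best] += 1
        spent += costs[best]
    return counts


# Hypothetical setting: rural samples cost four times as much as urban ones.
costs = {"urban": 1.0, "rural": 4.0}
target = {"urban": 0.5, "rural": 0.5}
for lam in (0.0, 0.5, 1.0):
    print(lam, greedy_allocate(costs, target, budget=20.0, lam=lam))
```

In this toy setting, `lam=1.0` spends the whole budget in the cheapest region, while `lam=0.0` stops as soon as the sample matches the target shares. Intermediate values of `lam` trade one priority against the other.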
