Earlier this month, at the 9th International Conference on Fog and Mobile Edge Computing (FMEC 2024), our very own AI engineer, Gustav Nilsson, presented his paper "The Role of the Data Quality on Model Efficiency: An Exploratory Study on Centralized and Federated Learning." Gustav wrote the paper in collaboration with Imagimob as part of his studies in AI and Machine Learning toward a Master of Science in Engineering at Blekinge Institute of Technology (BTH).
On how the paper came about, Gustav says: "The core idea of the paper was a result of discussions with people at Imagimob, and they supported me throughout the whole process!"
The full paper will be published shortly, so stay tuned: we will share the link on our blog and on our LinkedIn in the near future. For now, you can read the abstract.
This paper experimentally investigates the impact that datasets of varying quality have on centralized vs. federated learning models. We also investigate how the distribution of low-quality data across federated clients affects the models' accuracy. For the experiments, we create datasets of progressively lower quality with respect to two data quality metrics: data accuracy and data completeness. This is done by perturbing (i.e., modifying) the datasets so as to degrade these two metrics. Three experiments are then conducted that investigate: i) the impact of decreased data accuracy on the models' performance, ii) the impact of decreased data completeness, and iii) the effects of different distributions of low-quality data across the clients in the federated learning setup. The results reveal that the centralized model achieves 60.3% validation accuracy with low data accuracy and 58.7% with low data completeness, while the federated model performs better, achieving 69.3% validation accuracy with low data accuracy and 79.2% with low data completeness. The federated model is less affected by low data quality when that quality is distributed evenly between its clients; an uneven distribution of data quality between clients has a more negative impact. Overall, the federated learning setup displays certain attributes that make it more robust to low-quality data than centralized learning.
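The paper's exact perturbation procedure is not reproduced here, but the two degradations it describes, lowering data accuracy and lowering data completeness, plus the even split of data across federated clients, can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the paper's implementation; the function names, fractions, and even-split strategy are ours:

```python
import random

def reduce_accuracy(labels, fraction, num_classes, seed=0):
    """Lower data accuracy by flipping a fraction of labels to a wrong class."""
    rng = random.Random(seed)
    labels = list(labels)
    for i in rng.sample(range(len(labels)), int(fraction * len(labels))):
        wrong = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(wrong)
    return labels

def reduce_completeness(samples, fraction, seed=0):
    """Lower data completeness by blanking a fraction of feature values."""
    rng = random.Random(seed)
    samples = [list(s) for s in samples]
    cells = [(i, j) for i, s in enumerate(samples) for j in range(len(s))]
    for i, j in rng.sample(cells, int(fraction * len(cells))):
        samples[i][j] = None  # missing value
    return samples

def split_even(dataset, num_clients):
    """Even distribution: each federated client gets an equal share,
    so perturbed samples are spread uniformly across clients."""
    return [dataset[i::num_clients] for i in range(num_clients)]
```

An uneven distribution, by contrast, would concentrate the perturbed samples on a subset of clients, which the abstract reports hurts the federated model more than an even spread does.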
Be sure to subscribe to our newsletter so you don't miss the paper in its entirety!