Laboratoire d'Informatique de Paris-Nord
Journal article, International Journal of Computer Information Systems and Industrial Management Applications, Year: 2022

Detecting Near Duplicate Dataset with Machine Learning

Marc Chevallier
Nicoleta Rogovschi
  • Role: Author
  • PersonId : 840559
Faouzi Boufarès
  • Role: Author
  • PersonId : 834290
  • IdRef : 030896916
Nistor Grozavu
Charly Clairmont
  • Role: Author
  • PersonId : 1149709

Abstract

This paper introduces the concept of a near-duplicate dataset: a quasi-duplicate version of a dataset that has undergone an unknown number of row and column insertions and deletions (modifications to both schema and instance). The concept is relevant to data exploration, data integration and data quality. Two parameters are introduced to formalise these insertions and deletions. Our technique for detecting near-duplicate datasets is based on feature extraction and machine learning. The method is original in that it does not rely on classical column-to-column comparison techniques but on comparing metadata vectors that summarise the datasets. To train the machine learning algorithms, we introduce a method for artificially generating training data. We perform several experiments to determine the best parameters for creating training data and to evaluate the performance of several classifiers. In the cases studied, these experiments yield an accuracy above 95%.
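
As a rough illustration of the idea summarised in the abstract, the following Python sketch generates artificial near duplicates of toy numeric tables and compares datasets through metadata vectors rather than column by column. It is not the authors' implementation: the function names, the choice of summary statistics, and the `row_rate` and `col_rate` arguments (hypothetical stand-ins for the two parameters mentioned above) are all assumptions, and only deletions are simulated.

```python
import random
import statistics


def make_table(n_rows, n_cols, loc, rng):
    """Random numeric table: column name -> list of values (toy data)."""
    return {
        f"c{j}": [rng.gauss(loc + j, 1.0) for _ in range(n_rows)]
        for j in range(n_cols)
    }


def near_duplicate(table, row_rate, col_rate, rng):
    """Copy `table` while deleting a random share of rows and columns.

    row_rate / col_rate are hypothetical stand-ins for the paper's two
    parameters. Insertions are left out to keep the sketch short.
    """
    n_rows = len(next(iter(table.values())))
    # Always keep at least one row and one column.
    keep_rows = [i for i in range(n_rows) if rng.random() >= row_rate] or [0]
    cols = list(table)
    keep_cols = [c for c in cols if rng.random() >= col_rate] or cols[:1]
    return {c: [table[c][i] for i in keep_rows] for c in keep_cols}


def metadata_vector(table):
    """Fixed-length summary of a table (hypothetical feature choice)."""
    means = [statistics.fmean(v) for v in table.values()]
    spreads = [statistics.pstdev(v) for v in table.values()]
    return [min(means), statistics.median(means), max(means),
            statistics.median(spreads)]


def pair_features(a, b):
    """Absolute difference between the metadata vectors of two tables."""
    return [abs(x - y) for x, y in zip(metadata_vector(a), metadata_vector(b))]


def make_training_set(n_pairs, rng, row_rate=0.2, col_rate=0.2):
    """Labelled pairs: 1 = near duplicate, 0 = unrelated table."""
    X, y = [], []
    for _ in range(n_pairs):
        base = make_table(200, 6, rng.uniform(-50, 50), rng)
        other = make_table(200, 6, rng.uniform(-50, 50), rng)
        dup = near_duplicate(base, row_rate, col_rate, rng)
        X.append(pair_features(base, dup))
        y.append(1)
        X.append(pair_features(base, other))
        y.append(0)
    return X, y


rng = random.Random(0)
X, y = make_training_set(50, rng)
# X (feature rows) and y (labels) can be passed to any classifier.
```

Each row of `X` describes a pair of datasets, so a classifier trained on `X` and `y` decides whether two datasets are near duplicates without ever aligning their columns. Which classifiers and which metadata features work best is the question the paper's experiments address.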
Main file
Detecting_near_duplicate_dataset_with_machine_learning_IJCISIM3 (1).pdf (1.56 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03722301, version 1 (13-07-2022)

Identifiers

  • HAL Id: hal-03722301, version 1

Cite

Marc Chevallier, Nicoleta Rogovschi, Faouzi Boufarès, Nistor Grozavu, Charly Clairmont. Detecting Near Duplicate Dataset with Machine Learning. International Journal of Computer Information Systems and Industrial Management Applications, 2022, 14, pp.374-385. ⟨hal-03722301⟩