(Semi)automated disambiguation of scholarly repositories

Authors: Miriam Baglioni, Andrea Mannocci, Gina Pavone, Michele De Bonis, Paolo Manghi

The full exploitation of scholarly repositories is pivotal in modern Open Science, and scholarly repository registries are kingpins in enabling researchers and research infrastructures to list and search for suitable repositories. However, since multiple registries exist, repository managers tend to register the repositories they manage in several of them to maximise their traction and visibility across different research communities, disciplines, and applications.

These multiple registrations ultimately lead to information fragmentation and redundancy on the one hand and, on the other, force registries’ users to juggle multiple registries, profiles and identifiers describing the same repository. Such problems are known to registries, which claim equivalence between repository profiles whenever possible by cross-referencing their identifiers across different registries.

However, as we will see, this "claim set" is far from complete and, therefore, many replicas slip under the radar, possibly creating problems downstream.

In this work, we combine such claims to create duplicate sets and extend them with the results of an automated clustering algorithm run over repository metadata descriptions. Then we manually validate our results to produce an "as accurate as possible" de-duplicated dataset of scholarly repositories.
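
As a rough illustration of how equivalence claims can be combined into duplicate sets, the sketch below treats each claim as an edge between two repository identifiers and groups connected identifiers with a union-find structure. This is an assumption about one plausible way to do it, not the authors' implementation: the function name `build_duplicate_sets`, the identifier format, and the sample pairs are all hypothetical, and the pairs standing in for clustering output are made up rather than produced by any real clustering algorithm.

```python
from collections import defaultdict


def build_duplicate_sets(pairs):
    """Group identifiers linked, directly or transitively, by equivalence pairs.

    `pairs` is an iterable of (id_a, id_b) tuples. Hypothetical helper:
    the paper's actual procedure may differ.
    """
    parent = {}

    def find(x):
        # Register unseen identifiers as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Only groups with more than one identifier are duplicates.
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical identifiers of the form "<registry>:<local id>".
registry_claims = [("regA:1", "regB:7"), ("regB:7", "regC:3")]
# Hypothetical pairs standing in for the output of a clustering step.
clustering_pairs = [("regC:9", "regA:4"), ("regC:3", "regD:2")]

duplicate_sets = build_duplicate_sets(registry_claims + clustering_pairs)
for group in sorted(sorted(g) for g in duplicate_sets):
    print(group)
```

In this toy input the clustering pair linking `regC:3` to `regD:2` extends a set that the registry claims alone would have left incomplete, which mirrors the idea of extending the claim-based sets. A manual validation step, as described above, would still be needed before trusting any merged set.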

URL: https://arxiv.org/abs/2307.02647