At DOAJ, we work hard to maintain a high level of recency and accuracy in our metadata. All of our metadata is freely available, in various formats, to those who want it. This means that any errors in it get distributed freely around the web. To reduce these and negate the knock-on effect, DOAJ works with its technical partners, Cottage Labs, to clean the metadata.
On 21st February, we will be releasing two small but fairly important enhancements to our article upload function. The two changes are as follows:
Spaces will be stripped from DOIs and full text URLs upon ingest.
This is to improve matching in our database on DOIs and URLs. We use DOIs and full text URLs to version articles, thereby allowing corrections or enhancements to article metadata to be uploaded without the existing version being deleted first.
We regularly receive metadata with badly formatted URLs or DOIs that contain leading spaces, trailing spaces or spaces right in the middle. When this happens, matching doesn’t occur and we end up with multiple versions of the same article in the database, which increases the number of duplicates.
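As a rough illustration of the kind of clean-up involved, here is a minimal Python sketch. It is not DOAJ's actual ingest code: the function name `strip_spaces` is hypothetical, and we assume that removing every whitespace character is the intended behaviour.

```python
import re

# Matches any run of whitespace: leading, trailing or in the middle.
_WHITESPACE = re.compile(r"\s+")


def strip_spaces(identifier: str) -> str:
    """Return the DOI or full text URL with all whitespace removed.

    Hypothetical helper for illustration only.
    """
    return _WHITESPACE.sub("", identifier)


# Two differently spaced copies of the same DOI now compare as equal.
first = strip_spaces(" 10.1234/ abc.5678")
second = strip_spaces("10.1234/abc.5678 ")
```

With the spaces gone, both values are identical strings, so an exact-match lookup on the DOI or URL can find the existing article and update it rather than create a second copy.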
Duplicates in the same file will be prevented upon upload.
To upload article metadata to us, a file of article metadata must be sent to our article ingester. We will introduce an enhancement here which will prevent a file from uploading if duplicates within it are detected. We carry out no such checks at the moment.
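To give a sense of what such a check might look like, here is a minimal Python sketch. It is an assumption-laden illustration, not the real ingester: the function name `find_duplicates` and the record keys `doi` and `fulltext_url` are hypothetical, and we assume two records count as duplicates when they share a DOI or a full text URL once spaces are removed.

```python
def find_duplicates(articles):
    """Return (first_index, later_index, field) for each repeated identifier.

    Hypothetical helper: `articles` is a list of dicts that may carry
    "doi" and "fulltext_url" keys (assumed names, for illustration only).
    """
    seen = {}        # (field, cleaned value) -> index of first record using it
    duplicates = []
    for index, article in enumerate(articles):
        for field in ("doi", "fulltext_url"):
            value = article.get(field)
            if not value:
                continue  # a missing identifier cannot be matched
            # Remove all whitespace so stray spaces do not hide a match.
            key = (field, "".join(value.split()))
            if key in seen:
                duplicates.append((seen[key], index, field))
            else:
                seen[key] = index
    return duplicates


batch = [
    {"doi": "10.1234/abc.1", "fulltext_url": "https://example.org/a1"},
    {"doi": "10.1234/abc.2", "fulltext_url": "https://example.org/a2"},
    {"doi": " 10.1234/abc.1", "fulltext_url": "https://example.org/a3"},
]

# Reject the whole file if any duplicate is found within it.
file_is_acceptable = not find_duplicates(batch)
```

In this example the first and third records share a DOI once the stray space is removed, so the batch would be rejected and the uploader could be told which records clash.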
Both enhancements will be implemented in all three article ingest front ends: the XML uploader, the manual article uploader, and the API.
These two changes are small steps toward a larger project: eradicating all duplicated content in the database. We don’t yet know how much article content is duplicated, but it will be enough to cause a noticeable reduction in the current number of articles in DOAJ (3,767,076 articles at the time of writing).
If you have any questions about these enhancements or are wondering if you have duplicates in your own article metadata, do please leave a comment here.