News Service

News, updates & developments from DOAJ

Metadata

Are you using an up-to-date version of DOAJ metadata? Apparently not.

Did you know that DOAJ’s metadata refreshes every hour? Given how many changes we, the DOAJ Team, make to the database on a weekly basis plus all the article metadata which is continuously being uploaded to us, the hourly update is absolutely necessary. DOAJ is seeing more use of its API than ever before (2018 total: 266,255,000 hits; 2017 total: 187,212,674) and most of that is by publishers uploading articles.

DOAJ metadata is EVERYWHERE. It is incorporated into all the major discovery services, other online databases and many library catalogues.

You might remember this post of mine from 2017, where I explained that third parties using our data may not have a recent copy of it. We know for a fact that some systems downstream of our data only update the copy every 6 months and, sadly, we get many emails from end users asking why they can’t find an article.

So if you are a researcher, reading this and wondering what happened to that article in DOAJ which is listed in your library catalogue or discovery service, contact your librarians and ask them to update their metadata. If you’re a librarian then do please reach out to sources responsible for your metadata and get them to update their DOAJ metadata.

If you have any questions about this or need some help in getting a more recent copy of our metadata then do leave a comment here or contact us at feedback@doaj.org.

Posted on 08/01/2019 (updated 11/01/2019) by DOAJ (Directory of Open Access Journals) Posted in Advice and best practice, Metadata, Using the DOAJ Tagged API, article metadata, journal metadata, metadata 1 Comment

Infrastructure, and why sustainable funding is so important to services like DOAJ

You can’t run a service like DOAJ on a shoestring or fund it hand-to-mouth (which is how DOAJ was funded until last year), so large amounts of sustained funding are crucial to the database’s security, stability, scalability and, ultimately, survival. Last month, DOAJ came under attack, and the whole process, from diagnosing the problem to resolution, revealed just how fragile the DOAJ ecosystem can be under those circumstances.

Below is a guest blog post by Steven Eardley, the DevOps Engineer at our technology partner, Cottage Labs. You may remember that the SCOSS initiative was launched last year to provide DOAJ and Sherpa/RoMEO with levels of sustainable funding that would ensure the survival of two services earmarked as vital to the open access movement. I asked Steven to write this post because, after DOAJ took a battering, I thought it would be useful for our community to understand exactly what happens when DOAJ becomes unresponsive, what it takes to keep DOAJ up and running, and what a difference sustainable funding can make for a service like DOAJ.

Thanks for reading!

Dom


The Directory of Open Access Journals experienced sustained and repeated downtime events from mid-July to mid-August, which required significant intervention from Cottage Labs. This post describes what happened, explains how we mitigated the issue, and illustrates some of the inherent vulnerabilities that need to be addressed.

The system
The main DOAJ database is built on ElasticSearch, which we run across a number of virtual machines. The DOAJ web application is built on the Flask web framework and written in Python.

As well as the search interface on the website, searches of our database come from the OAI-PMH interface, our Atom feed, and ‘widgets’ embedded on users’ sites. The DOAJ also has an API for programmatic interactions with our content, allowing, for example, publishers to upload their content. To keep the ElasticSearch index performing correctly for the rest of the site, we tend to tweak the rate limit for the API during periods of high load.
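The tunable rate limit described above can be sketched as a simple token bucket whose refill rate is turned down during periods of high load. This is purely illustrative (names and numbers are ours, not DOAJ's actual rate limiter):

```python
import time

class TokenBucket:
    """A minimal token-bucket rate limiter with a tunable refill rate."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second (lowered under load)
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Spend one token if available; refill based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Setting `rate` lower effectively throttles API clients without cutting them off entirely.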

The problem
The DOAJ website communicates with its database (“data store”) via the web (also known as the ‘query endpoint’), giving us control over what kinds of requests are permitted and the ability to filter responses to prevent data from being leaked unintentionally. Not all of the requests to ElasticSearch come from our own servers, however, because the DOAJ search pages use JavaScript which runs in users’ browsers. This means that the data store must be publicly accessible for our search pages to work. We keep the rate limit for this fairly high so pages are responsive.
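The two jobs of the query endpoint, deciding whether a request is permitted and scrubbing responses before they leave the server, can be sketched in a few lines. The field and key names here are hypothetical illustrations, not DOAJ's actual code:

```python
# Fields that should never leak out through the public search interface.
PRIVATE_FIELDS = {"admin", "suggestion", "owner"}

# Only plain search bodies with these top-level keys are permitted.
ALLOWED_QUERY_KEYS = {"query", "from", "size", "sort"}

def is_permitted(body):
    """Allow only well-formed search bodies with whitelisted keys."""
    return (
        isinstance(body, dict)
        and "query" in body
        and set(body) <= ALLOWED_QUERY_KEYS
    )

def scrub(hit):
    """Drop private fields from a search hit before returning it."""
    return {k: v for k, v in hit.items() if k not in PRIVATE_FIELDS}
```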

During the period mentioned, we saw an elevated request rate via the query endpoint which resulted in all of the site functions slowing down. Eventually the index was overloaded and the DOAJ website stopped responding. Recovery tended to be fairly quick and site performance would be restored for a number of hours. However, the downtime was persistent and recurring, sometimes multiple times in a day.

The cause
While we could see that the quantity of queries to our index had at least doubled from its normal rate, we couldn’t identify the source of the troublesome queries or look for long-term trends. We will find resources to improve this in the future, but monitoring of this kind is costly and currently out of budget. Therefore, we mainly had to diagnose via inference. Our immediate action was to disable the application components one by one, endeavouring to pinpoint the source of the excess load. We turned off the API, the Atom feed, the back-end editorial admin pages, and finally the OAI-PMH interface. The latter had a small impact on index stability: we saw reduced memory use in the index and it coped somewhat better with the load. This pointed us towards the source of the problem: deep paging.

Deep Paging – ElasticSearch’s kryptonite
Deep paging essentially means that someone or something is scrolling through a large number of objects sequentially, leading to high memory commitment; the server must hold the entire context in memory to get deeper and deeper into the results. We determined that someone was sending thousands of requests per hour directly via the query endpoint, instead of through the API, and furthermore essentially trawling through all results over a long period of time. That is to say: they weren’t using a provided feature but bypassing the features and using a hidden route.
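The memory cost grows with paging depth because, to serve page N of an offset-based (`from` + `size`) search, each shard must collect and sort every hit up to that offset before discarding all but the last page. A rough illustration:

```python
def docs_held_per_shard(page, page_size):
    """Documents each shard must collect and sort to serve a given page
    of an offset-based (from + size) search."""
    offset = (page - 1) * page_size
    return offset + page_size

# Page 1 of 10 results: each shard sorts 10 candidates.
# Page 10,000 of 10 results: each shard sorts 100,000 candidates,
# and a full trawl of the dataset repeats that cost on every request.
```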

Each time the system went down, the traffic took a while to resume, presumably because the trawl had to be started again from the beginning.

The solution
Our next task was therefore to block these external sources of deep paging. First, we changed the permitted referral parameter (e.g. `ref=ui` in the request URL) and the traffic dropped off. Success? Not quite. These parameters are easy to spoof, so it was no surprise that, not long after, the site went down once again amid further high traffic. We tried the same configuration change again, this time changing the referral parameter to `please-use-our-api` – the ‘begging’ approach to reducing system load! Unfortunately that bought us even less time than before.

At this point we categorised this as something like a denial-of-service attack. The chances are it wasn’t intentionally malicious, but circumventing our countermeasures and being a direct source of instability for thousands of other users is at least inconsiderate!

User agent blocking was the next measure we attempted to reduce the load. Since the query route is really just for our use, we decided it was reasonable to block some programmatic user agents, such as the Node.js http module, or the wget and curl utilities. Again we saw a temporary alleviation of the high load.
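A user-agent screen of this kind boils down to a substring check against known programmatic clients. The substrings below are examples of the kinds of agents mentioned above, not an exact copy of the rules we used:

```python
# Hypothetical user-agent screen for a browser-only route.
BLOCKED_AGENT_SUBSTRINGS = ("node", "wget", "curl", "python-requests")

def agent_allowed(user_agent):
    """Reject requests whose User-Agent looks like a programmatic client."""
    if not user_agent:
        return False  # a missing UA header is suspicious on a browser route
    ua = user_agent.lower()
    return not any(s in ua for s in BLOCKED_AGENT_SUBSTRINGS)
```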

Of course, user agent strings are sent by the client with each request, so they can be altered at source. By now we’d identified the user’s IP address, and watched a few intermittent requests come in as the user iterated and figured out what we’d changed. Within a couple of hours, the torrent of requests resumed and our infrastructure once again creaked under the strain.

Since we’re not keen on IP banning and other heavy-handed measures we decided to address the problem within our application code and give the query endpoint some rules to filter out the sorts of traffic that give the index trouble: we now cap the results to a reasonable number expected to be viewed from the search page; we enforce paging limits to disallow queries that page a long way into our data set; and the query endpoint will reject a request that doesn’t look like it came from a user.
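The paging rules described above amount to a small validation step before any query reaches the index. The limits here are illustrative placeholders; the real caps are whatever suits the search page:

```python
MAX_SIZE = 100          # more results than anyone views on one search page
MAX_PAGE_DEPTH = 1000   # deepest record a query may reach (from + size)

def query_allowed(from_=0, size=10):
    """Reject result counts and paging depths the search UI never needs."""
    if size > MAX_SIZE:
        return False            # cap results to a reasonable page size
    if from_ + size > MAX_PAGE_DEPTH:
        return False            # disallow paging deep into the data set
    return True
```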

The lesson
We were a little complacent about our query endpoint’s obscurity. For a long time we’d planned to tighten up the rules on the search interface but, with DOAJ’s long list of other developments, it wasn’t a priority. Our endpoint was open and available to function like a little bonus undocumented API and this was stretched to the limit when it was being used to harvest all of our data rather than just facilitating search functions. This is a lesson for the design stage of the software process: keep the roles of system components as distinct and single-purpose as feasible.

In addition, this persistent grab of our data constituted a feature request – it should be easier to download our entire DOAJ dataset and we will be implementing that over the coming weeks.

The improvements
We have a handful of improvements coming soon in response to these lessons:

  • We’ll be re-writing our OAI-PMH interface to mitigate deep paging and high memory use.
  • The Search API will also see some more restrictions to paging depth and number of results – this will more closely reflect its role as a discovery interface and not a harvest endpoint.
  • We’ll create a dump of our entire dataset on a regular basis, so that the entirety of the DOAJ’s data is more easily accessible to the public without strain on our infrastructure.

Watch out here and on our social media channels for further announcements regarding these new developments.

Posted on 01/10/2018 (updated 26/09/2018) by DOAJ (Directory of Open Access Journals) Posted in Guest Post, Metadata Tagged Cottage Labs, deep paging, downtime, ElasticSearch, SCOSS, Steven Eardley 1 Comment

Update on discontinued journals

DOAJ will remove all journals that have ceased to publish unless they are continued by another title.

Although this reverses a previous decision, we have taken this step after careful consideration in order to keep our metadata as relevant and as accurate as possible. Although there are clear advantages to retaining published articles from discontinued journals, there is also a downside. With the current number of discontinued journals in DOAJ standing at 208, this generates a growing number of broken links over time (link rot), affecting the quality of our metadata. All of the library catalogues, discovery services, search engines and other downstream services which rely heavily on our metadata start inheriting that link rot.

[UPDATE: I should clarify that removing a journal and its content from DOAJ simply makes it unavailable to the public. DOAJ doesn’t delete anything. We are hoping to collaborate with an archiving partner to help preserve journal content. If those plans come to fruition there is no reason why some removed journals could not be reinstated. However, it is early days on those discussions. – Dom]


Posted on 19/03/2018 (updated 20/03/2018) by DOAJ (Directory of Open Access Journals) Posted in Metadata, Using the DOAJ 8 Comments

Parts of DOAJ metadata in my catalogue/discovery service/A-Z List have changed, are out-of-date or are missing completely. Why?

It’s time to update your index and/or ask your discovery service provider to do so!


We have received several messages via our Feedback account from library catalogue administrators, their readers, and researchers asking why, all of a sudden, so many of the articles indexed from DOAJ are missing. Within library catalogues, users may be seeing a 404 message. On DOAJ itself, users are noticing that the article content included in DOAJ is much changed and reduced. Why is this?

Hopefully it is old news to you by now that DOAJ has been undertaking its Reapplication project. Over the last six months, the DOAJ Team has made a final, huge push to process all the reapplications, and the project is very nearly complete: of the 6359 reapplications submitted to us, only a handful are still to be re-checked. 2072 of the reapplications did not pass our new standards. This means that the range of journals in DOAJ has changed considerably over a relatively short period of time.

Aggregators, service providers and library catalogue administrators: we strongly advise you to re-index or re-download the DOAJ article metadata, via OAI-PMH or our API, so that your catalogues have the most up-to-date version of our index. The DOAJ index changes weekly, so it might be time to revisit how often you come to us for updates and increase that frequency. Learn more about how to interact with DOAJ.
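For administrators re-harvesting over the API, the process is a walk through paged search results. A minimal sketch using only the standard library follows; the endpoint path and parameter names reflect the public API of the time, so check the current API documentation before relying on them:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed endpoint; verify against the current DOAJ API documentation.
BASE = "https://doaj.org/api/v1/search/articles/"

def page_url(query, page=1, page_size=100):
    """Build the URL for one page of article search results."""
    return f"{BASE}{quote(query)}?page={page}&pageSize={page_size}"

def harvest(query, max_pages=10):
    """Yield article records page by page until the API runs out of results."""
    for page in range(1, max_pages + 1):
        with urlopen(page_url(query, page)) as resp:
            data = json.load(resp)
        if not data.get("results"):
            return  # no more pages
        yield from data["results"]
```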

If you are using the search or browse features at DOAJ, then be aware that some of the journals you are used to seeing in the Directory may no longer be there.

If you have any questions about the effect of the reapplication project on DOAJ, about indexing DOAJ via our API or OAI-PMH services, or even how to search and browse, then please leave a comment here or send email to feedback@doaj.org.

Posted on 25/09/2017 by DOAJ (Directory of Open Access Journals) Posted in Metadata, Progress report, Reapplications, Using the DOAJ Tagged advice, database, indexing, librarians, search, subject browser 2 Comments

We have improved our XML validation


We are happy to announce improvements to our XML validation. When article XML is uploaded to DOAJ from the publisher area, we check whether the uploaded file meets a number of criteria:

  • Is it actually XML? (Have you given us a PDF by mistake?)
  • Does the XML meet the DOAJ article schema requirements?

If your upload doesn’t meet either of those criteria you will get an error message straight away. It will either tell you we didn’t think the file was XML, or it will give you a message from our validator identifying the first problem found during validation.

If your file does pass those checks, it will be entered into our upload queue. You will need to check back on your upload page in a while to see whether the upload was processed successfully. When we process your upload, we carry out some more checks to make sure that the data in DOAJ remains consistent:

  • Do you reference any ISSNs that belong to journals you don’t own? You can’t attach articles to someone else’s journals.
  • Do you reference any ISSNs that don’t appear in DOAJ at all? We only take uploads for journals we know about, and your journal records need to contain all the correct identifiers if you want to reference them.
  • Do you reference ISSNs that are owned by more than one user? Sometimes it’s possible for DOAJ to think that more than one user has a stake in an ISSN. In these cases we need to resolve who the true owner is before you can upload.

If any of these situations arise with your file upload, we won’t import any of the articles, and will instead record an error against the upload. When you look at the upload record on the page you will see a link in the Notes column which says “show error details”. Clicking it will tell you how the articles failed to process and which ISSNs were causing the problems.

If you see any other error messages on your upload page, click the “show error details” link if it is available. If not, contact us at feedback@doaj.org and we will get back to you as soon as we can.
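The first check (is it actually XML?) can be reproduced with the standard library alone; validating against the DOAJ article schema additionally needs an XSD-aware library such as lxml, so only the well-formedness step is sketched here:

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(data: bytes):
    """Return (True, None) if data parses as XML, else (False, error message)."""
    try:
        ET.fromstring(data)
        return True, None
    except ET.ParseError as e:
        return False, str(e)

# A PDF handed over by mistake fails immediately with a parse error,
# which is exactly the "we didn't think the file was XML" case above.
```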


Posted on 16/05/2017 (updated 11/07/2017) by claradoaj Posted in Metadata, New feature, News update, Using the DOAJ 1 Comment

The Redalyc JATS project strides towards standardisation

Every now and then we hear news that is music to our ears, and this time it’s news from Redalyc, who have recently announced their project to transition their journals to the JATS XML format. The decision to make the transition was taken last year, and on August 29th Redalyc will launch their new online tool (Marcalyc) for JATS XML markup. From the press release:

‘Redalyc has undertaken a transition to the adoption of XML-JATS which provides a standardized format for describing and exchanging structured data. Redalyc’s strategy is based on empowering editors by providing tools and knowledge to make XML tagging a sustainable process. Redalyc is launching a new online tool (Marcalyc) for XML JATS markup—compatible with the JATS4R recommendation—as a free service for the Open Access journals indexed by Redalyc, a tool which is not designed to be used by technical experts or programmers.

Once having XML files, Redalyc will provide enriched file formats like ePUB and intelligent readers resulting in greater visibility and accessibility for Open Access research in Latin America, lowering costs for journals and leveraging the power of new technologies.’*

Redalyc are also launching an XML JATS reader which will improve the reading experience, with powerful features like responsiveness, reference linking, search inside, an image gallery, section navigation and automatic conversion to ePUB, PDF and HTML, all based on the JATS4R specification. They have also provided a free online certification course (https://attendee.gotowebinar.com/register/9160206569091767555), the last session of which runs today, in which they have trained editors in the adoption of JATS.

Arianna Becerril, Director of Technology and Innovation at Redalyc and DOAJ Advisory Board member, said: “We believe the adoption of JATS in Redalyc will help the Open Access science produced in Latin America achieve a greater international visibility, increasing the possibilities for that content to be discovered, read and cited. And also, this new feature will help journals to reach higher standards like the DOAJ criteria in the case of machine readability. [Today] we’re feeding article metadata for almost 100 journals into DOAJ with links to the full-text PDF. Now we’ll be providing links not only to the PDF file but also ePUB, the smart reader and HTML for the journals with XML markup.”

Well done Redalyc! This will surely play an important role in the continued growth of academic publishing from Latin America.

*much more information about this excellent project is available in Spanish at https://xmljatsredalyc.org/

Posted on 25/08/2016 by DOAJ (Directory of Open Access Journals) Posted in Advice and best practice, Metadata Tagged Arianna Becerril, JATS, JATS4R, Latin America, metadata, Redalyc, XML

A lot goes a long way: data quality improvement at DOAJ

Since 2012, DOAJ has been on a path of data quality improvement. DOAJ metadata is used all over the world and all over the Web. Improving and fixing the quality of our metadata can be painstaking work but the effort goes a long way as changes propagate across the Web via search engines, aggregator databases, library portals and other databases.

Along these lines, the largest publisher (in terms of the number of journals) in DOAJ recently added missing abstracts to over 100,000 articles and fixed broken special characters in approximately 2000 more. This was a huge effort on their part and DOAJ is grateful for the work that has gone into this project. It is an achievement that will be welcomed by DOAJ metadata consumers.

To date, Hindawi has 161,334 articles loaded to DOAJ and until recently was the largest contributor of metadata to our index. That title was taken from them recently when DOAJ ingested the entire PLOS archive from Europe PMC.

Posted on 13/07/2016 by DOAJ (Directory of Open Access Journals) Posted in Metadata, News update, Progress report Tagged article counts, article metadata, data quality, Hindawi, metadata, PLoS 6 Comments
