You can’t run a service like DOAJ on a shoestring, or fund it hand-to-mouth, which is how DOAJ was funded until last year. Sustained funding at a significant level is crucial to the database’s security, stability, scalability and, ultimately, survival. Last month, DOAJ came under attack, and the whole process from diagnosing the problem to resolving it revealed just how fragile the DOAJ ecosystem can be under those circumstances.
Below is a guest blog post by Steven Eardley, the DevOps Engineer at our technology partner, Cottage Labs. You may remember that the SCOSS initiative was launched last year to provide DOAJ and Sherpa/RoMEO with levels of sustainable funding that would ensure the survival of two services which had been earmarked as vital to the open access movement. I asked Steven to write this post because, after DOAJ took a battering, I thought it would be useful for our community to understand exactly what happens when DOAJ becomes unresponsive, what it takes to keep DOAJ up and running, and what a difference sustainable funding can make for a service like DOAJ.
Thanks for reading!
Dom
The Directory of Open Access Journals experienced sustained and repeated downtime from mid-July to mid-August, which required significant intervention from Cottage Labs. This post describes what happened, explains how we mitigated the issue, and illustrates some of the inherent vulnerabilities which need to be addressed.
The system
The main DOAJ database is built on ElasticSearch, which we run across a number of virtual machines. The DOAJ web application is built on the Flask web framework and is written in Python.
As well as the search interface on the website, searches to our database come from the OAI-PMH interface, our Atom feed, and ‘widgets’ embedded on users’ sites. The DOAJ also has an API for programmatic interactions with our content, allowing, for example, publishers to upload their content. To keep the ElasticSearch index performing correctly for the rest of the site, we tend to tweak the rate limit for the API during periods of high load.
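We haven’t detailed our actual throttling setup here, but as a rough illustration of what a tunable per-client rate limit on an API route can look like in a Flask application, here is a minimal sketch (the route prefix and the limit are hypothetical):

```python
# Minimal sketch only -- DOAJ's real rate limiting is not shown here.
# It illustrates a tunable per-client request limit on an API route.
import time
from collections import defaultdict, deque

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical setting we could lower during periods of high load.
API_REQUESTS_PER_MINUTE = 120

_request_log = defaultdict(deque)  # client IP -> timestamps of recent requests


@app.before_request
def throttle_api():
    if not request.path.startswith("/api/"):
        return None  # only throttle API routes
    now = time.time()
    window = _request_log[request.remote_addr]
    # Drop timestamps older than 60 seconds, then check what remains.
    while window and now - window[0] > 60:
        window.popleft()
    if len(window) >= API_REQUESTS_PER_MINUTE:
        return jsonify(error="rate limit exceeded, please slow down"), 429
    window.append(now)
    return None
```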
The problem
The DOAJ website communicates with its database (“data store”) via the web (also known as the ‘query endpoint’), giving us control over what kinds of requests are permitted and the ability to filter responses to prevent data from being leaked unintentionally. Not all of the requests to ElasticSearch come from our own servers, however, because the DOAJ search pages utilise JavaScript which runs in users’ browsers. This means that the data store must be publicly accessible for our search pages to work. We keep the rate limit for this fairly high so pages are responsive.
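Conceptually, the query endpoint sits between the search pages and ElasticSearch as a thin filtering proxy. As a rough sketch only (the route, index name and field names below are hypothetical, not DOAJ’s actual configuration), it looks something like this:

```python
# Illustrative sketch of a query endpoint acting as a filtering proxy.
# Endpoint path, index name and field names are hypothetical.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

ES_SEARCH_URL = "http://localhost:9200/journals/_search"  # assumed local index
PRIVATE_FIELDS = {"admin_notes", "contact_email"}          # fields we never expose


@app.route("/query/journal", methods=["POST"])
def query_endpoint():
    # Forward the browser's query to ElasticSearch...
    resp = requests.post(ES_SEARCH_URL, json=request.get_json(force=True), timeout=30)
    results = resp.json()
    # ...but filter the response so private fields are never leaked.
    for hit in results.get("hits", {}).get("hits", []):
        for field in PRIVATE_FIELDS:
            hit.get("_source", {}).pop(field, None)
    return jsonify(results)
```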
During the period mentioned, we saw an elevated request rate via the query endpoint which resulted in all of the site functions slowing down. Eventually the index was overloaded and the DOAJ website stopped responding. Recovery tended to be fairly quick and site performance would be restored for a number of hours. However, the downtime was persistent and recurring, sometimes happening multiple times in a day.
The cause
While we could see that the quantity of queries to our index had at least doubled from its normal rate, we couldn’t identify the source of the queries causing the trouble and couldn’t look for long-term trends. We will find resources to improve this in the future, but monitoring of this kind is costly and is currently out of budget. Therefore, we mainly had to diagnose via inference. Our immediate action was to disable the application components one by one, endeavouring to pinpoint the source of the excess load. We turned off the API, the Atom feed, the back-end editorial admin pages, and finally the OAI-PMH interface. The latter had a small impact on index stability – we saw reduced memory use in the index and it coped somewhat better with the load. This pointed us towards the source of the problem: deep paging.
Deep Paging – ElasticSearch’s kryptonite
Deep paging essentially means that someone or something is scrolling through a large number of objects sequentially, leading to high memory commitment: the server must hold the entire context in memory to get deeper and deeper into the results. We determined that someone was sending thousands of requests per hour directly via the query endpoint, instead of through the API, and was essentially trawling through all of our results over a long period of time. That is to say: they weren’t using a provided feature; rather, they were bypassing the features and using a hidden route.
Each time the system went down, the traffic took a while to resume, presumably because the trawl had to be started again from the beginning.
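To make the cost concrete, here is a sketch of what such a trawl looks like when it pages with from/size (the URL and page size are illustrative, not the actual traffic we saw). To serve each page, ElasticSearch has to collect and sort every hit up to that offset, so the deeper the paging goes, the more memory each request commits:

```python
# Sketch of a from/size trawl -- the kind of traffic that hurts ElasticSearch.
# URL and page size are illustrative, not the real query endpoint.
import requests

ES_SEARCH_URL = "http://localhost:9200/journals/_search"
PAGE_SIZE = 100

offset = 0
while True:
    # To serve page N, the index must gather and sort the first N*PAGE_SIZE
    # hits, so each request commits more memory than the last.
    body = {"query": {"match_all": {}}, "from": offset, "size": PAGE_SIZE}
    hits = requests.post(ES_SEARCH_URL, json=body, timeout=30).json()["hits"]["hits"]
    if not hits:
        break
    offset += PAGE_SIZE
```

ElasticSearch provides the scroll API precisely so that bulk harvesting can avoid this per-request cost, which is also why it features in the data dump plans described below.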
The solution
Our next task was therefore to block these external sources of deep paging. First, we changed the permitted referral parameter (e.g. `ref=ui` in the request URL) and the traffic dropped off. Success? Not quite. These parameters are easy to spoof, so it was no surprise that, not long after, the site went down once again amid further high traffic. We tried the same configuration change again, this time changing the referral parameter to `please-use-our-api` – the ‘begging’ approach to reducing system load! Unfortunately that bought us even less time than before.
At this point we categorised this as something like a denial of service attack. The chances are it wasn’t intentionally malicious, but circumventing our countermeasures and acting as a direct source of instability for thousands of other users is at least inconsiderate!
User agent blocking was the next measure we tried in order to reduce the load. Since the query route is really just for our own use, we decided it was reasonable to block some programmatic user agents, such as the Node.js http module, or the wget and curl utilities. Again we saw a temporary alleviation of the high load.
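We haven’t reproduced our exact blocklist here, but a minimal sketch of how this kind of user-agent check can be applied to the query route in Flask looks like this (the substrings below are just examples of programmatic clients):

```python
# Sketch of the user-agent blocking approach; the substrings below are
# examples of programmatic clients, not an exact copy of our blocklist.
from flask import Flask, abort, request

app = Flask(__name__)

BLOCKED_UA_SUBSTRINGS = ("curl/", "wget/", "node-fetch", "python-requests")


@app.before_request
def block_programmatic_clients():
    ua = (request.headers.get("User-Agent") or "").lower()
    if request.path.startswith("/query/") and any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
        abort(403)  # the query route is for the search pages, not scripts
```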
Of course, user agent strings are sent by the client with each request, so they can be altered at source. By now we’d identified the user’s IP address, and watched a few intermittent requests come in as the user iterated and figured out what we’d changed. Within a couple of hours, the torrent of requests resumed and our infrastructure once again creaked under the strain.
Since we’re not keen on IP banning and other heavy-handed measures, we decided to address the problem within our application code and give the query endpoint some rules to filter out the sorts of traffic that give the index trouble: we now cap the results to a reasonable number expected to be viewed from the search page; we enforce paging limits to disallow queries that page a long way into our data set; and the query endpoint will reject a request that doesn’t look like it came from a user.
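As a rough sketch of those rules (the limits and the browser check below are illustrative rather than our exact configuration), the validation looks something like this:

```python
# Sketch of the query-endpoint rules; limits and the browser check are
# illustrative rather than DOAJ's exact configuration.
from flask import Flask, abort, request

app = Flask(__name__)

MAX_PAGE_SIZE = 100      # more than a search page would ever display at once
MAX_PAGING_DEPTH = 1000  # disallow queries that page a long way into the data


@app.before_request
def vet_query_requests():
    if not request.path.startswith("/query/"):
        return None  # only the query endpoint gets these rules
    # Reject requests that don't look like they came from a browser user.
    ua = (request.headers.get("User-Agent") or "").lower()
    if not any(token in ua for token in ("mozilla", "webkit", "gecko")):
        abort(403)
    # Cap the number of results and the paging depth.
    query = request.get_json(silent=True) or {}
    size = int(query.get("size", 10))
    offset = int(query.get("from", 0))
    if size > MAX_PAGE_SIZE or offset + size > MAX_PAGING_DEPTH:
        abort(400)
    return None
```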
The lesson
We were a little complacent about our query endpoint’s obscurity. For a long time we’d planned to tighten up the rules on the search interface but, with DOAJ’s long list of other developments, it wasn’t a priority. Our endpoint was open and able to function like a little bonus undocumented API, and this was stretched to the limit when it was used to harvest all of our data rather than just facilitating search. This is a lesson for the design stage of the software process: keep the roles of system components as distinct and single-purpose as feasible.
In addition, this persistent grab of our data constituted a feature request – it should be easier to download the entire DOAJ dataset, and we will be implementing that over the coming weeks.
The improvements
We have a handful of improvements coming soon in response to these lessons:
- We’ll be re-writing our OAI-PMH interface to mitigate deep paging and high memory use.
- The Search API will also see some more restrictions to paging depth and number of results – this will more closely reflect its role as a discovery interface and not a harvest endpoint.
- We’ll create a dump of our entire dataset on a regular basis, so that the entirety of the DOAJ’s data is more easily accessible to the public without straining our infrastructure (a sketch of the approach follows this list).
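For that dump, ElasticSearch’s scroll API lets us stream every record out without the deep-paging cost described above. A minimal sketch, assuming a local index, with the index name, page size and output path all illustrative:

```python
# Minimal sketch of a periodic full-data dump using the ElasticSearch scroll
# API; index name, output path and page size are illustrative.
import gzip
import json

import requests

ES_BASE = "http://localhost:9200"
INDEX = "journals"
OUT_PATH = "doaj_dump.jsonl.gz"

with gzip.open(OUT_PATH, "wt") as out:
    # Open a scroll context; unlike from/size paging, scrolling streams
    # through the whole index without re-sorting from the start each time.
    resp = requests.post(
        f"{ES_BASE}/{INDEX}/_search?scroll=2m",
        json={"size": 1000, "query": {"match_all": {}}},
        timeout=60,
    ).json()
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]
    while hits:
        for hit in hits:
            out.write(json.dumps(hit["_source"]) + "\n")
        resp = requests.post(
            f"{ES_BASE}/_search/scroll",
            json={"scroll": "2m", "scroll_id": scroll_id},
            timeout=60,
        ).json()
        scroll_id = resp["_scroll_id"]
        hits = resp["hits"]["hits"]
```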
Watch out here and on our social media channels for further announcements regarding these new developments.
Thanks for the write up! I’ve always wondered how the endpoint was exposed. Keeping the ElasticSearch endpoint out of reach of the internet is a common approach, I believe.
Publishing dumps is definitely a good decision. If dumps are produced regularly (even just every few months), and archived on long-term storage such as the Internet Archive, it also becomes less likely that someone will feel the need to download everything for the sake of preservation.