ERDDAP™ - Set Up Your Own ERDDAP™
Things You Need To Know
Proxy Errors
Sometimes, a request to ERDDAP™ will return a Proxy Error, an HTTP 502 Bad Gateway Error, or some similar error. These errors are being thrown by Apache or Tomcat, not ERDDAP™ itself.
- If every request generates these errors, especially when you are first setting up your ERDDAP™, then it probably is a proxy or bad gateway error, and the solution is probably to fix ERDDAP's proxy settings. This may also be the problem when an established ERDDAP™ suddenly starts throwing these errors for every request.
- Otherwise, "proxy" errors are usually actually time out errors thrown by Apache or Tomcat. Even when they happen relatively quickly, it is some sort of response from Apache or Tomcat that occurs when ERDDAP™ is very busy, memory-limited, or limited by some other resource. In these cases, see the advice below to deal with ERDDAP™ responding slowly.
Requests for a long time range (>30 time points) from a gridded dataset are prone to time out failures, which often appear as Proxy Errors, because it takes significant time for ERDDAP™ to open all of the data files one-by-one. If ERDDAP™ is otherwise busy during the request, the problem is more likely to occur. If the dataset's files are compressed, the problem is more likely to occur, although it's hard for a user to determine if a dataset's files are compressed.
The solution is to make several requests, each with a smaller time range. How small of a time range? I suggest starting really small (~30 time points?), then (approximately) double the time range until the request fails, then go back one doubling. Then make all the requests (each for a different chunk of time) needed to get all of the data.
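The chunking described above can be sketched in a few lines of shell. This is only an illustration: the function name chunk_ranges is made up for this example, and it assumes the time axis can be addressed by index (0 to N-1). It prints one "start stop" pair per request; building the actual request URL from each pair is left to your script.

```shell
#!/bin/sh
# Print "start stop" index pairs that split total_points time points
# into consecutive chunks of at most chunk_size points each.
# (chunk_ranges is a hypothetical helper, not part of ERDDAP.)
chunk_ranges() {
  total_points=$1
  chunk_size=$2
  start=0
  while [ "$start" -lt "$total_points" ]; do
    stop=$((start + chunk_size - 1))
    # The last chunk may be shorter than chunk_size.
    if [ "$stop" -ge "$total_points" ]; then
      stop=$((total_points - 1))
    fi
    echo "$start $stop"
    start=$((stop + 1))
  done
}

# Example: 100 time points, 30 per request.
chunk_ranges 100 30
# prints:
# 0 29
# 30 59
# 60 89
# 90 99
```

To follow the doubling advice, rerun with chunk_size set to 60, 120, and so on until a request fails, then go back to the last size that worked.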
An ERDDAP™ administrator can lessen this problem by increasing the Apache timeout settings.
Monitoring
We all want our data services to find their audience and be extensively used, but sometimes your ERDDAP™ may be used too much, causing problems, including super slow responses for all requests. Our plan to avoid problems is:
-
Monitor ERDDAP™ via the status.html web page.
It has tons of useful information. If you see that a huge number of requests are coming in, or tons of memory being used, or tons of failed requests, or each Major LoadDatasets is taking a long time, or see any sign of things getting bogged down and responding slowly, then look in ERDDAP's log.txt file to see what's going on.
It's also useful to simply note how fast the status page responds. If it responds slowly, that is an important indicator that ERDDAP™ is very busy.
-
Monitor ERDDAP™ via the Daily Report email.
-
Watch for out-of-date datasets via the baseUrl/erddap/outOfDateDatasets.html web page which is based on the optional testOutOfDate global attribute.
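Since the speed of the status page is itself a useful signal, it can be checked from a script. The sketch below is one possible approach, not something ERDDAP™ provides: the function names and the example URL are placeholders, and the 5-second threshold is just an assumed cutoff that you should tune for your own server.

```shell
#!/bin/sh
# Classify a response time (in seconds) against a threshold (in seconds).
# Prints "slow" if the time exceeds the threshold, otherwise "ok".
classify_response_time() {
  awk -v t="$1" -v limit="$2" \
    'BEGIN { if (t + 0 > limit + 0) print "slow"; else print "ok" }'
}

# Time one request for a URL and print the total seconds taken.
# curl's -w '%{time_total}' reports the total transfer time.
time_status_page() {
  curl -s -o /dev/null --max-time 60 -w '%{time_total}' "$1"
}

# Usage (ERDDAP_URL is a placeholder for your own base URL):
#   ERDDAP_URL="https://example.org/erddap"
#   seconds=$(time_status_page "$ERDDAP_URL/status.html")
#   echo "status.html took ${seconds}s: $(classify_response_time "$seconds" 5)"
```

Running this periodically (for example, from cron) gives a rough external view of how busy ERDDAP™ is.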
External Monitors
The methods listed above are ERDDAP's ways of monitoring itself. It is also possible to make or use external systems to monitor your ERDDAP. One project to do this is Axiom's erddap-metrics project. Such external systems have some advantages:
- They can be customized to provide the information you want, displayed in the way you want.
- They can include information about ERDDAP™ that ERDDAP™ can't access easily or at all (for example, CPU usage, disk free space, ERDDAP™ response time as seen from the user's perspective, and ERDDAP™ uptime).
- They can provide alerts (emails, phone calls, texts) to administrators when problems exceed some threshold.
Multiple Simultaneous Requests
-
Blacklist users making multiple simultaneous requests! If it is clear that some user is making more than one simultaneous request, repeatedly and continuously, then add their IP address to ERDDAP's <requestBlacklist> in your datasets.xml file. Sometimes the requests are all from one IP address. Sometimes they are from multiple IP addresses, but clearly the same user. You can also blacklist people making tons of invalid requests or tons of mind-numbingly inefficient requests.
Then, for each request they make, ERDDAP™ returns:
HTTP ERROR 403 - Access Forbidden --
Your IP address is on this ERDDAP's request blacklist.
Did you often submit more than one request at a time?
Did you often submit identical requests in a short period of time?
Did you submit a large number of invalid requests?
If you are ready to avoid these problems, please email [ERDDAP™ administrator's email address] to request to be taken off of the blacklist.
Hopefully the user will see this message and contact you to find out how to fix the problem and get off the blacklist. Sometimes, they just switch IP addresses and try again.
It is like the balance of power between offensive and defensive weapons in war. Here, the defensive weapons (ERDDAP) have a fixed capacity, limited by the number of cores in the CPU, the disk access bandwidth, and the network bandwidth. But the offensive weapons (users, notably scripts) have unlimited capacity:
- A single request for data from a lot of time points may cause ERDDAP to open a huge number of files (in sequence or partly multi-threaded). In extreme cases, one "simple" request can easily tie up the RAID attached to ERDDAP™ for a minute, effectively blocking the handling of other requests.
- A single request may consume a large chunk of memory (even though ERDDAP™ is coded to minimize the memory needed to handle large requests).
- Parallelization -
It is easy for a clever user to parallelize a big task by generating lots of threads, each of which submits a separate request (which may be large or small). This behavior is encouraged by the computer science community as an efficient way to deal with a large problem (and parallelizing is efficient in other circumstances). Going back to the war analogy: users can make an essentially unlimited number of simultaneous requests with the cost of each being essentially zero, but the cost of each request coming into ERDDAP™ can be large and ERDDAP's response capability is finite. Clearly, ERDDAP™ will lose this battle, unless the ERDDAP™ administrator blacklists users who are making multiple simultaneous requests which are unfairly crowding out other users.
- Multiple Scripts -
Now think about what happens when there are several clever users each running parallelized scripts. If one user can generate so many requests that other users are crowded out, then multiple such users can generate so many requests that ERDDAP™ becomes overwhelmed and seemingly unresponsive. It is effectively a DDoS attack. Again, the only defense for ERDDAP™ is to blacklist users making multiple simultaneous requests which are unfairly crowding out other users.
- Inflated Expectations -
In this world of massive tech companies (Amazon, Google, Facebook, ...), users have come to expect essentially unlimited capabilities from the providers. Since these companies are money making operations, the more users they have, the more revenue they have to expand their IT infrastructure. So they can afford a massive IT infrastructure to handle requests. And they cleverly limit the number of requests and cost of each request from users by limiting the kinds of requests that users can make so that no single request is burdensome, and there is never a reason (or a way) for users to make multiple simultaneous requests. So these huge tech companies may have far more users than ERDDAP™, but they have massively more resources and clever ways to limit the requests from each user. It's a manageable situation for the big IT companies (and they get rich!) but not for ERDDAP™ installations. Again, the only defense for ERDDAP™ is to blacklist users making multiple simultaneous requests which are unfairly crowding out other users.
So users: Don't make multiple simultaneous requests or you will be blacklisted!
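For script authors, the considerate pattern is a plain sequential loop: one request at a time, with a short pause between requests. The sketch below shows the idea; fetch_sequentially is a made-up helper name, the fetch command is supplied by the caller, and the 2-second pause in the usage comment is only a suggested value.

```shell
#!/bin/sh
# Run one request at a time, pausing between requests.
# Usage: fetch_sequentially PAUSE_SECONDS FETCH_COMMAND URL...
# FETCH_COMMAND is called once per URL; each call must finish before
# the next one starts, so there is never more than one request in flight.
fetch_sequentially() {
  pause=$1
  fetch=$2
  shift 2
  first=1
  for url in "$@"; do
    # Pause before every request except the first.
    if [ "$first" -eq 0 ]; then
      sleep "$pause"
    fi
    first=0
    "$fetch" "$url" || echo "request failed: $url" >&2
  done
}

# Usage (my_download is a placeholder for your own function or command
# that fetches one URL, e.g. a small wrapper around curl):
#   fetch_sequentially 2 my_download "$url1" "$url2" "$url3"
```

Because the loop never starts a request before the previous one finishes, slow responses from a busy server automatically slow the script down instead of piling up.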
Clearly, it is best if your server has a lot of cores, a lot of memory (so you can allocate a lot of memory to ERDDAP™, more than it ever needs), and a high bandwidth internet connection. Then, memory is rarely or never a limiting factor, but network bandwidth becomes the more common limiting factor. Basically, as there are more and more simultaneous requests, the speed to any given user decreases. That naturally slows down the number of requests coming in if each user is just submitting one request at a time.
ERDDAP™ Getting Data from THREDDS
If your ERDDAP™ gets some of its data from a THREDDS at your site, there are some advantages to making a copy of the THREDDS data files (at least for the most popular datasets) on another RAID that ERDDAP™ has access to so that ERDDAP™ can serve data from the files directly. At ERD, we do that for our most popular datasets.
- ERDDAP™ can get the data directly and not have to wait for THREDDS to reload the dataset or ...
- ERDDAP™ can notice and incorporate new data files immediately, so it doesn't have to pester THREDDS frequently to see if the dataset has changed. See <updateEveryNMillis>.
- The load is split between 2 RAIDS and 2 servers, instead of the request being hard on both ERDDAP™ and THREDDS.
- You avoid the mismatch problem caused by THREDDS having a small (by default) maximum request size. ERDDAP™ has a system to handle the mismatch, but avoiding the problem is better.
- You have a backup copy of the data which is always a good idea.
In any case, don't ever run THREDDS and ERDDAP™ in the same Tomcat. Run them in separate Tomcats, or better, on separate servers.
We find that THREDDS periodically gets in a state where requests just hang. If your ERDDAP™ is getting data from a THREDDS and the THREDDS is in this state, ERDDAP™ has a defense (it says the THREDDS-based dataset isn't available), but it is still troublesome for ERDDAP™ because ERDDAP™ has to wait until the timeout each time it tries to reload a dataset from a hung THREDDS. Some groups (including ERD) avoid this by proactively restarting THREDDS frequently (e.g., nightly in a cron job).
Responding Slowly
-
If ERDDAP™ Is Responding Slowly or if just certain requests are responding slowly,
you may be able to figure out if the slowness is reasonable and temporary (e.g., because of lots of requests from scripts or WMS users), or if something is inexplicably wrong and you need to shut down and restart Tomcat and ERDDAP™.
If ERDDAP™ is responding slowly, see the advice below to determine the cause, which hopefully will enable you to fix the problem.
You may have a specific starting point (e.g., a specific request URL) or a vague starting point (e.g., ERDDAP™ is slow).
You may know the user involved (e.g., because they emailed you), or not.
You may have other clues, or not.
Since all of these situations and all of the possible causes of the problems blur together, the advice below tries to deal with all possible starting points and all possible problems related to slow responses.
-
Look for clues in ERDDAP's log file (bigParentDirectory/logs/log.txt).
[On rare occasions, there are clues in Tomcat's log file (tomcat/logs/catalina.out).]
Look for error messages.
Look for a large number of requests coming from one (or a few) users and perhaps hogging a lot of your server's resources (memory, CPU time, disk access, internet bandwidth).
If the trouble is tied to one user, you can often get a clue about who the user is via web services like https://whatismyipaddress.com/ip-lookup that can give you information related to the user's IP address (which you can find in ERDDAP's log.txt file).
- If the user seems to be a bot behaving badly (notably, a search engine trying to fill out the ERDDAP™ forms with every possible permutation of entry values), make sure you have properly set up your server's robots.txt file.
- If the user seems to be a script(s) that is making multiple simultaneous requests, contact the user, explain that your ERDDAP™ has limited resources (e.g., memory, CPU time, disk access, internet bandwidth), and ask them to be considerate of other users and just make one request at a time. You might also mention that you will blacklist them if they don't back off.
- If the user seems to be a script making a large number of time-consuming requests, ask the user to be considerate of other users by putting a small pause (2 seconds?) in the script between requests.
- WMS client software can be very demanding. One client will often ask for 6 custom images at a time. If the user seems to be a WMS client that is making legitimate requests, you can:
- Ignore it. (recommended, because they'll move on pretty soon)
- Turn off your server's WMS service via ERDDAP's setup.xml file. (not recommended)
- If the requests seem stupid, insane, excessive, or malicious, or if you can't resolve the problem any other way, consider temporarily or permanently adding the user's IP address to the <requestBlacklist> in your datasets.xml file.
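When looking for a user who is sending a large number of requests, a quick count of the IP addresses in the log can help. The sketch below is a rough heuristic, not an ERDDAP™ tool: top_ips is a made-up name, and it simply counts every IPv4-looking token in the file without assuming anything about the exact layout of log.txt, so other dotted numbers may occasionally show up in the results.

```shell
#!/bin/sh
# List the IPv4-looking tokens that occur most often in a log file,
# most frequent first, as "count address" lines.
# Usage: top_ips LOG_FILE [HOW_MANY]   (HOW_MANY defaults to 10)
top_ips() {
  grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' "$1" |
    sort | uniq -c | sort -rn | head -n "${2:-10}"
}

# Usage (replace bigParentDirectory with your own directory):
#   top_ips bigParentDirectory/logs/log.txt 5
```

An address with a count far above the others is a good starting point for the follow-up steps listed above.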
-
Try to duplicate the problem yourself, from your computer.
Figure out if the problem is with one dataset or all datasets, for one user or all users, for just certain types of requests, etc.
If you can duplicate the problem, try to narrow down the problem.
If you can't duplicate the problem, then the problem may be tied to the user's computer, the user's internet connection, or your institution's internet connection.
-
If just one dataset is responding slowly (perhaps only for one type of request from one user), the problem may be:
-
ERDDAP's access to the dataset's source data (notably from relational databases, Cassandra, and remote datasets) may be temporarily or permanently slow. Try to check the source's speed independent of ERDDAP. If it is slow, perhaps you can improve it.
-
Is the problem related to the specific request or general type of request?
The larger the requested subset of a dataset, the more likely the request will fail. If the user is making huge requests, ask the user to make smaller requests that are more likely to get a fast and successful response.
Almost all datasets are better at handling some types of requests than other types of requests. For example, when a dataset stores different time chunks in different files, requests for data from a huge number of time points may be very slow. If the current requests are of a difficult type, consider offering a variant of the dataset that is optimized for these requests. Or just explain to the user that that type of request is difficult and time consuming, and ask for their patience.
-
The dataset may not be optimally configured. You may be able to make changes to the dataset's datasets.xml chunk to help ERDDAP™ handle the dataset better. For example,
- EDDGridFromNcFiles datasets that access data from compressed nc4/hdf5 files are slow when getting data for the entire geographic range (e.g., for a world map) because the entire file must be decompressed. You could convert the files to uncompressed files, but then the disk space requirement will be much, much larger. It is probably better to just accept that such datasets will be slow in certain circumstances.
- The configuration of the <subsetVariables> tag has a huge influence on how ERDDAP™ handles EDDTable datasets.
- You may be able to increase the speed of an EDDTableFromDatabase dataset.
- Many EDDTable datasets can be sped up by storing a copy of the data in NetCDF Contiguous Ragged Array files, which ERDDAP™ can read very quickly.
If you want help speeding up a specific dataset, include a description of the problem and the dataset's chunk of datasets.xml and see our section on getting additional support.
-
If everything in ERDDAP™ is always slow, the problem may be:
- The computer that is running ERDDAP™ may not have enough memory or processing power. It is good to run ERDDAP™ on a modern, multi-core server. For heavy use, the server should have a 64-bit operating system and 8 GB or more of memory.
- The computer that is running ERDDAP™ may also be running other applications that are consuming lots of system resources. If so, can you get a dedicated server for ERDDAP™? For example (this is not an endorsement), you can get a quad-core Mac Mini Server with 8 GB of memory for ~$1100.
-
If everything in ERDDAP™ is temporarily slow, view your ERDDAP's /erddap/status.html page in your browser.
- Does the ERDDAP™ status page fail to load?
If so, restart ERDDAP™.
- Did the ERDDAP™ status page load slowly (e.g., >5 seconds)?
That is a sign that everything in ERDDAP™ is running slowly, but it isn't necessarily trouble. ERDDAP™ may just be really busy.
- For "Response Failed Time (since last major LoadDatasets)", is n= a large number?
That indicates there have been lots of failed requests recently. That may be trouble or the start of trouble. The median time for the failures is often large (e.g., 210000 ms), which means that there were (are?) lots of active threads, which were tying up lots of resources (like memory, open files, open sockets, ...), which is not good.
- For "Response Succeeded Time (since last major LoadDatasets)", is n= a large number?
That indicates there have been lots of successful requests recently. This isn't trouble. It just means your ERDDAP™ is getting heavy use.
- Is the "Number of non-Tomcat-waiting threads" double a typical value?
This is often serious trouble that will cause ERDDAP™ to slow down and eventually freeze. If this persists for hours, you may want to proactively restart ERDDAP™.
- At the bottom of the "Memory Use Summary" list, is the last "Memory: currently using" value very high?
That may just indicate high usage, or it may be a sign of trouble.
- Look at the list of threads and their status. Are an unusual number of them doing something unusual?
-
Is your institution's internet connection currently slow?
Search the internet for "internet speed test" and use one of the free online tests, such as https://www.speakeasy.net/speedtest/. If your institution's internet connection is slow, then connections between ERDDAP™ and remote data sources will be slow, and connections between ERDDAP™ and the user will be slow. Sometimes, you can solve this by stopping unnecessary internet use (e.g., people watching streaming videos or on video conference calls).
-
Is the user's internet connection currently slow?
Have the user search the internet for "internet speed test" and use one of the free online tests, such as https://www.speakeasy.net/speedtest/. If the user's internet connection is slow, it slows down their access to ERDDAP. Sometimes, they can solve this by stopping unnecessary internet use at their institution (e.g., people watching streaming videos or on video conference calls).
-
Stuck?
See our section on getting additional support.
Shut Down and Restart
-
How to Shut Down and Restart Tomcat and ERDDAP™
You don't need to shut down and restart Tomcat and ERDDAP™ if ERDDAP™ is temporarily slow, slow for some known reason (like lots of requests from scripts or WMS users), or to apply changes to the datasets.xml file.
You do need to shut down and restart Tomcat and ERDDAP™ if you need to apply changes to the setup.xml file, or if ERDDAP™ freezes, hangs, or locks up. In extreme circumstances, Java may freeze for a minute or two while it does a full garbage collection, but then recover. So it is good to wait a minute or two to see if Java/ERDDAP™ is really frozen or if it is just doing a long garbage collection. (If garbage collection is a common problem, allocate more memory to Tomcat.)
I don't recommend using the Tomcat Web Application Manager to start or shut down Tomcat. If you don't fully shut down and start up Tomcat, sooner or later you will have PermGen memory issues.
To shut down and restart Tomcat and ERDDAP™:
- If you use Linux or a Mac:
(If you have created a special user to run Tomcat, e.g., tomcat, remember to do the following steps as that user.)
-
Use cd tomcat/bin
-
Use ps -ef | grep tomcat to find the java/tomcat processID (hopefully, just one process will be listed), which we'll call javaProcessID below.
-
If ERDDAP™ is frozen/hung/locked up, use kill -3 javaProcessID to tell Java (which is running Tomcat) to do a thread dump to the Tomcat log file: tomcat/logs/catalina.out . After you reboot, you can diagnose the problem by finding the thread dump information (and any other useful information above it) in tomcat/logs/catalina.out and also by reading relevant parts of the ERDDAP™ log archive. If you want, you can include that information and see our section on getting additional support.
-
Use ./shutdown.sh
-
Use ps -ef | grep tomcat repeatedly until the java/tomcat process isn't listed.
Sometimes, the java/tomcat process will take up to two minutes to fully shut down. The reason is: ERDDAP™ sends a message to its background threads to tell them to stop, but sometimes it takes these threads a long time to get to a good stopping place.
-
If after a minute or so, java/tomcat isn't stopping by itself, you can use
kill -9 javaProcessID
to force the java/tomcat process to stop immediately. If possible, use this only as a last resort. The -9 switch is powerful, but it may cause various problems.
-
To restart ERDDAP™, use ./startup.sh
-
View ERDDAP™ in your browser to check that the restart succeeded. (Sometimes, you need to wait 30 seconds and try to load ERDDAP™ again in your browser for it to succeed.)
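The Linux/Mac steps above can be collected into a small script. This is a sketch under stated assumptions, not an official ERDDAP™ script: the function names are made up, the caller supplies the tomcat/bin directory and the javaProcessID found with ps, the 120-second wait reflects the "up to two minutes" noted above, and the script is expected to run as the user that owns the Tomcat process (otherwise the kill -0 liveness check cannot see it).

```shell
#!/bin/sh
# Wait until process PID has exited, checking once per second.
# Returns 0 if it exited within TIMEOUT seconds, 1 if it is still running.
# "kill -0" sends no signal; it only tests whether the process can be signalled.
wait_for_exit() {
  pid=$1
  timeout=$2
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Shut down Tomcat, wait up to two minutes for it to stop, force-stop it
# only as a last resort, then start Tomcat again.
# Usage: restart_tomcat TOMCAT_BIN_DIR JAVA_PROCESS_ID
restart_tomcat() {
  tomcat_bin=$1
  java_pid=$2
  "$tomcat_bin/shutdown.sh"
  if ! wait_for_exit "$java_pid" 120; then
    echo "Tomcat still running after 120 s; sending kill -9" >&2
    kill -9 "$java_pid"
    wait_for_exit "$java_pid" 10
  fi
  "$tomcat_bin/startup.sh"
}

# Usage (path and process ID are placeholders):
#   restart_tomcat /path/to/tomcat/bin 12345
```

If ERDDAP™ is frozen, run kill -3 javaProcessID by hand before calling restart_tomcat so that the thread dump is captured in catalina.out first.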
- If you use Windows:
- Use cd tomcat/bin
- Use shutdown.bat
- You may want/need to use the Windows Task Manager (accessible via Ctrl Alt Del) to ensure that the Java/Tomcat/ERDDAP™ process/application has fully stopped.
Sometimes, the process/application will take up to two minutes to shut down. The reason is: ERDDAP™ sends a message to its background threads to tell them to stop, but sometimes it takes these threads a long time to get to a good stopping place.
- To restart ERDDAP™, use startup.bat
- View ERDDAP™ in your browser to check that the restart succeeded. (Sometimes, you need to wait 30 seconds and try to load ERDDAP™ again in your browser for it to succeed.)
Frequent Crashes or Freezes
If ERDDAP™ becomes slow, crashes or freezes, something is wrong. Look in ERDDAP's log file to try to figure out the cause. If you can't, please include the details and see our section on getting additional support.
The most common problem is a troublesome user who is running several scripts at once and/or someone making a large number of invalid requests. If this happens, you should probably blacklist that user. When a blacklisted user makes a request, the error message in the response encourages them to email you to work out the problems. Then, you can encourage them to run just one script at a time and to fix the problems in their script (e.g., requesting data from a remote dataset that can't respond before timing out). See <requestBlacklist> in your datasets.xml file.
In extreme circumstances, Java may freeze for a minute or two while it does a full garbage collection, but then recover. So it is good to wait a minute or two to see if Java/ERDDAP™ is really frozen or if it is just doing a long garbage collection. (If garbage collection is a common problem, allocate more memory to Tomcat.)
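One common way to allocate more memory to Tomcat is through a setenv.sh file in tomcat/bin, which Tomcat's startup scripts read if it exists. The fragment below is a sketch: the ERDDAP_HEAP variable name is made up for this example, and the 4g value is only a placeholder that should be chosen to fit your server's physical memory.

```shell
#!/bin/sh
# tomcat/bin/setenv.sh (sketch): sourced by Tomcat's startup scripts if present.
# ERDDAP_HEAP is a hypothetical helper variable; 4g is a placeholder value.
ERDDAP_HEAP="4g"

# -Xms sets the initial Java heap size and -Xmx the maximum heap size.
# Any options already present in CATALINA_OPTS are kept.
export CATALINA_OPTS="${CATALINA_OPTS:-} -Xms${ERDDAP_HEAP} -Xmx${ERDDAP_HEAP}"
```

A full shutdown and restart of Tomcat is needed before a changed heap size takes effect.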
If ERDDAP™ becomes slow or freezes and the problem isn't a troublesome user or a long garbage collection, you can usually solve the problem by restarting ERDDAP™. My experience is that ERDDAP™ can run for months without needing a restart.
Monitor
You can monitor your ERDDAP's status by looking at the /erddap/status.html page, notably the statistics in the top section. If ERDDAP™ becomes slow or freezes and the problem isn't just extremely heavy usage, you can usually solve the problem by restarting ERDDAP™. There are additional metrics available through the Prometheus integration at /erddap/metrics.
My experience is that ERDDAP™ can run for months without needing a restart. You should only need to restart it if you want to apply some changes you made to ERDDAP's setup.xml or when you need to install new versions of ERDDAP™, Java, Tomcat, or the operating system. If you need to restart ERDDAP™ frequently, something is wrong. Look in ERDDAP's log file to try to figure out the cause. If you can't, please include the details and see our section on getting additional support. As a temporary solution, you might try using Monit to monitor your ERDDAP™ and restart it if needed. Or, you could make a cron job to restart ERDDAP™ (proactively) periodically. It may be a little challenging to write a script to automate monitoring and restarting ERDDAP. Some tips that might help:
- You can simplify testing if the Tomcat process is still running by using the -c switch with grep:
ps -u tomcatUser | grep -c java
That will reduce the output to "1" if the tomcat process is still alive, or "0" if the process has stopped.
- If you are good with gawk, you can extract the processID from the results of
ps -u tomcatUser | grep java, and use the processID in other lines of the script.
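The two tips above can be wrapped in small helper functions that read ps output from standard input. This is a sketch: the function names are made up, plain awk is used in place of gawk (gawk accepts the same program), and it assumes the usual ps -u layout in which the process ID is the first column.

```shell
#!/bin/sh
# Count the lines that mention "java" in ps output read from stdin.
# Prints 0 when no java process is listed.
count_java() {
  grep -c java
}

# Print the process ID (first column) of the first line mentioning "java"
# in ps output read from stdin. Prints nothing if there is no such line.
first_java_pid() {
  awk '/java/ { print $1; exit }'
}

# Usage (tomcatUser is the account that runs Tomcat):
#   ps -u tomcatUser | count_java
#   ps -u tomcatUser | first_java_pid
```

A monitoring script could restart Tomcat when count_java prints 0, or pass the output of first_java_pid to kill -3 before a restart.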