Guide for running a distributed production database?

To put it briefly, has anyone had success running the community version in production at scale? If so, is there a guide that describes how to do it?

We have been running in distributed mode, and I’ve read through the official documentation, such as this guide: https://orientdb.com/docs/2.2.x/Distributed-Configuration.html, but we keep running into stability issues. Usually shutting down the entire cluster and starting one node at a time fixes it, but that still means the system is down for up to an hour. We’re running the Docker image 2.2.37 with 3 nodes, and the database is about 18 GB. We primarily use the REST API.

Here are some issues we’ve been unable to solve:

  • While one node is starting up, a 2nd is in backup mode, so the 3rd doesn’t have a quorum for writes. So if any node has issues, the entire database is effectively down.
  • If a node goes down once it’s been added to the cluster, it will usually not come back up without restarting the whole cluster.
  • One node (that we don’t write to) keeps getting corrupted in a way that it cannot rejoin. We have to restore from another node or a backup.
  • Crashes and memory errors sometimes corrupt the database and it needs to be exported and imported to recover (especially in our standalone testing databases).
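
To make the quorum problem in the first bullet concrete, here is a rough sketch of the arithmetic. It assumes the cluster uses a majority write quorum (more than half of the configured nodes must be available); that setting and the helper names below are assumptions for illustration, not taken from our actual config.

```python
def majority_quorum(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority (assumed quorum rule)."""
    return total_nodes // 2 + 1


def can_write(total_nodes: int, available_nodes: int) -> bool:
    """True if enough nodes are available to satisfy a majority write quorum."""
    return available_nodes >= majority_quorum(total_nodes)


total = 3
# One node is starting up and a second is in backup mode, so only one is available.
available = total - 2

print("nodes needed for writes:", majority_quorum(total))
print("writes possible:", can_write(total, available))
```

Under that assumption, a 3-node cluster needs 2 available nodes to accept writes, so taking two nodes out of service at the same time (one starting, one in backup mode) leaves the remaining node unable to write.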

Hi, can you share your server’s config and memory settings? Which version of OrientDB are you using?

I’m using the official Docker image for 2.2.37. I haven’t changed any of the configs, but these are the environment variables for memory:

ORIENTDB_OPTS_MEMORY="-Xms512m -Xmx512m"
JAVA_OPTS_SCRIPT="-Djna.nosys=true -XX:+HeapDumpOnOutOfMemoryError -XX:MaxDirectMemorySize=1024M -Djava.awt.headless=true -Dfile.encoding=UTF8 -Drhino.opt.level=9"

The containers have 1536 MB of RAM allocated and stay under 1300 MB used. Startup has these log messages for memory:

2019-06-17 17:53:36:044 INFO cgroup soft memory limit is 1610612736 B/1536 MB/1 GB [ONative]
2019-06-17 17:53:36:045 INFO cgroup hard memory limit is 1610612736 B/1536 MB/1 GB [ONative]
2019-06-17 17:53:36:045 INFO Detected memory limit for current process is 1610612736 B/1536 MB/1 GB [ONative]
2019-06-17 17:53:36:047 INFO OrientDB auto-config DISKCACHE=529MB (heap=494MB direct=1,024MB os=1,536MB) [OMemoryAndLocalPaginatedEnginesInitializer]
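
For reference, a quick sketch of how the settings above add up against the container limit. The numbers come from the environment variables and the startup log; the helper name is made up for illustration. Whether this budget is related to the crashes is only a guess, but a JVM also uses memory outside the heap and direct buffers (metaspace, thread stacks, code cache), so it may be worth checking.

```python
BYTES_PER_MB = 1024 * 1024

# Values taken from the settings and startup log above.
cgroup_limit_bytes = 1610612736   # "cgroup hard memory limit"
heap_max_mb = 512                 # -Xmx512m
direct_max_mb = 1024              # -XX:MaxDirectMemorySize=1024M


def headroom_mb(limit_mb: int, heap_mb: int, direct_mb: int) -> int:
    """Memory left in the container if heap and direct memory both reach their maximums."""
    return limit_mb - (heap_mb + direct_mb)


container_limit_mb = cgroup_limit_bytes // BYTES_PER_MB

print("container limit (MB):", container_limit_mb)
print("heap + direct maximum (MB):", heap_max_mb + direct_max_mb)
print("headroom (MB):", headroom_mb(container_limit_mb, heap_max_mb, direct_max_mb))
```

If both limits were ever reached at once, the maximum heap plus maximum direct memory would equal the full 1536 MB container limit, leaving nothing for the rest of the JVM process.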