Running into numerous stability issues on 2.2.x (suggestions?)

We keep running into issues while trying to keep a database running 24/7. Since things haven’t improved, we’ve started looking into what it would take to migrate to another database, but OrientDB would still be my first pick if I could just get it stable. (Here’s my previous post for background: Guide for running a distributed production database?)

My question is: does anyone have recommendations on what to use or avoid? Are there known issues with the REST API, distributed mode, Docker, or document size? Or does someone at least have a success story that would confirm we’re doing something abnormal that’s causing the problems?

current downtime details...

Our database was just down because a CREATE EDGE with a new class caused errors when propagating to the other nodes. Previously, it worked to shut down all nodes and run the schema change over plocal, but this time it started hanging on the message "Connecting to database [plocal:/...". So instead I started just that single node, but it had to reindex the entire database to start up, which took over 50 minutes.

I also noticed reindexing errors in the logs caused by duplicate ids on unique indexes. That shouldn’t happen, considering we run all inserts on a single node using upserts.
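
To chase down the duplicates, something like the following could be run against an export of the affected class. It is only a minimal sketch in Python: the function name, the list-of-dicts input, and the "id" field name are assumptions for illustration, not part of OrientDB.

```python
from collections import Counter


def find_duplicate_keys(records, key="id"):
    """Return {key value: count} for every value that appears more than once.

    `records` is assumed to be a list of dicts, e.g. loaded from a JSON export.
    Records that lack the key are skipped.
    """
    counts = Counter(r[key] for r in records if key in r)
    return {value: n for value, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Made-up sample data, just to show the shape of the input.
    sample = [{"id": 1}, {"id": 2}, {"id": 1}]
    print(find_duplicate_keys(sample))
```

Any value it reports would be one that a unique index on that field rejects during a rebuild.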

I’m writing this while the 2nd node is joining, with messages like "writing chunk #468 offset=1316974592 size=2.14MB [OHazelcastPlugin]" in the logs. I’ll probably be waiting another half hour before all 3 nodes are up. I just need this to stop happening.