Timeout (1000ms) on acquiring lock on record #-1:-1 on server 'DeadLock'. It is locked by request 0.168809

Hey guys,

we are currently facing the following error when inserting data in our preproduction database since we are using orientdb in a multi cluster architecture:
com.orientechnologies.orient.server.distributed.task.ODistributedRecordLockedException: Timeout (1000ms) on acquiring lock on record #-1:-1 on server ‘DeadLock’. It is locked by request 0.168809

It is currently a blocker for us to use orient in production as it is unable to solve the deadlock without restarting the db.

In order to maintain consistency we use transactions to insert/change a couple of vertices in a tree. The insertions are triggered by javascript functions (to reduce network traffic). We have a lot of parallel insertions in a short amount of time as we are using a RabbitMQ delivering input data of multiple customers.

The first thing I was asking myself was: As I am using pessimistic locking I would not expect that deadlock situation at all. So what is causing this behaviour and how can I avoid it?
I found some GitHub issues and StackOverflow articles about that issue and the only info I was able to extract was that it has somehow to do with the execution mode. Right now we have mode ‘undefined’ set (which is default according to the documentation). According to Luca Garulli who answered in one of the stackoverflow articles, “undefined means synchronous, unless you specify something different via API” in version > 2.2. But we are using 3.0.13 and it is still not working.

Thanks in advance
Jon

Hi @jonsalvas

do you have the entire stacktrace of the exception?

@jonsalvas

how are you triggering the javascript functions? via HTTP?

Thanks for your fast reply.
Exactly we are executing the functions via HTTP.

Regarding the Exception: this is the only trace I got. Sometime I am however also receiving this:

 com.orientechnologies.common.concur.lock.OLockException: Time out acquire lock for resource: '#36:38' 
	at com.orientechnologies.common.concur.lock.OSimpleLockManagerImpl.lock(OSimpleLockManagerImpl.java:37)
	at com.orientechnologies.orient.server.distributed.impl.ONewDistributedTxContextImpl.lock(ONewDistributedTxContextImpl.java:51)
	at com.orientechnologies.orient.server.distributed.impl.ODatabaseDocumentDistributed.acquireLocksForTx(ODatabaseDocumentDistributed.java:544)
	at com.orientechnologies.orient.server.distributed.impl.ODatabaseDocumentDistributed.internalBegin2pc(ODatabaseDocumentDistributed.java:731)
	at com.orientechnologies.orient.server.distributed.impl.ODatabaseDocumentDistributed.commit2pc(ODatabaseDocumentDistributed.java:657)
	at com.orientechnologies.orient.server.distributed.impl.task.OTransactionPhase2Task.execute(OTransactionPhase2Task.java:92)
	at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin$1.call(ODistributedAbstractPlugin.java:658)
	at com.orientechnologies.orient.core.db.OScenarioThreadLocal.executeAsDistributed(OScenarioThreadLocal.java:69)
	at com.orientechnologies.orient.server.distributed.impl.ODistributedAbstractPlugin.executeOnLocalNode(ODistributedAbstractPlugin.java:654)
	at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.onMessage(ODistributedWorker.java:347)
	at com.orientechnologies.orient.server.distributed.impl.ODistributedWorker.run(ODistributedWorker.java:121)
Error beginning timed out transaction: 0.238

And you are doing some kind of load balancing via HTTP right?

We are running orientdb in Kubernetes and orient is behind a service (which is somehow a loadbalancer) over multiple orient pods/nodes.

Update: we tried to directly send the data to one of the nodes in the cluster but now we receive:

com.orientechnologies.orient.server.distributed.task.ODistributedOperationException: quorum not reached, responses: [success,exception: 'Cannot create the record #139:112 because the assigned RID was #139:67 instead

We have now tried to overcome this issue by implementing a retry logic which retries the transaction up to 100 times if ONeedRetryException is thrown. It seems that even though 100 retries are performed the same error occurs over and over again.
On each retry we receive
2019-08-02 20:16:50:631 WARNI [xxx-odb-orientdb-0] Timeout (1210ms) on waiting for synchronous responses from nodes=[xxx-odb-orientdb-0, xxx-odb-orientdb-2, xxx-odb-orientdb-1] responsesSoFar=[xxx-odb-orientdb-0] request=(id=0.1140 task=TxPhase1 user=#191:0) [ODistributedDatabaseImpl]

Anyone an idea what is causing this issue? It is a blocker for us to use orient in production.