Datadomain replication fails with "send queue: sending aborted by caller"

Last Friday I struggled with some of our Dell EMC Datadomain (DD) systems. I was trying to set up an MTree replication between two identical systems.
The first few steps worked without any issues, so I was confident I would start the weekend early. But then things changed.

Ambiguous error message after replication setup

But first things first. Let’s start with the step-by-step process I used to enable the replication.

Create Pair

After you have added the destination Datadomain system under Managed Systems, you can simply click Create Pair to set up the replication context. Since the source system is used by several backup applications but only part of it is to be replicated, we select the MTree replication type.

Creating outbound MTree replication on source system

After you confirm the configuration with OK, the replication context is created. The system then sets up everything required for replication on its own and, hopefully, finishes without errors.

Replication Pair successfully created

For a few seconds I was looking forward to an early finish. The joy, however, was short-lived. After I clicked Close, the overview page refreshed, and there it was: the nonspecific error message send queue: sending aborted by caller.

Troubleshooting the issue

At first I ruled out all the typical sources of error: IP connectivity, DNS resolution, and so on. Then I searched the web for the error message. To my surprise, this was one of the rare cases in which the search engine returned no results.

Since the systems were not quite on the software level recommended by Dell EMC, I decided to update both the source and the target system to the same release (the target code) supported by the backup applications and released by Dell EMC. From experience I knew that the vendor would have asked for this as the first step after opening a service request anyway.

After about two hours I found that, despite the update, the replication was still failing with the same error. So I compared all replication settings on both systems once more, and that is how I found the root cause.
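The basic checks from the start of the troubleshooting (DNS resolution and IP connectivity) can also be scripted from any admin host. The following is only a minimal Python sketch, not a Dell EMC tool: the function name `check_endpoint` is my own, and the port you pass in is an assumption you have to verify against your own replication configuration.

```python
import socket


def check_endpoint(host, port, timeout=3.0):
    """Resolve a host name and try a TCP connection to the given port.

    Returns a dict with the resolved addresses, a reachability flag and,
    if something failed, a short error description.
    """
    result = {
        "host": host,
        "port": port,
        "addresses": [],
        "reachable": False,
        "error": None,
    }

    # Step 1: DNS resolution (also works for literal IP addresses).
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["addresses"] = sorted({info[4][0] for info in infos})
    except socket.gaierror as exc:
        result["error"] = f"DNS resolution failed: {exc}"
        return result

    # Step 2: plain TCP connect; this only proves the port is reachable,
    # not that replication itself is healthy.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["reachable"] = True
    except OSError as exc:
        result["error"] = f"TCP connection failed: {exc}"

    return result
```

A call such as `check_endpoint("dd-destination.example.com", 2051)` would then report whether the name resolves and the port answers. Both the host name and the port 2051 are placeholders here. In my case these checks were all fine, which is exactly why the error was so puzzling.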

Configured Throttle Setting prevented the replication from initializing

Throttling was enabled on the source system. Specifically, the schedule disabled replication completely during certain time periods. On Datadomain systems, the throttle value is set or overwritten at fixed times. The schedule above enables replication on Friday at 18:00 and disables it again on Monday at 02:00. Strictly speaking, all entries except the one on Friday and the first one on Monday are superfluous, since they only repeat a setting that is already in effect.
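To make the "set or overwritten at fixed times" behavior more tangible, here is a small Python model of such a weekly schedule. It is only a sketch under assumptions: the entries in `SCHEDULE` are hypothetical (they are not the exact entries from my system), the rate labels "disabled" and "unlimited" are just strings chosen for this example, and none of this talks to a Datadomain system.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical schedule: (day, time, rate). Each entry sets the throttle
# value at that moment; it stays in effect until the next entry fires.
SCHEDULE = [
    ("Mon", "02:00", "disabled"),
    ("Mon", "18:00", "disabled"),
    ("Wed", "02:00", "disabled"),
    ("Fri", "18:00", "unlimited"),
    ("Sat", "18:00", "unlimited"),
]


def week_minute(day, hhmm):
    """Minutes since Monday 00:00 for a day name and an 'HH:MM' string."""
    hours, minutes = map(int, hhmm.split(":"))
    return DAYS.index(day) * 24 * 60 + hours * 60 + minutes


def effective_rate(schedule, day, hhmm):
    """Rate set by the most recent entry at or before the given time."""
    now = week_minute(day, hhmm)
    ordered = sorted(schedule, key=lambda e: week_minute(e[0], e[1]))
    # The week wraps around: before the first entry of the week,
    # the last entry of the previous week is still in effect.
    current = ordered[-1][2]
    for entry_day, entry_time, rate in ordered:
        if week_minute(entry_day, entry_time) <= now:
            current = rate
        else:
            break
    return current


def redundant_entries(schedule):
    """Entries that only repeat the rate that is already in effect."""
    ordered = sorted(schedule, key=lambda e: week_minute(e[0], e[1]))
    if len(ordered) < 2:
        return []
    # Index -1 for the first entry compares it with the last entry
    # of the week, which is its predecessor after the wrap-around.
    return [
        entry
        for i, entry in enumerate(ordered)
        if entry[2] == ordered[i - 1][2]
    ]
```

With this hypothetical schedule, `effective_rate(SCHEDULE, "Fri", "17:45")` yields "disabled" and `effective_rate(SCHEDULE, "Fri", "18:00")` yields "unlimited", which matches what I saw: replication could not initialize until the Friday entry fired. `redundant_entries(SCHEDULE)` returns every entry except the first one on Monday and the one on Friday.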

Since it was already shortly before 18:00, I simply waited a few minutes for the system to lift the throttling on its own. At the very moment the throttling was lifted, the initialization started and the error disappeared.

All replication contexts are normal

Wrap-Up

An error from the category: small cause, big effect. The system's behavior is entirely correct, but the error message could be more self-explanatory. I was really surprised that I was apparently one of the first to fall into this trap. Hopefully this post will save someone else an unnecessary search for a self-inflicted error.
