Wednesday, December 3, 2014

NetWorker EBR backup error "Internal Cycle Error"

Recently, maintenance was performed in the VM environment, and all EBR backups failed.  After rebooting the EBR appliance, we still experienced intermittent EBR backup issues.  Random VM backups failed with the error "INTERNAL CYCLE ERROR: backstreamdir failed".

Working with support, we eventually found that time was not in sync.  From the logs, the start times of NetWorker jobs appeared to be about 10s off compared with the logs on the VM and EBR side.  Our VM team found that time sync had been turned off in VMware Tools for some reason.  Because our NetWorker server and storage nodes are not in the domain, they point to a different NTP server for time sync.  After confirming the time sync issue was fixed, we did not experience the issue again.

So, always start with the basics when troubleshooting VM backups, as in the past: are VMware Tools up to date?  Are all ESX servers in the cluster running the same version?  Is time in sync?
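The time check above can be scripted.  A minimal sketch of a drift check in shell; the remote value here is simulated with the ~10s skew we saw, and in practice you would pull it from the other host (e.g. via ssh), so treat the hostnames and threshold as assumptions:

```shell
# Sketch: flag clock drift between the backup server and another host.
# "remote_epoch" is a stand-in for something like: ssh ebr-appliance date +%s
local_epoch=$(date +%s)
remote_epoch=$((local_epoch + 10))   # simulate the ~10s skew seen in the logs
drift=$(( local_epoch - remote_epoch ))
[ "$drift" -lt 0 ] && drift=$(( -drift ))   # absolute value
if [ "$drift" -gt 5 ]; then
  echo "WARNING: clock drift of ${drift}s detected"
fi
```

Anything much over a few seconds between the NetWorker server, storage nodes, vCenter and the EBR appliance is worth chasing down.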


Saturday, November 8, 2014

NetWorker new "save session distribution" option for load balancing

One of the new features I discovered after upgrading to NW 8.1.1.7 is "Save session distribution".  There is one NetWorker server and three storage nodes to handle the backups.  The NW server only handles index, bootstrap and a few Solaris clients; the majority of backups are sent to the storage nodes.  When a pool is created, we make sure all three storage nodes have a device set up in the pool for load balancing.  In the past, you would send certain clients through sn1 and others through another storage node by setting the storage node affinity of each client.  Here are a few points I took from the NetWorker training.

Max sessions

  • Save sessions are distributed based on each storage node device's max sessions attribute (default).
  • More likely to concentrate the backup load on fewer storage nodes.

Target sessions

  • Save sessions are distributed based on each storage node device's target sessions attribute.
  • More likely to spread backups across multiple storage nodes.

P.S. Save session distribution is not available for clone or recover operations.

You can set this up from the server's properties.

This configuration can be overridden on a per-client basis within the client properties.

Lastly, make sure you enter all available and qualified storage nodes in the client's storage node affinity field.  NetWorker will try to load-balance the backup among the storage nodes defined there.
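If you prefer the command line, the affinity list can also be edited with nsradmin on the NetWorker server.  A rough sketch with hypothetical hostnames (verify the attribute names with the nsradmin man page or by printing the resource first):

```
nsradmin> . type: NSR client; name: client01.example.com
nsradmin> update storage nodes: sn1.example.com, sn2.example.com, sn3.example.com
nsradmin> print storage nodes
```

The order of the list still matters when save session distribution is off, so keep that in mind when scripting changes across many clients.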


Migration from NetWorker 7.6 to 8.1.1.7 and dynamic nsrmmd

I finally completed the NW 7.6.5.7 to 8.1.1.7 migration and troubleshooting on Labour Day.  Backup devices are DD880s running DDOS 5.2.x.  I am only now getting the chance to write this post.  DFA works perfectly fine for the clients, and the backup window has shortened quite a lot.  A new NetWorker 8.1.1.7 server was built and the same client instances were created on it, so there was no upgrade of the existing 7.6.5.7 backup server and no mmrecov on the new box.  Clients were moved in groups of 50 - 100 each week.  The old box was virtualized; since we don't set a long retention, we will simply wait for all savesets on the 7.6.5.7 box to expire next year.

No issues came up until the last group of clients was moved a few days before Labour Day.  After rebooting the server, all NW services started up fine.  I ran a bootstrap backup, and it kept waiting for media for a long time.  So I just recycled the NetWorker service, and backup was fine.  A few days later, the same thing happened again.  Because of the tight backup window, I just restarted the service and it was fine again.  I opened a case with support to look for known bugs; however, there was nothing similar to my situation.  When it happened again just before Labour Day, following routine troubleshooting steps, I discovered the nsrmmd count was not correct during the busiest backup period.  All nsrmmd processes should be in use during the busy backup window, but in fact not all of them were started.  I turned off dynamic nsrmmd as a workaround, and it never happened again.  Once you turn off dynamic nsrmmd, you will see all nsrmmd processes for the device started.  It looks like NW does not spawn more nsrmmd processes when they are needed.  To turn off dynamic nsrmmd (a new feature in 8.x), go to NMC > Devices > Storage Nodes, open the properties of each SN and de-select "dynamic nsrmmd".  I have not checked patches 8 and 9; it may be fixed by now.
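The same toggle can presumably be flipped from nsradmin as well.  A sketch with a hypothetical storage node name; the exact attribute name may differ by release, so print the NSR storage node resource first to confirm before updating:

```
nsradmin> . type: NSR storage node; name: sn1.example.com
nsradmin> update dynamic nsrmmds: No
```

Doing it via nsradmin makes it easy to apply the workaround consistently across all storage nodes.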


When this issue occurs, because there are not enough nsrmmd processes to serve all the backups, lots of jobs pile up.  If you run netstat, you will see a lot of connections in TIME_WAIT because of that.
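A quick way to quantify that is to count the TIME_WAIT sockets from netstat output.  A sketch, with a small inline sample standing in for the real `netstat -an` output so the filtering logic is visible:

```shell
# Count TIME_WAIT sockets. In real use, replace the heredoc sample with:
#   netstat -an | awk '$NF == "TIME_WAIT" { c++ } END { print c+0 }'
count=$(awk '$NF == "TIME_WAIT" { c++ } END { print c+0 }' <<'EOF'
tcp 0 0 10.0.0.5:7937 10.0.0.21:51432 TIME_WAIT
tcp 0 0 10.0.0.5:7937 10.0.0.22:51501 ESTABLISHED
tcp 0 0 10.0.0.5:7937 10.0.0.23:51688 TIME_WAIT
EOF
)
echo "TIME_WAIT sockets: $count"
```

A count that keeps climbing during the backup window is a hint that sessions are being queued rather than served.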

To count the number of nsrmmd processes, open the properties of the device.


In this example, target sessions starts at 1 and max sessions is set to 20.  Keep in mind 1 nsrmmd is reserved as the RO (read-only) nsrmmd for restore sessions.  The remaining 5 nsrmmds will handle a max of 20 sessions for backup; that is 1 nsrmmd handling a max of 4 sessions.  I am a bit conservative and still use the ratio suggested in 7.6.1.  I know there are different numbers suggested; however, I only need to run one restore every 2-3 days and only need to DR a box once or twice a year, so 1 RO stream is enough.

If you want to achieve the best dedupe ratio for database backups, you can set your device to a 1:1 sessions-to-nsrmmd ratio.  If you decide that device should support a max of 8 sessions, then the max nsrmmd count should be 9 in this case (8 backup nsrmmds plus 1 RO).
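The sizing math above can be sketched as a couple of lines of shell arithmetic; the numbers are just the two examples from this post, not universal recommendations:

```shell
# Sketch of the nsrmmd sizing math from the examples above.
max_sessions=20        # device "max sessions" attribute
sessions_per_nsrmmd=4  # conservative 7.6.1-era ratio
ro_nsrmmd=1            # one nsrmmd reserved read-only for restores

backup_nsrmmds=$(( max_sessions / sessions_per_nsrmmd ))
total_nsrmmds=$(( backup_nsrmmds + ro_nsrmmd ))
echo "nsrmmd count: $total_nsrmmds"     # 5 backup + 1 RO = 6

# Dedupe-friendly 1:1 case: a device capped at 8 sessions needs 8 + 1 RO.
one_to_one=$(( 8 + 1 ))
echo "1:1 nsrmmd count: $one_to_one"
```

If restores are more frequent in your environment, budget more than one RO nsrmmd before working out the backup-side ratio.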

Thursday, August 14, 2014

savefs: Cannot retrieve the client resources

Recently, a colleague had an issue backing up a new client.  The following errors kept coming up.

90088:savefs: Cannot find the NSR group resource named 'Windows'.
90069:savefs: Cannot retrieve the client resources.

From my experience, these are DNS-related errors; check the Command Reference Guide.  A second possibility is that the client was moved to another group during the backup, but I knew this was not the case.  We checked DNS, and both forward and reverse lookup were fine.

Eventually, we found out this new box had been set up to replace another box.  The new box was set up with a different hostname, but the alias was the same.  Comparing the client instances between the new box and the decommissioned one, we found the alias was still entered on the decommissioned client.  (If you attempt to enter the alias on the newly created client, it should give you an error, because you cannot have two different hosts with the same alias.)

All we did was remove the alias used by the newly built box from the decommissioned client's aliases field, then enter it in the newly created client's aliases field.  Now the backup runs fine without any more issues.
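The conflict itself is easy to spot mechanically.  A sketch in shell that compares two alias lists and reports any alias claimed by both clients; the hostnames and alias values are made up for illustration:

```shell
# Hypothetical alias lists for the decommissioned and the new client resource.
old_client_aliases="oldbox oldbox.example.com appserver"
new_client_aliases="newbox newbox.example.com appserver"

conflicts=""
for a in $new_client_aliases; do
  for b in $old_client_aliases; do
    if [ "$a" = "$b" ]; then
      conflicts="$conflicts $a"
      echo "conflict: alias '$a' is still on the decommissioned client"
    fi
  done
done
```

Run against the real aliases fields (e.g. dumped with nsradmin), this would have pointed straight at the shared alias.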

(Another commonly seen DNS-related savefs error is "nothing to save".  Check and make sure the aliases field includes the other names if the host has more than one alias or more than one NIC.)

DD4500 Data Movement to archive tier uses all the resources

We have set up a DD4500 for our cloud.  Recently, we saw backups running extremely slowly with lots of timeout errors in NetWorker.  Rebooting the backup server and resetting jobsdb and tmp did not fix the issue.
We suspected it was something similar to "Cleaning Affects Ingest in 5.2. User Cannot Throttle Down (kb182136)".  After contacting support, it turned out that data movement to the archive tier was taking all the resources.

We attempted to change the throttle for data movement with the help of support, but it did not seem to help.  The only options were to upgrade to DDOS 5.4.3.0 or disable data movement.  We suspended data movement as a workaround, and backup speed returned to normal.
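From the DD CLI, the workaround looks roughly like this; these `archive data-movement` commands are from memory of the DDOS 5.x CLI, so verify the exact syntax against the DD OS Command Reference for your release:

```
archive data-movement status
archive data-movement stop      # suspend movement to the archive tier
# ...after the backup window, or once upgraded to 5.4.3.0:
archive data-movement start
```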

We finally completed the upgrade to DDOS 5.4.3.0, and we no longer see the issue.

Monday, July 14, 2014

NetWorker 8.1.1 EBR VM backup

We deployed NetWorker 8.1 with a DD4500 for one of our customers earlier this year.  It works pretty well; DR and restores are easy.  However, there are a few things to know before you decide to deploy it.

1) The EBR and NetWorker server must be resolvable in DNS (both forward and reverse lookup).  It looks like adding entries to the hosts file is not an option.  Maybe someone wants to try the hosts file and let me know if they can get it working without DNS.

2) Schedules with overrides do not seem to work.  For example, I have a schedule of incrementals with an exception on the first Friday of the month.  I found that customized schedules with exceptions are not available for a "VMWARE action" in the policy.

3) Not enough activity is shown in NMC, and activity logging in NetWorker is limited.  For example, when I DR a box, I don't see much info in NMC; I have to check through the vCenter GUI.

4) If you choose to back up to the internal storage on the EBR, you cannot clone those backups.

5) There is no option to clone on demand; cloning is policy-based.  All policies must start with a backup, so I cannot just clone savesets out with the nsrclone command.

6) If you plan to back up VMs running SPS, SQL, etc., this is not a Microsoft application-aware backup.  You still need to use the NetWorker modules for those backups.

7) You need to log in to the EBR to perform a DR.

8) Each proxy can handle only 8 streams.  Deploy more proxies if you have a tight backup window for your VMs.
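Point 8 is simple ceiling division.  A sketch of the proxy sizing math, using a made-up concurrency target of 20 streams:

```shell
# Proxies needed for N concurrent VM backup streams at 8 streams per proxy.
streams_needed=20   # hypothetical: VMs you want backing up at once
per_proxy=8         # EBR proxy stream limit
proxies=$(( (streams_needed + per_proxy - 1) / per_proxy ))   # ceiling division
echo "proxies needed: $proxies"   # 20 streams -> 3 proxies
```

Size against your peak concurrency inside the backup window, not the total VM count.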