Saturday, November 8, 2014

Migration from NetWorker 7.6 to 8.1.1.7 and dynamic nsrmmd

I finally completed the NW 7.6.5.7 to 8.1.1.7 migration, and the troubleshooting that came with it, on Labour Day, and I only have a chance to write the post now.  The backup devices are DD880 running DDOS 5.2.x.  DFA works perfectly fine for the clients and the backup window has shortened quite a lot.  I built a new NetWorker 8.1.1.7 server and created the same client instances on it, so there was no upgrade of the existing 7.6.5.7 backup server and no mmrecov on the new box.  Clients were moved in groups of 50 - 100 each week.  The old box was virtualized and will wait there until all of its save sets expire next year, since we don't set a long retention.

No issues came up until the last group of clients was moved a few days before Labour Day.  After rebooting the server, all NW services started up fine.  I ran a bootstrap backup, and it kept waiting for media for a long time.  So I just recycled the NetWorker service, and the backup was fine.  A few days later, the same thing happened again.  Because of the tight backup window, I just restarted the service and it was fine again.  I opened a case with support to look for known bugs, but there was nothing similar to my situation.  When it happened again just before Labour Day, I followed the routine troubleshooting steps and discovered that the nsrmmd count was not correct during the busiest backup period.  All nsrmmd should have been in use during the busy backup window, but in fact not all of them were.

I turned off dynamic nsrmmd as a workaround, and it never happened again.  Once you turn off dynamic nsrmmd, you will see all nsrmmd of the device started.  It looks like NW does not call for more nsrmmd when they are needed.  To turn off dynamic nsrmmd (a new feature in 8.x), go to NMC > Devices > Storage Nodes, open up the properties of each SN and de-select "dynamic nsrmmd".  I have not checked patches 8 and 9.  It may be fixed by now.


When this issue occurs, there are not enough nsrmmd to serve all the backups, so lots of jobs pile up.  If you run netstat, you will see a lot of connections in TIME_WAIT because of that.
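
If you would rather count the states than eyeball them, here is a minimal Python sketch.  It assumes plain `netstat -an` style text where the state is the last column of each TCP row.  The function name and the sample addresses are made up for illustration and are not from my server.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Tally TCP connection states from `netstat -an` style text.

    Assumption: each TCP row starts with "tcp"/"tcp6"/"TCP" and the
    connection state is the last whitespace-separated column.
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines and non-TCP rows (UDP has no state).
        if len(fields) >= 4 and fields[0].lower().startswith("tcp"):
            states[fields[-1]] += 1
    return states


# Hypothetical sample, using documentation-only addresses.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address        Foreign Address      State
tcp        0      0 192.0.2.10:8000      0.0.0.0:*            LISTEN
tcp        0      0 192.0.2.10:8001      192.0.2.21:40112     TIME_WAIT
tcp        0      0 192.0.2.10:8002      192.0.2.22:40377     TIME_WAIT
tcp        0      0 192.0.2.10:8003      192.0.2.23:41590     ESTABLISHED
udp        0      0 0.0.0.0:111          0.0.0.0:*
"""

if __name__ == "__main__":
    for state, count in count_tcp_states(SAMPLE).most_common():
        print(state, count)
```

Feed it the saved output of netstat from the backup server during the busy window and compare the TIME_WAIT number against a normal night.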

To check the number of nsrmmd, open up the properties of the device.


In this example, we have target sessions starting at 1 and max sessions set to 20.  Keep in mind that 1 nsrmmd is used as read-only (RO) for restore sessions.  Out of 6 nsrmmd, the remaining 5 will handle a max of 20 backup sessions.  That is 1 nsrmmd handling a max of 4 sessions.  I am a bit conservative and still use the ratio suggested in 7.6.1.  I know there are different numbers suggested.  However, I only need to run 1 restore every 2-3 days, and I only need to DR a box once or twice a year.  So, 1 RO stream is enough.

If you want to achieve the best dedupe ratio for database backups, you can set the sessions to nsrmmd ratio on the device to 1 to 1.  If you decide that the device should support a max of 8 sessions, then the max nsrmmd count should be 9 in this case: 8 for backup plus 1 RO.
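
The arithmetic in both examples can be written down as a small Python sketch.  This is only my rule of thumb from this post (backup sessions divided by the sessions-per-nsrmmd ratio, plus the RO nsrmmd), not anything NetWorker calculates for you, and the function and parameter names are made up.

```python
import math


def max_nsrmmd_count(max_sessions: int,
                     sessions_per_nsrmmd: int,
                     read_only_nsrmmd: int = 1) -> int:
    """Return the max nsrmmd count for a device.

    max_sessions        -- max backup sessions the device should accept
    sessions_per_nsrmmd -- how many sessions one nsrmmd should handle
    read_only_nsrmmd    -- nsrmmd kept aside for restores (assumed 1 here)
    """
    if max_sessions < 1 or sessions_per_nsrmmd < 1 or read_only_nsrmmd < 0:
        raise ValueError("session counts must be positive")
    # Round up so a partial group of sessions still gets its own nsrmmd.
    backup_nsrmmd = math.ceil(max_sessions / sessions_per_nsrmmd)
    return backup_nsrmmd + read_only_nsrmmd


if __name__ == "__main__":
    print(max_nsrmmd_count(20, 4))  # 20 sessions at 4 per nsrmmd, plus 1 RO
    print(max_nsrmmd_count(8, 1))   # 8 sessions at 1 to 1, plus 1 RO
```

With 20 sessions at 4 per nsrmmd this gives 6, and with 8 sessions at 1 to 1 it gives 9, matching the two examples above.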
