Monday, March 2, 2026

NDFC upgrade / deploy part 2 - sizing and compatibility

At the time we were planning, we decided to use ND 3.2.1i (which comes with NDFC 12.2.2).  That version works with all our existing linecards, switches, and firmware, so I did not need to worry about our existing 8.4(2e) firmware being unsupported.  Use the links below for the compatibility check.

For software/hardware compatibility, check the NDFC Software and Hardware Compatibility Matrix.

To confirm NDFC and ND compatibility, check the Cisco Nexus Dashboard and Services Compatibility Matrix.

Next is sizing.  For sizing, check Cisco Nexus Dashboard Capacity Planning.  We decided on a single App-node configuration since this is a small environment.  (Note: a single-node configuration does not support adding more nodes later.)  The App OVA requires 16 vCPU and 64 GB of memory.

Other requirements: go through the documents below.

Cisco Nexus Dashboard and Services Deployment and Upgrade Guide, Release 3.2.x - Prerequisites: Nexus Dashboard [Cisco Nexus Dashboard] - Cisco


Device Manager error on Cisco MDS 9710

Recently, when we tried to open Device Manager to make changes on the MDS 9710 switch, the error below popped up.  It only happens at one of the sites; the other site works fine.  We tried opening Device Manager through NDFC and got the same error.

[Screenshot: Device Manager error]

We did see a different error in the past that required a controller switchover: "Busy network, no route, or snmpd is unresponsive."

See EMC KB 000218149.  This time, however, Device Manager still failed to open.

We suspected a network issue but opened a ticket with Cisco anyway.  Support ran some traces, saw no issue on the switch side, and confirmed there is connectivity between the switch and the client running Device Manager.  Eventually, we worked around the issue by confirming the TCP preference is set to true in DeviceManager.bat:

set JVMARGS=%JVMARGS% -Dsnmp.preferTCP=true
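To confirm the client really switched from UDP to TCP, one check we could use (my own assumption, not something support suggested) is to look for a TCP session from the Windows client to the switch on the SNMP port:

:: Windows client check; 10.0.0.5 is a placeholder for the switch IP
netstat -ano | findstr "10.0.0.5"
:: With snmp.preferTCP=true, expect an ESTABLISHED TCP connection to
:: port 161 instead of stateless UDP traffic.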

We still had trouble after that, so we logged in to the switch at the other site (the one without the issue) and clicked Device > Preferences.

[Screenshot: Device Manager Preferences dialog]

Select TCP for "Use SNMP on next launch", then click Apply and OK.

Then we tried Device Manager against the problem switch again, and it worked fine.  The only catch with this workaround is that you need another switch that Device Manager can already connect to without issue.  I have not asked support whether there is any other way to force TCP instead of UDP.


NDFC bugs since deployment

NDFC was deployed last October.  A couple of bugs have been discovered since.

1) Every 4-6 weeks, we see the alert below:

Elasticsearch error - 'could not fetch component status'

The bug ID is CSCwm51621.

If you follow the bug ID above, there is a solution from the forum.  For me, when the error comes up, I just reboot the appliance with "acs reboot".  The error goes away and does not come back for another 4-6 weeks.  A normal reboot is sufficient; DO NOT add any other option after "acs reboot".
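For clarity, the entire workaround from the client side is just this (the node address is a placeholder):

# SSH to the Nexus Dashboard node as rescue-user, then reboot cleanly.
# Per the note above, no extra options after "acs reboot".
ssh rescue-user@nd-node1.example.com
acs reboot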

2) /logs/k8/pods 90% usage alert.  About 4 months after deployment, a /logs/k8/pods 90% usage alert showed up in the Admin Console of the Nexus Dashboard.

[Screenshot: /logs/k8/pods usage alert in the Admin Console]

Support was contacted to clear the old logs.  Currently there is no fix; a WebEx session is required for support to clear the old logs manually.  Based on my chat with support, a future release will change the log retention for that folder.
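Before the WebEx, about the only thing worth doing from the node is confirming the usage.  Assuming your login gives you a shell, something like the following (the path is my reading of the alert text, not an official procedure):

# Confirm which log partition is filling up
df -h /logs/k8/pods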

Tuesday, December 30, 2025

NDFC upgrade / deploy part 1 - License planning

With DCNM reaching EOL in April 2026, the only option left is to upgrade to NDFC.  Some sites may decide to run DCNM until their whole infrastructure is migrated to the cloud.  Because we are still refreshing old MDS 9700 switches, migrating to NDFC is a must for us.  The main issue with firmware upgrades on existing switches is smart licensing.  We manage no more than 10 MDS FC switches at a time, so the setup is fairly simple; still, it is more complicated than DCNM was.

It looks like 9.2(1a) is the last MDS 9000 firmware version that supports the legacy license / license file.  If you upgrade an existing switch running older firmware to 9.2(2), the switch license is changed to smart licensing automatically.  Check with support; the answer I got is that existing switches upgraded to newer firmware will, in the worst case, run in Donor mode, with no impact to functionality according to support.  Because some of the MDS switch licenses were purchased through a 3rd-party vendor, and we later changed to another 3rd-party vendor for support, I don't know what would happen if a licensing issue came up after a firmware upgrade on an existing switch.  So we decided to keep the existing older MDS 9700 switches on firmware 8.4(2e) until the hardware refresh completes in the next 2-3 years.  There are a few bugs impacting the 8.4 firmware; confirm you have all the workarounds before deciding to stay on 8.4.  If you decide to upgrade an existing MDS 9700 to smart licensing, check with support first.

All the new MDS 9700s ship with firmware 9.4.x, so smart licensing is enabled automatically.  We checked with support: there is no license file anymore on these new MDS 9700 switches; the license is installed at the factory.

Because it is a closed environment, we set up our NDFC smart licensing in offline mixed mode.  Mixed mode supports the existing license files on the older MDS 9700s running the older 8.4 firmware.  We contacted support to transfer the existing legacy DCNM server licenses to smart licenses before the upgrade.  At most every 365 days, or whenever a switch is added or decommissioned, I export the license info to the Cisco software support website and then import the acknowledgement back into the NDFC server licensing section.
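To see which licensing mode a given switch landed in after an upgrade, the standard NX-OS show commands are enough (a sketch; output varies by NX-OS release):

switch# show license status
switch# show license usage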

Below are some of the Cisco Smart licensing links / doc for your information.  

Brownfield_Conversion_QRG

Cisco MDS 9000 Series Licensing Guide, Release 9.x - Smart Licensing Using Policy [Cisco MDS 9000 NX-OS and SAN-OS Software] - Cisco

Cisco MDS Smart Licensing Using Policy Data Sheet


Saturday, April 19, 2025

DCNM login error "RequestSendFailed: EJBCLIENT000409" for the SAN Client

We have a plan to build a new NDFC to manage our MDS switches.  It is more complex and will take some time to deploy; a few of the challenges are the resource requirements, the license transfer, and other new requirements.  In the meantime, we continue to use the existing DCNM 11.5(4).

Last week, a login error suddenly appeared with the Java SAN client: "RequestSendFailed: EJBCLIENT000409".  We restarted the services and rebooted the DCNM server, but it did not help.

[Screenshot: SAN client login error]

Some research suggested it was most likely certificate-related.  We do not use our own certificate, so most likely the default one had expired.  This was confirmed by checking the certificate expiration date from the web browser.  So we opened a ticket with support and asked them to renew the certificate for 2 more years.  After that, I could log in without issue.
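If you would rather check from a shell than the browser, openssl can pull the expiry date (the DCNM hostname below is a placeholder):

# Print the expiration date of the certificate DCNM presents on port 443
echo | openssl s_client -connect dcnm.example.com:443 2>/dev/null | openssl x509 -noout -enddate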

Thursday, April 17, 2025

Migration from VMAX 40k to PMAX 8000 and 8500

We just completed the storage migration from VMAX to PMAX 8000 and 8500 before Christmas.  NDM was not an option: NDM no longer supports Solaris, and with a number of busy databases running in both test and production, the SRDF directors between the VMAX and PMAX would have been the bottleneck before cutover to the PMAX.  Also, we did not have any spare directors left to configure for SRDF traffic.

Our plan was to use Storage vMotion for all VMDKs.  The RDMs are for SQL databases running on ESX; for those, we used SRDF/S to start replication to the PMAX in the background.  Because SRDF is not supported from VMAX to PMAX 8500, all LUN migrations using SRDF were replicated to the PMAX 8000.  The app owner then picks a downtime window to shut down the app and the servers.  We stop the sync after confirming there are no outstanding tracks, remove the VMAX LUNs from the storage group, then add the PMAX LUNs.

For Solaris, if it is an Oracle DB, LUNs of the same size or larger are added to Oracle; the DBA then completes the rebalancing and drops the old VMAX LUNs.  For boot LUNs and LUNs from other apps, some of our Unix admins used SRDF/S to migrate them to the PMAX 8000, while others chose host-based migration, in which case we just provide a LUN of the same size or larger from the PMAX 8500.

Below is a summary of the general steps for the storage migration from VMAX to PMAX 8000 using SRDF/S (see the SYMCLI sketch after the note below).  Check your environment and test whether additional steps are required.  If a volume manager is in use, the migration should be completed with the volume manager rather than SRDF.

1) Change the source LUN attribute on the VMAX to dyn_rdf.
2) Set up the SRDF/S pair from VMAX to PowerMax 8000 (put the target LUN in a temp target_SG).
3) During the downtime, shut down the apps and servers.
4) Confirm there are no outstanding tracks.
5) Perform the SRDF split.
6) Remove the source LUN from the VMAX storage group.
7) Add the target LUNs to the SG on the PowerMax and remove them from the temp target_SG.
8) Host team completes the LUN mapping.
9) Delete the SRDF pair with the force option.
10) Unset the GCM bit if required (symdev -sid xxx -devs xxx unset -gcm).
11) Host team performs a rescan if step 10 is required (they should see about 1 MB more space on the LUNs from step 10).
12) Power up the servers to validate.

That way, they can always go back to the VMAX LUNs if a backout is required.

Note: in the past, if the source/target LUNs were not mapped to FE ports, we sometimes saw strange results on some SRDF operations.  So I create a temp target_SG whose masking view uses an IG with no HBAs in it.
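To make the list concrete, below is an illustrative SYMCLI sequence for the SRDF steps.  All SIDs, device IDs, RDF group numbers, and SG names are placeholders, and the syntax should be verified against your Solutions Enabler release before use.

# 1) make the source device dynamic-RDF capable
symconfigure -sid 1234 -cmd "set dev 00A1 attribute=dyn_rdf;" commit

# 2) create and establish the SRDF/S pairs (pairs.txt lists R1/R2 device pairs)
symrdf createpair -sid 1234 -rdfg 10 -f pairs.txt -type R1 -establish -rdf_mode sync

# 4) confirm the pairs are fully synchronized (no outstanding tracks)
symrdf -sid 1234 -rdfg 10 -f pairs.txt verify -synchronized

# 5) split during the outage window
symrdf -sid 1234 -rdfg 10 -f pairs.txt split

# 6/7) swap the masking: pull the source devs, present the PMAX devs
symaccess -sid 1234 -type storage -name prod_sg remove devs 00A1
symaccess -sid 5678 -type storage -name prod_sg add devs 00B2
symaccess -sid 5678 -type storage -name temp_target_sg remove devs 00B2

# 9) drop the RDF relationship once cutover is confirmed
symrdf -sid 1234 -rdfg 10 -f pairs.txt deletepair -force

# 10) clear the GCM bit if required (as in step 10 above)
symdev -sid 5678 -devs 00B2 unset -gcm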



Isilon NDMP backup failed after NetWorker upgraded to 19.11

We need to update the firmware on the existing Isilon to 9.7.1.x in order to add the new Isilon nodes, as recommended by support.  Before we do that, we need to sort out NDMP backup: because we back up the Isilon over NDMP with NetWorker, a NetWorker upgrade from 19.10.0.4 to 19.11 is required first.  After the NetWorker upgrade in February, some of the NDMP backups failed with the error message "Hostname resolution failed".  Multiple retries eventually work, but it is very annoying.

After working with support, it turns out there is a new change in NetWorker 19.11.  See the KB: NetWorker: server upgraded to 19.11, backup fails reporting "Hostname resolution failed" | Dell US.

None of the workarounds in the KB above fix the issue on Isilon.  The forward DNS lookups for the Isilon are actually delegated by the DNS server to the SmartConnect SIP.  The only option is to add reverse entries on the DNS server for all the Isilon nodes that handle NDMP backup.  Refer to: SmartConnect and Reverse DNS | Dell PowerScale: Network Design Considerations | Dell Technologies Info Hub.

So PTR records for all the Isilon nodes handling NDMP backup were created on the DNS server under the NDMP zone name.  Once that was done, all backups were fine.  Keep in mind there is no change to the forward lookup, which is still handled by the SmartConnect SIP.  I am not sure if there is a proper fix yet.
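Once the PTR records are in place, a quick way to verify the reverse lookups from any client (the address and name below are placeholders):

# Reverse lookup should now return the node's name in the NDMP zone;
# forward lookups are still answered by the SmartConnect SIP, unchanged.
dig -x 10.1.1.11 +short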

Tuesday, April 15, 2025

Update Isilon to 9.7.1.4 from 9.4.0.14 to add new A300, H700 and F710 nodes

The main reason for the upgrade in January is to replace the existing A200, H500, and F800 nodes with new A300, H700, and F710 nodes.  Support recommends updating the firmware on the existing nodes to 9.7.1.4 before adding the new Isilon nodes to the pool.

Things went smoothly during the upgrade, and so far there are no issues.  Right after the upgrade, though, I had to reconfigure the following settings.

1) The SNMP node restriction got reset; I had to manually select nodes 1-4 again.  (In our environment, we only allow SNMP traps from the A200 nodes; those nodes handle only NDMP backup and are on their own VLAN.)

[Screenshot: SNMP node selection]

2) SNMPv3 is a new feature and gets selected automatically in the SNMP alert channel.  I manually selected SNMPv2 for our environment.

[Screenshot: SNMP alert channel settings]

3) The SMTP setting was reset back to manual settings (nothing was required since it was populated with the same SMTP info).
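As a post-upgrade sanity check, the alert channel and SNMP settings can also be reviewed from the OneFS CLI.  Treat the exact commands below as assumptions to verify on your release:

# Confirm the SNMP alert channel and SNMP settings survived the upgrade
isi event channels list
isi snmp settings view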

New features

1) IPv6 is a new feature and is enabled by default.

2) There is a new Transfer Limit feature for the spillover pool, set at 90%.  My understanding is that a pool will not be filled past 90% as long as another pool has usage below 90%.

3) Support Assist will be required for SCG in future OneFS releases.

4) Firewall

Wednesday, December 25, 2024

Create partition with Linux

Sometimes, for some odd reason, there is an issue deleting and re-creating a partition on a USB disk, and I have to use Linux to complete the task.  Warning: this erases everything on the USB disk.  Assume the device handle for the USB jump drive is /dev/sdb.

1) Run "sudo fdisk /dev/sdb".
2) Type "o" to create a new DOS partition table, then hit Enter.
3) Type "n" to create a new partition, then hit Enter.
4) Hit Enter to accept the default settings.
5) Type "t" then "7" to set the partition type to exFAT (HPFS/NTFS/exFAT).
6) Type "w" to write the changes; fdisk exits on its own ("q" would quit without saving).
7) Run "sudo mkfs.exfat -n LABEL /dev/sdb1" (replace LABEL with your volume label).
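For a non-interactive equivalent, here is a minimal sketch with parted, assuming the same /dev/sdb.  Double-check the device name first; this is just as destructive:

# Recreate the MBR partition table with one full-disk partition, then format
# it exFAT. Note: parted may not set fdisk's exact 0x07 type ID; adjust if
# Windows compatibility matters.
sudo parted -s /dev/sdb mklabel msdos mkpart primary 1MiB 100%
sudo mkfs.exfat -n LABEL /dev/sdb1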




Thursday, July 25, 2024

Isilon upgrade from 9.1.0.x to 9.4.0.14 to fix NFSv4 lock issue

We discovered the NFS lock issue with MQ back in OneFS 8.x.  We run MQ with its message queue stores on an Isilon NFS share.  When an Isilon node reboots for maintenance, the NFS lock does not fail over correctly to the remaining Isilon nodes, which hangs MQ; sometimes the server requires a reboot to recover.  There were multiple Isilon fixes for this, and it is finally fixed in 9.4.0.3.  We updated to 9.4.0.14 and confirmed the issue is fixed.  If MQ is running on Solaris, confirm Solaris is running at least 11.4 SRU69.

After the upgrade to 9.4.0.14, some settings were reset.

1) Spillover pool reset (set to Anywhere)

2) The SNMP node limit was reset (we only allow SNMP alerts from the NDMP nodes, which are the A200s in our environment).  I had to reconfigure it.

A new problem after the upgrade: random NDMP backup failures with a memory leak

New feature data inline introduced in 9.3 and enabled by default.  After a month of upgrade to 9.4, there is random NDMP backup failure due to memory leak.  Eventually, turn off data inline with support assistance.  Then reboot all the nodes running NDMP backup one by one to workaround the issue to clear the dedupe cache.  Permanent fix will be available in 9.7.1.x.