Monday, July 25, 2022

Isilon replication performance issue II

We had no issues with replication on those 2 big shares for about half a year.  Replication is set up to run every 4 hours, and each run normally completes within 20 minutes.  Then one day, I received an email alert from the target Isilon.

"PDM degraded, too many operations alert"  

After that, the incremental replication took longer and longer to complete for the share with lots of files and a deep folder structure.  Other shares were not affected.  We opened a support ticket, and support confirmed we were affected by bug 132337.  Support provided a workaround for the issue.  After applying the workaround, replication was back to normal.

The final fix should be in OneFS 9.2.1.13 / 9.1.0.20.

Sunday, July 24, 2022

Isilon replication performance issue I

We have a couple of shares that require replication to the target Isilon.  Each share holds hundreds of millions of files, and one of them has a deep folder structure.  We only set limits on resource usage for replication, but this was not good enough.  The CPU limit is set to 25%.  We confirmed that replication can consume at most 10% of the trunk.  Workers are set to 33% max at all times, as recommended by the vendor during initial setup.

First, we made a mistake at the beginning: we did not read through every word of the replication documentation.

If the source share contains lots of files and/or a deep directory structure, the Domain Mark job can consume a lot of resources on the first sync.  It is recommended to set up replication first and run the initial sync with the “Prepare Policy for Accelerated Failback Performance” box checked before migrating data to the source share.

However, we did the complete opposite.  We migrated the share from the Veritas cluster to the source Isilon share, then set up replication and enabled sync for the first time.  Once the data copy completed and the Domain Mark job started, it began to take away resources.  The application using the share is very sensitive to performance and complains when response time exceeds 25 ms.  At times, share response time jumped to 40 ms, and we received complaints from the application owner.

The workaround suggested by support was to set vfs.vnlru_reuse_freevnodes to 1.  The command can be run from any node.  We saw a significant decrease in resource usage afterward.  (Please check with support to confirm the cause before running any command.)

 # isi_sysctl_cluster vfs.vnlru_reuse_freevnodes=1  

Once the first sync and the first Domain Mark job completed successfully, we had no further issues.


SNMP for Isilon

We tried to set up SNMP for Isilon but did not find much documentation.  Hardware issues are monitored by the vendor through eSRS, and alerts are also sent to us through SMTP.

We wanted to use SNMP to monitor share usage for NFS.  We ran a few tests to confirm the messages Tivoli sees when usage exceeds the advisory and soft limits, and the messages it sees when usage continues to violate the limit.

The main issue we discovered was that Tivoli did not receive the alerts all the time.  Later on, we found out that SNMP alerts can be sent from any node in the cluster.  In our Isilon setup, we have at least 5 zones on different subnets.  Because of firewall rules, not every subnet can communicate with the SNMP manager.  We have 3 types of nodes: Flash nodes (serving FE traffic), Hybrid nodes (serving some less critical applications and replication) and Archive nodes (backup traffic only).  We decided to use the Archive nodes for monitoring.  We confirmed that the SNMP manager is reachable from the backup subnet, then modified the Isilon Alert section and created an SNMP monitoring channel.


Modify the channel so that only the Archive nodes can send out SNMP alerts.  In my setup, these are nodes 1 - 4.  Now, SNMP alerts are only sent through the Archive nodes, which have no issue reaching the SNMP manager.



CertUtil to verify MD5 and SHA256 checksum

When we download files from a vendor, we normally use a 3rd party tool to verify the checksum. With Windows 10, you can use the built-in CertUtil instead.

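1) Open Command Prompt on the Windows desktop.
2) Enter the command CertUtil -hashfile "path to the file" hash-function-type, where hash-function-type is the algorithm to use, such as MD5 or SHA256.

Examples for MD5 and SHA256:

 CertUtil -hashfile "path to the file" MD5
 CertUtil -hashfile "path to the file" SHA256

Compare the hash that is printed with the checksum published by the vendor. No need to download a 3rd party tool to verify the checksum.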
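
On a machine without CertUtil, the same check could be done with a few lines of Python. This is only a rough sketch using the standard hashlib module; the function names file_digest and matches_published are made up for illustration, and the comparison assumes the vendor publishes the checksum as a plain hex string.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of the file at path (hypothetical helper)."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so a large vendor image does not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path, published_hex, algorithm="sha256"):
    """Compare the computed digest with the published one, ignoring case and spaces."""
    cleaned = published_hex.replace(" ", "").strip().lower()
    return file_digest(path, algorithm) == cleaned
```

For example, matches_published("path to the file", "checksum from the vendor page", "md5") returns True only when the download matches the published MD5 value.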