
Monday, August 28, 2023

Isilon multiscan moves data between tiers unexpectedly

After the Isilon was implemented 2.5 years ago, we decided to stop the SmartPools job and use a combination of the IndexUpdate and FilePolicy jobs to save resources on the Isilon.  We have 4 SSD nodes handling front-end traffic, 4 hybrid nodes handling replication plus some test traffic, and 4 SATA nodes handling backup.  Initially, the SATA nodes were deployed with insufficient memory, which caused a couple of outages during NDMP backup.  After maxing out the memory on the SATA nodes, there have been no more issues.

Because we want to keep as much data as possible on SSD, we don't run FilePolicy until SSD used space is over 70%.  After a firmware update (which reboots the nodes), we saw data move between tiers because MultiScan kicked in automatically.  That was about 2 years ago.  We checked with support, they said it was OK to kill it, so we just killed the MultiScan job.  Recently, we finally had our first disk failure on the production Isilon and MultiScan kicked in automatically again.  Same as before, we saw data moving unexpectedly to the SATA tier.  I suspected it was related to FilePolicy, but support told me it should not be.  After working the case with support for a long time, we finally got a suggestion from a higher level of support to run the SmartPools job periodically.

So I played with the DR Isilon since it is not busy.  Using the IndexUpdate and FilePolicy jobs to move the data and comparing the tier usage multiple times, each tier finally reached the utilization I want: 60-70% for SSD and below 60% for the SAS pool (the spillover pool).  Then I kicked off the MultiScan job.  Now I don't see any more data moving between tiers when MultiScan runs.
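For reference, here is a rough sketch of the sequence I run on the DR cluster from the CLI.  The job names are the standard OneFS job engine types; verify the exact options with --help on your release before running anything.

 isi job jobs start IndexUpdate     # rebuild the file system index first
 isi job jobs start FilePolicy      # apply the file pool policies using that index
 isi storagepool list               # check tier / pool utilization between runs
 isi job jobs start MultiScan       # only once the tiers sit where you want them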

In the future, I will adjust the FilePolicy and then kick off the SmartPools job once a month, just in case MultiScan is started by a failed HDD or a node replacement.




===========================================================================

Update Sep 25, 2023

For the production Isilon, after IndexUpdate completes, I run FilePolicy and then the MultiScan job.  I still see data move to the SATA pool.  So I run IndexUpdate -> FilePolicy -> SmartPools.  Same thing.  It looks like there is a discrepancy between IndexUpdate + FilePolicy and the SmartPools job.

This time, I just run the SmartPools job and kill it once the SATA pool reaches 85%.  Then I adjust the FilePolicy and rerun the SmartPools job.  After 3 tries, I finally see the results I want.  Why there is a discrepancy between FilePolicy and the SmartPools job, I have no idea.
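For anyone doing the same thing, the watch-and-kill part can be done from the CLI.  A rough sketch (the job ID is whatever shows up in the list output; double-check the options with --help):

 isi storagepool list           # watch the SATA pool utilization
 isi job jobs list              # find the ID of the running SmartPools job
 isi job jobs cancel <job-id>   # kill it once the SATA pool reaches ~85%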

From now on, I will adjust the FilePolicy once every 2 months and run the SmartPools job.  For the DR Isilon, since it has far less data, I will adjust the FilePolicy once every 3-4 months and then run the SmartPools job.

Monday, July 25, 2022

Isilon replication performance issue II

We had no issues with replication on those 2 big shares for about half a year.  Replication is set up to run every 4 hours and normally completes within 20 minutes.  Then suddenly one day, I received an email alert from the target Isilon.

"PDM degraded, too many operations alert"  

Incremental replication took longer and longer to complete for the share with lots of files and a deep folder structure.  Other shares were not affected.  We opened a support ticket and confirmed we were hitting bug 132337.  Support provided a workaround for the issue, and after applying it, replication was back to normal.
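The slowdown was easy to see in the SyncIQ reports.  Roughly, this is what I check (the policy name and job ID below are placeholders; confirm the arguments with --help on your release):

 isi sync jobs list                        # replication jobs currently running
 isi sync reports list                     # past runs with start time and duration
 isi sync reports view <policy> <job-id>   # details for one run of one policy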

The final fix should be in OneFS 9.2.1.13 / 9.1.0.20.

Sunday, July 24, 2022

Isilon replication performance issue I

We have a couple of shares that require replication to the target Isilon.  They contain hundreds of millions of files within the share, and one of them has a deep folder structure.  We only set up limits on resource usage for replication, but that is not good enough.  CPU is limited to 25%, only 10% of the trunk can be consumed by replication, and workers are set to 33% max at all times, as recommended by the vendor during the initial setup.
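For reference, those limits show up as SyncIQ performance rules and can be reviewed from the CLI.  A rough sketch (rule IDs come from the list output; bandwidth / CPU / worker rules are created with isi sync rules create, whose arguments vary by OneFS release, so check --help first):

 isi sync rules list            # bandwidth / CPU / worker rules currently in place
 isi sync rules view <rule-id>  # details of a single rule
 isi sync settings view         # global SyncIQ settings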

First, we made a mistake at the beginning and did not read through every word of the replication documentation.

If the source share contains lots of files and / or a deep directory structure, the Domain Mark job can consume a lot of resources on the first sync.  It is recommended to set up replication first and run the initial sync with the “Prepare Policy for Accelerated Failback Performance” box checked, before migrating data to the share on the source.
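On the CLI side, that checkbox corresponds to the accelerated failback option on the SyncIQ policy.  A rough sketch (the policy name is a placeholder, and the flag name / value format should be verified with isi sync policies modify --help on your release):

 isi sync policies view <policy-name>                                # confirm the current setting
 isi sync policies modify <policy-name> --accelerated-failback true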

However, we did the complete opposite.  We migrated the share from the Veritas cluster to the source Isilon share, then set up replication and enabled the sync for the first time.  Once the data copy completed and the Domain Mark job started, it began to take away resources.  The application using the share is very sensitive to performance: it complains when latency exceeds 25 ms, and at times the share latency jumped to 40 ms.  We received complaints from the application owner.

The workaround suggested by support is to set vfs.vnlru_reuse_freevnodes to 1.  This can be run on any node.  We saw a significant decrease in resource usage afterwards.  (Please check with support to confirm the cause before running any command.)

 # isi_sysctl_cluster vfs.vnlru_reuse_freevnodes=1  

Once the first sync and the first Domain Mark job completed successfully, we had no further issues.


SNMP for Isilon

We tried to set up SNMP for Isilon but did not find much documentation.  Hardware issues are monitored by the vendor through eSRS, and we are also alerted through SMTP.

We wanted to use SNMP to monitor the share usage for NFS.  We ran a few tests to confirm the message Tivoli sees when usage exceeds the advisory and soft limits, and also the message it sees when usage keeps violating the limit.
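The advisory and soft limits here are SmartQuotas thresholds on the share path.  As a rough sketch of how such a quota is defined (the path, sizes and grace period are made up, and the option names should be verified with isi quota quotas create --help):

 isi quota quotas create /ifs/data/share1 directory --advisory-threshold 800G --soft-threshold 900G --soft-grace 7D --hard-threshold 1T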

The main issue we discovered is that Tivoli does not receive the alerts all the time.  Later on, we found out the cluster can send SNMP traps from any node.  With our Isilon setup, we have at least 5 zones on different subnets, and because of firewall rules, not all of those subnets can reach the SNMP manager.  Since we have 3 types of nodes, Flash nodes (serving front-end traffic), Hybrid nodes (serving some less critical applications and replication) and Archive nodes (backup traffic only), we decided to use the Archive nodes for monitoring.  We confirmed SNMP can reach the backup subnet.  Then we modified the Isilon Alerts section and created an SNMP monitoring channel.


We modified the channel so that only the Archive nodes can send SNMP alerts (see below).  In my setup, those are nodes 1 - 4.  Now SNMP alerts are only sent through the Archive nodes, which have no issue reaching the SNMP manager.
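From the CLI, the equivalent looks roughly like this.  SNMP_Monitoring is just the name of the channel we created, and --allowed-nodes is how I understand the node restriction is expressed, so confirm the option names with isi event channels modify --help on your release:

 isi event channels list
 isi event channels modify SNMP_Monitoring --allowed-nodes 1,2,3,4   # archive nodes only
 isi event channels view SNMP_Monitoring                             # confirm the change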



Saturday, March 26, 2022

Isilon NDMP throttle

There are 3 shares on the Isilon with a very complex folder structure.  They are deep and contain lots of small files.  Again, you need to follow the tech spec guide: you cannot have more than 1 million files per folder.  We have an NFS share with over 100,000,000 files, but we create a new folder under the share each year.

The only issue is backup.  When the NDMP backup runs, it uses a lot of CPU when it reaches certain folders; it can hit 80% CPU usage.  The same thing happened when we used Datadobi to migrate to the Isilon: when it reached those folders, the scan speed dropped by 50%.

I have Gen 6 nodes with OneFS 9.1.  From the doc below, the throttler requires Gen 6 nodes and the command is available starting in OneFS 8.2 (see link below).  It looks like this setting can be modified through the CLI only; at least with 9.1, I don't see it in the GUI.

Isilon OneFS Help

1. Run the following command through the command line interface to enable NDMP Throttler:  

isi ndmp settings global modify --enable-throttler true

2. View the setting by running the following command:

isi ndmp settings global view

3. The default CPU throttle threshold is 50%.  I used the command below to change it to 65.

isi ndmp settings global modify --throttler-cpu-threshold 65


Isilon spare settings

Open the SmartPools settings.

Enable Global Spillover if you have more than 1 pool.

Make sure Virtual Hot Spare is selected.  Our environment is small, so we use 1 virtual drive.  For a bigger environment, use a percentage of total storage instead; the EMC tech suggests 10% of total storage.
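The same settings can also be checked and changed from the CLI.  A rough sketch (option names are from memory, so verify with isi storagepool settings modify --help before running):

 isi storagepool settings view                                           # shows spillover and virtual hot spare settings
 isi storagepool settings modify --virtual-hot-spare-limit-drives 1      # what we use in our small environment
 isi storagepool settings modify --virtual-hot-spare-limit-percent 10    # percentage option for bigger clusters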

Sunday, January 17, 2021

NFSv4 for Isilon settings issue

Recently, we tried to set up an NFS share on the Isilon using NFSv4.  We had issues mounting the share even with a non-root user, and ownership showed up as nobody/nobody.  That does not happen with NFSv3.  The NFS server belongs to a domain, so to fix the issue, we entered the domain name in the NFSv4 Domain field.  However, we have a server that belongs to multiple domains.  We tested by setting the NFSv4 Domain back to localdomain and enabling NFSv4 no names.  Now we don't have any issues mounting NFS shares on the Isilon.
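For anyone who prefers the CLI over the GUI, the NFSv4 domain and the no-names behaviour are zone-level NFS settings.  A rough sketch (the System zone and localdomain values are just what we used; verify the option names with isi nfs settings zone modify --help on your release):

 isi nfs settings zone view --zone System
 isi nfs settings zone modify --zone System --nfsv4-domain localdomain --nfsv4-no-names true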