The Pardot application experienced a service disruption on Friday, July 8 from 9:35 a.m. UTC until 11:47 a.m. UTC. The root cause of this service disruption was misapplication of network access rules by an automated process. These access rules secure the Pardot production environment and incorrectly restricted communication between critical systems.
This service disruption resulted in degraded performance of the Pardot application. During this time, Pardot users could access the Pardot login page, but other pages in the application were inaccessible. Publicly available Pardot-hosted assets such as forms and landing pages may also have been unavailable during the disruption.
On Friday, July 8 at 9:49 a.m. UTC, our automated monitoring systems alerted us to higher than normal page load times in the Pardot application.
Upon receiving the alerts, we immediately initiated our incident response process. We observed that our production servers were unable to resolve DNS queries; DNS resolution is required for proper internal communication between the different systems that support the Pardot services.
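As an illustration of the kind of check involved in this diagnosis, the sketch below probes whether a set of hostnames can be resolved. This is a hypothetical example, not the tooling our responders actually used; the hostnames and the `check_dns` helper are assumptions for illustration.

```python
import socket

def check_dns(hostnames, timeout=2.0):
    """Return a mapping of hostname -> whether it currently resolves.

    A run of False results across internal service names would point
    at a resolution failure like the one observed in this incident.
    """
    socket.setdefaulttimeout(timeout)
    results = {}
    for host in hostnames:
        try:
            # getaddrinfo exercises the full resolver path (files, DNS, etc.)
            socket.getaddrinfo(host, None)
            results[host] = True
        except socket.gaierror:
            results[host] = False
    return results
```

A responder could run this against the internal hostnames each application tier depends on; any entry that flips to `False` narrows the fault to name resolution rather than the application itself.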
Coordinating with the broader Salesforce Site Reliability and Network Engineering teams, we identified a routine update that had recently changed network access control lists (ACLs) in the Pardot environment. This update was part of an automated process intended to ensure the integrity and correctness of these rules. Due to a change in naming convention for a subset of servers supporting the Pardot services, this automated process inadvertently deployed a set of ACLs that were still being developed and tested.
At 10:14 a.m. UTC, we confirmed that these updates to ACLs were in progress and the likely cause. At 10:39 a.m. UTC we began rolling back the ACL updates to their prior known-good state. The rollback process completed at 11:47 a.m. UTC, after which time no further impact was observed and an “all clear” was given.
Remedy & Future Prevention
To prevent this issue in the future, we are prioritizing the following improvements:
- We will work with the Network Engineering team to implement additional verification during the automated ACL update process.
- We will re-verify the ACLs that caused this incident and address issues that would arise from any future deployment of these ACLs.
- We will investigate implementing more automated health checking during the ACL change process, enabling us to quickly initiate a rollback upon any degradation in service health.
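The health-check-gated rollback described above can be sketched as a generic deployment wrapper. This is a minimal illustration of the pattern, not our actual change-management tooling; `apply_change`, `rollback`, and `is_healthy` are hypothetical callables standing in for the real ACL deployment, rollback, and service-health probes.

```python
import time

def deploy_with_health_gate(apply_change, rollback, is_healthy,
                            checks=3, interval=1.0):
    """Apply a change, then poll service health a fixed number of times.

    If any health check fails, roll the change back immediately.
    Returns True if the change was kept, False if it was rolled back.
    """
    apply_change()
    for _ in range(checks):
        time.sleep(interval)
        if not is_healthy():
            rollback()
            return False
    return True
```

The key property is that the rollback decision is automatic and happens within a bounded window after the change, rather than waiting for monitoring alerts and a human-driven investigation as in this incident.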
We thank you for your patience during this incident and for your continued trust in us.
Zach Bailey, Sr. Director of Software Engineering