Engineering Division - BCP Engineering - Associate - Bengaluru
Goldman Sachs
Bengaluru, Karnataka, India
The Engineering Resiliency and Recovery Specialist Engineer manages resiliency and recovery testing for critical applications within GS's regional Engineering division and Business Units. The ETO Recovery Engineering team establishes policies, governance, and standards to ensure business resilience and service continuity are properly verified.
The Resiliency and Recovery Specialist collaborates closely with Technology Infrastructure, Application, Risk Management, and Corporate Services teams to coordinate and ensure the seamless execution of Technology events, including building powerdowns, Data Center failover tests, and Disaster Recovery testing. Additionally, this role participates in team projects aimed at enhancing the effectiveness of these programs across the US, Asia, Bengaluru, and EMEA regions.
Resiliency BCP testing ensures essential business functions continue during emergencies of any kind. The Resiliency and Recovery Specialist develops and tests plans to reduce disruptions and protect the firm's operations, reputation, and financial stability.
In this capacity, the individual will develop scenario-based test events and verify recovery plans against them. The Engineering Resiliency and Recovery Specialist will work closely with Technology Infrastructure, Risk Management, Corporate Services and application development teams to plan and ensure the smooth execution of Technology (application and infrastructure) events such as pandemic tests, concentration testing and Disaster Recovery testing, and will work on projects to improve the effectiveness of these programs across all regions (Americas, Asia, Bengaluru and EMEA).
III. OUR IMPACT
Division Description: THE TECHNOLOGY DIVISION
Our team of engineers builds solutions to the most complex problems. We develop cutting-edge systems and processes that form the core of our key business and enable transactions to move in milliseconds. We provide real-time access to critical deal information and crunch billions of data points each day to inform firm-wide market insights and strategies. Team members have the opportunity to work at the forefront of technology innovation alongside industry leaders and make significant contributions to the field.
Team Description: ENTERPRISE TECHNOLOGY OPERATIONS – ENGINEERING Resiliency and Recovery Specialist
The Engineering Resiliency and Recovery Specialist is responsible for strategic initiatives that reduce risk and improve resiliency throughout the operational organization. BCP ensures that effective Engineering recovery plans are in place and comply with the firm’s overall resiliency strategies, so that operations continue during crisis events. BCP provides a platform for Engineering to validate recovery strategies across people, products, platforms and functions. To reduce resiliency risk, BCP drives adoption of Core and ETO platforms and products that enforce adherence to controls, automate recovery plans, and reduce manual work. The organization also identifies applications in scope for resiliency and recovery testing, tracks exemptions, and maintains evidence for Business Continuity test credit. BCP addresses the recovery sequencing problem through topology mapping between applications and infrastructure, from which it derives the recovery order and identifies key dependencies during outages.
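As a rough illustration of the recovery sequencing idea described above, a dependency topology can be turned into a recovery order with a topological sort. The sketch below is a minimal example, not the team's actual tooling: the component names, the `TOPOLOGY` mapping, and the helper functions are all hypothetical, and it assumes the topology is available as a simple "component depends on" mapping.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical topology: each component maps to the components it depends on.
TOPOLOGY = {
    "network": [],
    "storage": ["network"],
    "auth-service": ["network"],
    "database": ["storage", "network"],
    "trading-app": ["database", "auth-service"],
}


def recovery_waves(depends_on):
    """Group components into waves; each wave can start once the
    previous waves have been recovered."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


def direct_dependents(depends_on):
    """Count how many components depend directly on each component,
    as a simple indicator of key dependencies."""
    counts = {name: 0 for name in depends_on}
    for deps in depends_on.values():
        for dep in deps:
            counts[dep] = counts.get(dep, 0) + 1
    return counts


if __name__ == "__main__":
    for number, wave in enumerate(recovery_waves(TOPOLOGY), start=1):
        print(f"Wave {number}: {', '.join(wave)}")
    print(direct_dependents(TOPOLOGY))
```

In this made-up topology, infrastructure with no dependencies is recovered first and the application that depends on everything else comes last. A component with many dependents (here, the network) is a candidate key dependency to verify early in a test.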
IV. HOW YOU WILL FULFILL YOUR POTENTIAL
- Lead Infrastructure and Application Disaster Recovery testing and Data Center Power-down events
- Drive adoption of the mandated controls among application teams.
- Provide guidance to application owners on how to adapt their recovery procedures to adhere to the uplifted controls.
- Scope Disaster Recovery test events to include the interdependencies of shared services, upstream and downstream application dependencies, order of recovery, etc.
- Lead Cyber Attack Recovery Testing: drive teams to become resilient and able to recover during a cyber attack, and test the cyber attack recovery procedures.
- For Power-down events, establish critical milestones and the order of recovery, and verify dependencies among infrastructure components.
- Coordinate and manage regulatory resiliency recovery tests, such as SIFMA's industry-wide exercises, SPOOR-related tests, and those guided by the Monetary Authority of Singapore (MAS), to ensure compliance with industry standards and regulatory requirements. This involves liaising with various internal and external teams, scheduling test activities, monitoring progress, and documenting outcomes to support robust audit and risk management processes.
- Identify gaps in process and procedures and enhance those processes.
- Identify opportunities for automation
- Oversee and manage the execution plans
- Initiate inventory, infrastructure and application ready-for-business checks
- Manage incidents and escalations related to the activities we perform.
- Bachelor's degree
- Minimum 4-5 years of experience across the technology stack, including infrastructure and applications
- Experience in managing resiliency testing for on-prem databases, NAS, object storage, block storage, etc.
- Understanding of disaster recovery procedures
- Understanding of RTO, RPO and how these metrics are calculated
- Knowledge of the differences between resiliency testing and cyber attack recovery (Repave) testing
- Background in cyber attack recovery
- Background in disaster recovery.
- Strong analytical, communication, interpersonal, problem solving, organizational and time management skills
- Basic understanding of Excel, including the basic formulas used to manipulate data
- Self-motivated with an ability to work on one’s own with a strong sense of ownership and accountability
- Highly organized, strong attention to detail and excellent follow-up skills
- Strong process and project management skills with the ability to improve process efficiency and effectiveness
- Strong written and verbal communication skills with an ability to summarize complicated technical information to people with less technical knowledge
- Excellent influencing skills at all levels and the ability to develop and maintain good relationships with senior leadership, colleagues and clients
- 5-7 years of experience in disaster recovery and cyber attack recovery programs.
- Hands-on experience in managing resiliency testing for on-prem databases, NAS, object storage, block storage, etc.
- Hands-on expertise with Cloud platforms (AWS, Azure, GCP) and Kubernetes to support and manage DR activities
- Has been a key player in building a disaster recovery program, with extensive knowledge of RTO, RTA, RPO, RPA, MTD and other DR metrics
- Has guided teams in building recovery test plans and understands what should be in disaster recovery plans
- Solid understanding of core Data Center Infrastructure (Network Appliances, Storage technology, Unix/Linux/Windows, IP Telephony, etc.) and the order of recovery in case of an incident
- Strong understanding of Excel formulas used for data manipulation
- Project Management skills with ability to coordinate multiple Disaster Recovery tests and/or power down events simultaneously
- An understanding of one or more of the following Technology Risk domains: information security, business continuity, technology resilience, controls monitoring, risk assurance and risk governance
- Prior experience in a System Administrator or Application Support role
- Ability to perform analysis or troubleshooting when an issue arises and provide possible alternatives to help establish solutions and confirm remediation of the issue
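
For candidates unfamiliar with the DR metrics named in the requirements above, the sketch below shows one common way the actuals (RTA, RPA) are derived from test timestamps and compared with the objectives (RTO, RPO). It is an illustrative assumption, not the firm's definition: the class, field names and timestamps are hypothetical, and organizations differ on exactly when the recovery clock starts (for example at outage start versus at disaster declaration).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryTestResult:
    """Timestamps captured during a (hypothetical) recovery test."""
    last_good_data_point: datetime  # last backup or replica point that can be restored
    outage_start: datetime          # assumed start of the recovery clock
    service_restored: datetime      # service confirmed ready for business

    @property
    def rta(self) -> timedelta:
        """Recovery Time Actual: how long the service was unavailable."""
        return self.service_restored - self.outage_start

    @property
    def rpa(self) -> timedelta:
        """Recovery Point Actual: the window of data lost, measured in time."""
        return self.outage_start - self.last_good_data_point


def meets_objectives(result: RecoveryTestResult, rto: timedelta, rpo: timedelta) -> bool:
    """A test passes when both actuals are within their objectives."""
    return result.rta <= rto and result.rpa <= rpo


if __name__ == "__main__":
    # Made-up timestamps for illustration only.
    result = RecoveryTestResult(
        last_good_data_point=datetime(2024, 1, 1, 9, 45),
        outage_start=datetime(2024, 1, 1, 10, 0),
        service_restored=datetime(2024, 1, 1, 11, 30),
    )
    print("RTA:", result.rta)
    print("RPA:", result.rpa)
    print("Meets RTO 2h / RPO 30m:",
          meets_objectives(result, timedelta(hours=2), timedelta(minutes=30)))
```

With these sample timestamps, the service was down for 90 minutes and 15 minutes of data fell outside the last recoverable point, so the test would meet a 2-hour RTO and 30-minute RPO but miss a 1-hour RTO.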