Product Support Engineer
Location: Kuala Lumpur, Malaysia
Employment Type: Permanent
Industry: Banking & Financial Services
About the Role
Our client, a leading global financial institution, is seeking an experienced Technology Service Manager to oversee major incident management, service restoration, stakeholder communications, and operational resilience across critical banking platforms. The successful candidate will play a key role in ensuring high availability of customer-facing services while driving continuous improvements in incident response and operational excellence.
Key Responsibilities
Incident & Service Management
- Manage and coordinate major technology incidents impacting customers and business operations.
- Assess business impact and determine incident priority based on established severity frameworks.
- Mobilize technical support teams and facilitate incident bridges to drive rapid service restoration.
- Ensure timely escalation and execution of incident management processes.
- Coordinate problem management activities and support Root Cause Analysis (RCA) reviews.
Stakeholder & Business Communication
- Act as the primary liaison between technology teams and business stakeholders during incidents.
- Provide clear and timely updates to senior management, business leaders, risk teams, and operational stakeholders.
- Run real-time communication channels and business bridges during major incidents.
- Maintain incident dashboards and reporting mechanisms for ongoing service disruptions.
Operational Excellence & Knowledge Management
- Ensure incidents, outage details, business impacts, and recovery actions are accurately documented.
- Support trend analysis, service reporting, and continuous improvement initiatives.
- Capture lessons learned and ensure preventive measures are implemented to reduce recurrence.
- Maintain knowledge repositories, operational documentation, and recovery procedures.
Strategy & Continuous Improvement
- Support initiatives focused on customer experience monitoring and service reliability.
- Drive automation and workflow improvements within incident management processes.
- Participate in operational resilience, business continuity, and failure analysis programs.
- Contribute to innovation initiatives that improve incident response and reduce service impact.
Risk & Governance
- Identify operational risks and control weaknesses, ensuring appropriate remediation and escalation.
- Participate in governance forums and provide operational metrics and compliance reporting.
- Support adherence to incident, problem, and change management processes.
Requirements
Mandatory
- Minimum 5 years of experience in Production Support, Site Reliability Engineering (SRE), IT Operations, or Technology Service Management within the banking or financial services sector.
- Strong experience in Incident Management, Problem Management, and Change Management.
- Solid understanding of Linux/Unix systems, networking fundamentals, and application architectures.
- Hands-on experience with monitoring and observability platforms such as Grafana, ELK, AppDynamics, ITRS, BMC, or equivalent tools.
- Excellent communication and stakeholder management skills.
- Willingness to work rotational day and night shifts.
Preferred
- Experience with cloud technologies, Kubernetes, OpenShift, and containerized environments.
- Scripting or automation experience using Python, Shell, or similar languages.
- Experience working in highly regulated industries.
- Exposure to ServiceNow, CI/CD pipelines, and modern SRE practices.
What You'll Bring
- Strong crisis management and decision-making capabilities.
- Ability to communicate effectively with both technical and non-technical audiences.
- Customer-focused mindset with a passion for service reliability and operational excellence.
- Experience managing high-severity incidents in complex enterprise environments.
- Continuous improvement mindset with a focus on automation and operational efficiency.