We are looking for a self-driven engineer to help scale our growing public cloud presence. The team is responsible for cross-cutting cloud management capabilities, and its members are the experts on the state of cloud platforms at any moment. The team comes from diverse technical backgrounds, and the role offers a variety of challenges that span both software and systems work. Ideal candidates will have a background in either software engineering or systems engineering with a desire to learn the other, or previous experience as an SRE. As a Site Reliability Engineer in the Cloud domain, you will have the opportunity to shape the cloud operations that support over 30 million investors and their financial futures.
The Expertise We’re Looking For
- Bachelor’s degree or higher in a technology-related field (e.g. Engineering, Computer Science) required; Master’s degree a plus
- 4+ years of hands-on experience deploying and/or supporting highly distributed multi-tiered systems at scale.
- Hands-on experience with Public Cloud environments, preferably AWS and Azure. Certifications a plus.
- Experience with container orchestration, preferably with Kubernetes
- Collaboration and Relationships - Ability to work with a variety of individuals and groups, both in person and virtually, in a constructive and collaborative manner and build and maintain effective relationships
- Overall 8+ years of working experience is required
- Experience with enabling and managing cloud services, usage, and optimizations
- 4+ years of experience designing, implementing and managing Kubernetes (EKS, AKS)
- Experience with programming languages such as Python and Go
- Experience designing, implementing and hosting solutions based in AWS and/or Azure
- AWS, Azure and/or Kubernetes certifications
- Bachelor’s Degree in Computer Science, Mathematics, or related sciences – or equivalent work experience
- 8+ years of IT experience
The Skills You Bring
- Hands-on experience setting up observability and resiliency for platforms and applications
- Solid understanding of, and experience with, networking, virtualization, storage, containers, and serverless
- Experience with Linux systems, ideally with experience in systems administration
- Ability to automate with various scripting languages (Python, Shell scripting, etc.)
- Experience managing systems using infrastructure as code tools (IAM, ARM, Terraform, Chef, …)
- Demonstrated ability to utilize modern monitoring tools (DataDog, Prometheus, Splunk, …)
- Proficiency with CI/CD tools, especially Jenkins
- Ability to triage, execute root cause analysis, and be decisive under pressure
- Experience managing and interpreting large datasets using query languages and visualization tools
- Ability to think in systems and apply technical and non-technical problem solving
- Experience with Enterprise IT asset management or other related practices
- Proficient communication skills, with the ability to reach both technical and non-technical audiences
- Desire to call yourself a Site Reliability Engineer and a commitment to reducing toil
The Value You Deliver
- Help define and execute a comprehensive cloud reliability and observability strategy, ensuring that cloud systems are always available when our customers need them.
- Bring together technical, procedural, and financial data to reduce toil and increase efficiency
- Execute plans for technical standardization and process refinement within the engineering organization, especially for Site Reliability Engineers.
- Troubleshoot stack-wide engineering issues related to hardware, software, network, applications and cloud service providers.
- Take part in peer code reviews, providing qualitative feedback and facilitating a learning environment through the equitable exchange of ideas.
- Coach peer SREs and development teams on how to build highly available cloud systems