Do what you love. Love what you do.
At Workday, we help the world’s largest organizations adapt to what’s next by bringing finance, HR, and planning into a single enterprise cloud. We work hard, and we’re serious about what we do. But we like to have fun, too. We put people first, celebrate diversity, drive innovation, and do good in the communities where we live and work.
About the Team
Workday is building a new SRE team responsible for deploying, operating, and supporting a state-of-the-art cloud-native service platform. The platform is built using Cloud Native (https://www.cncf.io/) technologies, on a foundation of Kubernetes in both Public Cloud and Private Cloud environments. This provides a secure platform on which dozens of Workday service teams and Platform development teams can build and test their pre-release code and deploy it to production on a continuous basis.
About the Role
The primary function of the SRE team is to ensure the reliability and availability of the platform so that it meets the desired SLAs, to reduce operational load, and to scale sustainably in alignment with business growth. We work closely with a dedicated Environment Operations team to support the customer-facing environments for our end customers, including patching those environments and resolving issues found during the patch process.
All SRE responsibilities and team growth will be supported by Service Level Objectives (SLOs).
Have ownership of:
- Overall system health, holding engineering teams accountable for meeting agreed SLOs such as latency and error rates.
- Weekly platform release preparation, including evolving the automation towards zero touch and following through on improvements identified post-patch.
- Vulnerability management, holding teams accountable for meeting customer-facing Service Level Agreements (SLAs).
Be responsible for:
- Updating the platform continuously in line with the major and minor release cycles of open source projects such as Kubernetes, Istio, and Calico.
- Providing expert-level support for the buildout of new Customer environments.
- Collaborating with cross-functional teams to come up with automation solutions.
- Leading new Service team onboarding engagements, in partnership with Platform Engineering Architects and Product Managers.
You have a passion for identifying and solving problems in distributed environments, spanning configuration, the Linux operating system, and networking. You have hands-on experience handling distributed environments (Kubernetes experience is a big plus). You have a keen interest in improving operational efficiency, and believe that automation is the key to operating large-scale systems. You are driven to ensure customer success.
- BS in Computer Science or related field.
- 3+ years of experience managing and troubleshooting distributed systems (AWS, GCP, Kubernetes, Docker).
- 5+ years of solid SRE experience in a distributed systems environment.
- Extensive engineering experience with Linux.
- Proficiency in at least one of Go (GoLang), Python, or Ruby, preferably Go.
- Bash and scripting experience.
- Understanding of standard software development methodologies such as code management and CI/CD.
- Passionate automator, with a track record of referenceable examples.
- Can work independently and with the attitude that everything can be automated.
- Skills and enthusiasm to operate, maintain, support and sustain the platform.
- Excited by working in a fast-paced environment.
- Experience collaborating with cross-functional, global, and remote teams with a diverse set of backgrounds.
- Excellent documentation skills, with experience developing detailed runbooks and processes.
- Ideally, you have knowledge and experience of public cloud platforms such as AWS, GCP, or Azure.
Workday is an Equal Opportunity Employer including individuals with disabilities and protected veterans.