OPT-EAD/H1B/GC - U.S.-based candidates only
Job Summary
We are seeking a highly skilled Site Reliability Engineer (SRE) with development and application performance monitoring experience to collaborate closely with development and engineering teams to ensure our services and systems are stable, meet performance standards, and fulfil the expectations of our business partners and end users. As an SRE, you will leverage your expertise in software development, complexity analysis, and scalable system design to deliver automation solutions that ensure high availability and resiliency. Additionally, you will work with the Incident Management Engineering team to address active incidents and drive long-term availability and performance improvements.
Duties and Responsibilities
In this role, you will:
• Monitor system performance, identify areas for improvement, and implement solutions to enhance reliability and availability
• Guide architecture and development teams on how to make applications highly available, reliable, and performant at a global scale
• Partner with architecture teams to ensure operability, measurability, and manageability are accounted for in business features and enablers
• Collaborate with product owners and managers to implement and monitor key metrics to meet SLOs and SLAs
• Collaborate with development team members to troubleshoot and resolve problems
• Drive the Root Cause Analysis of production issues and other failures within the product software, pipeline, or other DevOps support processes or technology
• Design, build, and champion automated solutions and tasks to optimize application/service/platform uptime with minimal human intervention
• Develop tools and processes to monitor AWS resources and cloud applications
• Use Kubernetes and Docker to deploy platform services
• Create and implement standards and best practices, driving adoption across development teams and external vendors as applicable
Requirements and Qualifications
Expertise and/or relevant experience in the following areas is mandatory:
• Bachelor's degree or higher in Computer Science or a related technical discipline
• 5+ years of experience in the deployment, administration, and troubleshooting of large-scale distributed systems
• 5+ years of experience in automation programming in one or more scripting or programming languages: Python, Golang, Java, Ruby, Rust, or JavaScript (with proficiency in Python and React Engine)
• Strong understanding of Unix/Linux operating system internals and administration
• Strong understanding of networking (e.g. TCP/IP, routing, network topologies, and hardware), storage systems, and database systems
• 5+ years of hands-on experience with cloud service concepts and cloud networking
• Strong experience troubleshooting across enterprise applications, operating systems, and networking layers
• Strong development experience in debugging and optimizing code, software deployment, and automating routine tasks
• Strong understanding of Kubernetes, Docker containers, MySQL, and microservice systems
• Experience with Application Performance Monitoring tools (New Relic, AppDynamics, Dynatrace, etc.)
• Experience with distributed streaming systems (Kafka, Kinesis, etc.)
• Excellent organizational skills in planning and prioritizing own workload and initiatives
• Strong skills in problem-solving and communication
• SRE experience including:
  o Monitoring (Prometheus, Grafana)
  o Alert creation
  o Alert tuning
• Remote work; must be willing to work West Coast hours (10 AM – 8 PM PST)
• Willing to join an on-call rotation and participate in troubleshooting and communication efforts outside of normal business hours
Expertise and/or relevant experience in the following areas is preferred:
• Strong communication and presentation skills
• Excellent command of the English language (written and spoken)