Our client is a remote-first company with team members across the globe, offering a SaaS-based Learning Management System that powers the world's leading education programs. Our client helps large brands and fast-moving companies increase revenue, improve customer retention, and decrease support costs through external education. The platform includes all the tools an organization needs to create, manage, track, and improve highly personalized learning experiences for customers, partners, and employees.
Successful Candidate:
- Experienced and able to thrive in a small-to-medium, high-growth environment
- Invested in upskilling and learning new tech
- Deeply curious, creative, and innovative
- Flexible in working hours, with the ability to collaborate across different time zones
The Lead Site Reliability Engineer has a pivotal role at the forefront of our engineering operations, responsible for guiding the Platform Team toward achieving exceptional standards of reliability, performance, and stability across all our applications. The successful candidate will possess deep expertise in these core areas and will be instrumental in defining and implementing industry-leading practices. As a key leader, this role will not only shape the strategic direction of our platform operations but also establish the benchmarks and processes by which our engineering excellence is measured.
Responsibilities:
- Lead the SRE Team, setting clear goals and priorities in line with business objectives. In collaboration with the department Director, develop and execute strategies that enhance technological capabilities across the company.
- Ensure all platforms and systems operate smoothly and remain highly available, scalable, and fault-tolerant. Implement best practices for continuous monitoring, preventive maintenance, and rapid response.
- Continuously assess system performance, identify bottlenecks, and make data-driven recommendations for infrastructure enhancements.
- Ensure that developers have access to the best tools and platforms to facilitate efficient coding practices and understand the performance of applications.
- Educate the rest of engineering about best practices for writing performant code, and troubleshoot problematic areas.
- Develop and refine incident management protocols. Lead efforts to troubleshoot and resolve high-impact issues, minimizing downtime and preventing future occurrences.
- Work closely with other engineering teams and departments to understand their needs and ensure platform initiatives support overall company goals.
- Monitor virtual infrastructure and be part of a 24x7 on-call rotation to respond to alerts.
Requirements:
- 5+ years of recent experience working with Ruby on Rails
- Recent, direct experience with JavaScript and Node.js
- Experience leading more junior teammates
- 3+ years of experience working in infrastructure and operations
- Expertise with SQL databases such as PostgreSQL
- Experience with cloud computing on Amazon Web Services and/or Google Cloud
- Ability to dig into unfamiliar code bases
- Ability to document solutions and train operational teams on supportability
- Comfort working in a team-oriented and collaborative environment
- Ability to communicate clearly and to seek help and support proactively