Remote Senior Site Reliability Engineer

Knack

Time zones: EST (UTC -5), CST (UTC -6), MST (UTC -7), PST (UTC -8), AKST (UTC -9), HST (UTC -10)
We’re looking for someone to help improve our reliability and performance through deep analysis and remediation of our AWS infrastructure, monitors, alerts, and code.
 Key Responsibilities 
  • Refactor our existing monitors and alerts to be actionable and reliable, recommending and implementing diagnostic techniques and monitoring tools.
  • Perform deep-dive analysis of RDS (Aurora PostgreSQL) performance, using that data to inform scaling policies and automation
  • Help discover correlations between customer experience and performance indicators to determine what is noticeable to customers, and suggest and implement improvements based on findings
  • Help us develop SLIs, SLOs, and SLAs that are meaningful to our customers’ experience
  • Help triage outages and issues across multiple teams, services, and codebases as they arise, leading root cause analysis and creating stories to prevent and/or detect those issues in the future
  • Serve as technical lead for deep dives to identify solutions to prevent future incidents
  • Introduce chaos engineering, promoting experimentation in production to discover and remediate systemic weaknesses and improve performance and reliability
 Skills, Knowledge, and Expertise 
  • Expertise in AWS
  • Expertise with RDS, preferably Aurora PostgreSQL engine
  • Expertise with containerization
  • Experience with open-source monitoring and visualization systems and tools, e.g. Prometheus (monitoring + tracing), Grafana/Kibana (dashboards), Graylog (logging)
  • Experience implementing, maintaining, and troubleshooting continuous integration/continuous delivery (CI/CD) tooling
  • Experience implementing improvements in areas such as maintainability, scalability, availability, extensibility, and security
  • Ability to work with many teams across disciplines (cloud, platform, development, QA, and security) to resolve issues as they arise and implement improvements
  • Experience with distributed tracing, diagnostic tooling, application performance monitoring, and the golden signals
 Our Stack 
Our stack is evolving over the next year and we’d love you to be a part of that! Currently we’re using:
  • Back-end: JavaScript/TypeScript, Node.js, ES6, Go
  • Data: Aurora PostgreSQL, Redis, Elasticsearch
  • DevOps & Deployment: All things AWS, Terraform (and Terraform Cloud), Jenkins, GitHub, Grafana, Graylog
  • Testing: Playwright, Mocha, Jest
  • Front-end: Vue.js, Webpack, SCSS