Do you like working with both existing and new software product development teams? This role instruments end-to-end observability and visibility for business-critical systems through log ingestion, metrics, and traces. You will work as a site reliability engineer (SRE) who collaborates with product teams, infrastructure SMEs, DevOps engineers, and the proactive monitoring team to provide tailored dashboards of relevant service-level analytics for various product stakeholders.
Work closely with software product development teams (ITSO, Product Owner, SME) to implement monitoring & observability instrumentation within their platforms.
Drive adoption of best practices in monitoring, alerting, automation, and site reliability.
Lead or contribute to engineering efforts from design to implementation, focusing on the instrumentation of logs, metrics, and traces.
Drive the use of automation both in software instrumentation and in responding to service degradation events.
Identify and execute on opportunities to implement instrumentation in pre-production environments.
Proactively pursue continuous improvement and expansion in observability coverage, service reliability best practices, incident management, and problem management.
Must have
Production support experience as a developer on an e-commerce platform
Strong knowledge and experience in Java
SRE experience
Scripting experience
5+ years of experience administering Linux and at least 2 years supporting production environments;
Experience designing large-scale distributed solutions, including their capacity planning;
Deep understanding of TCP/IP networking;
Familiar with SLA, SLO, and SLI terms;
Experience with monitoring and alerting tools such as Grafana, Datadog, and Prometheus;
Strong knowledge of virtualization and containerization principles including orchestration tools;
Familiar with CaC and IaC tools (Ansible, Salt, Terraform, Packer);
Familiar with CI/CD tools (Jenkins, Azure DevOps);
Experience with relational and NoSQL DBMS
A clear understanding of Agile and DevOps culture and the problems they are intended to solve;
Strong written and verbal communication skills;
Understanding of information security principles;
Understanding of popular deployment strategies (feature flags, blue/green, canary, dark launch, etc.);
"Critical thinker" and "problem solver"
Nice to have
Experience working with Azure
Previous experience working in SRE teams;
Languages
English: B2 Upper Intermediate
Seniority
Senior