About IBM Cloudant
IBM Cloudant is a fully managed NoSQL database service, offering Database-as-a-Service (DBaaS) across all IBM Cloud regions. As a critical component of the IBM Cloud ecosystem, Cloudant enables businesses to focus on their applications while we ensure high availability, performance, and reliability at scale.
We are looking for an IT Operations & Incident Response Specialist to join our global team. This role is essential for maintaining the stability and efficiency of our services in a 24/7 production environment. The ideal candidate will have expertise in incident response, system monitoring, and DevOps methodologies, and will resolve operational issues quickly and effectively.
Why Join Us?
- Be part of a global team ensuring the smooth operation of IBM Cloudant, a key component of the IBM Cloud ecosystem.
- Work in a collaborative, high-impact environment with opportunities for growth and innovation.
- Competitive salary, benefits, and career development opportunities.
If you're passionate about IT operations, incident response, and service optimization, we'd love to hear from you! Apply today.
Responsibilities
- Incident Response & Monitoring
  - Serve as the first line of defense for operational incidents, responding promptly to on-call pages.
  - Monitor system performance and troubleshoot live service issues.
  - Participate in a regional service pager rotation (12:00 – 20:00 local time) and shared weekend/public holiday on-call rotations.
- Service Management & Optimization
  - Prepare for new or changed services and oversee the change management process.
  - Manage IT services and products, including outsourced services such as public and virtual private networks.
  - Ensure compliance with regulatory, legal, and professional standards in IT service delivery.
- Automation & Performance Improvement
  - Develop scripts and tools to automate troubleshooting and improve operational efficiency.
  - Analyze system performance, conduct root cause analysis (5 Whys, etc.), and implement preventative measures.
  - Contribute to developmental projects, learning new technologies as needed.
- Documentation & Communication
  - Maintain incident response procedures, operational runbooks, and escalation paths.
  - Provide service-level reporting, risk assessments, and contingency planning.
  - Collaborate effectively with internal teams, external partners, and stakeholders.
Qualifications
- Proven experience in IT operations, system monitoring, and incident response in a 24/7 production environment.
- Strong technical knowledge of IT hardware, software, communications, and application solutions.
- Ability to troubleshoot complex issues and develop effective solutions.
- Experience managing change processes, regulatory compliance, and outsourced services.
- Proficiency in risk management, contingency planning, and service-level reporting.
- Strong communication and collaboration skills to work across teams and with stakeholders.
- Ability to work independently, prioritize tasks, and contribute to team and organizational goals.
- Linux system administration certification (preferred).
- Hands-on experience with Debian Linux, system optimization, and configuration.
- Knowledge of Erlang, CouchDB, Python, Kubernetes, or similar technologies.
- Experience operating and managing public cloud platforms (e.g., IBM Cloud).
- Track record of delivering complex production services with a focus on stability and uptime.