"Turning system noise into customer confidence — one signal at a time."
// 01 — About
I'm a DevOps & SRE professional with over 11 years of experience operating production-grade SaaS platforms, cloud-native infrastructure, and B2B API ecosystems — at the precise intersection where engineering depth meets customer trust.
I specialize in Kubernetes, CI/CD automation, OpenTelemetry-based observability, and incident ownership — helping teams move from reactive firefighting to proactive platform intelligence. Whether designing a GitHub Actions pipeline to GCP Cloud Run, debugging a latency anomaly through distributed traces, or translating a crash-loop backtrace into a board-level risk narrative, I bridge deep technical craft with business impact.
Most recently at Middleware.io, I owned end-to-end observability support for 50+ enterprise accounts — becoming the primary escalation point for complex APM, trace, and metric debugging. Before that, I built CI/CD pipelines and led a 24/7 L3 SRE team at Ola Electric, and delivered Fortune 500 technical account success at Sprinklr.
// 02 — Tech Stack
// 03 — Experience
Owned production observability and troubleshooting for 50+ global enterprise accounts across APM, logs, metrics, traces, and RUM. Debugged complex distributed systems — latency spikes, CPU/memory pressure, trace gaps — using OpenTelemetry, Prometheus, and Grafana, reducing MTTR by 35%. Reproduced customer workloads in Node.js, Python, Kubernetes, and Cloud Run to isolate root causes and drive engineering fixes. Translated production issues into actionable product feedback, directly influencing the OpsAI and Continuous Profiling roadmap. Authored technical runbooks and best-practice guides that improved customer self-service and platform stability.
Took a planned career break for full-time parenting while actively building hands-on DevOps skills. Shipped GitHub Actions pipelines deploying to GCP Cloud Run, built Prometheus/Grafana monitoring stacks with Slack-integrated alerting, and practiced infrastructure-as-code with Terraform and Docker Compose. Completed the Microsoft Power BI Data Analyst certification and progressed toward Google Cloud Professional DevOps Engineer, CKA, and CKAD.
Owned production reliability and incident response for an OTA vehicle software platform running large-scale 24/7 deployments. Designed CI/CD pipelines on Azure DevOps — improving deployment velocity by 40% and cutting manual release steps by 80%. Built observability dashboards across Grafana, Dynatrace, and Splunk, and managed API Gateway integrations via Apigee. Led incident management and RCA workflows, reducing MTTR by 40% through automated runbooks and post-mortem process improvements. Scaled and mentored the 24/7 L3 support team from 7 to 20 engineers — establishing on-call rotations, SLA frameworks, and training programs.
Served as technical SPOC for Fortune 500 enterprise accounts — Apple, Microsoft, Dell, Samsung, UBS — maintaining 95%+ CSAT. Led B2B API integrations across Salesforce, Slack, Zendesk, and Adobe Analytics, ensuring zero-downtime migrations. Managed a GetSatisfaction → Sprinklr Communities platform migration with SSO/SAML configurations for 100K+ users. Conducted RCAs for critical incidents, mentored junior engineers, and improved operational best practices across the team.
Five years building a strong foundation across enterprise API support, IVR/VoIP platform operations, and production troubleshooting at scale. Worked across telecom, IT services, and conglomerate verticals, developing the systematic debugging mindset and customer-facing technical depth that has powered everything since.
// 04 — Skills
// 05 — Projects & Open Source
A production-grade personal portfolio site with a Three.js 3D hero, scroll-triggered animations, amber/charcoal dark theme, and full mobile responsiveness. Built from scratch — no templates.
↗ github.com/MahendraRao/mahendra-portfolio
An interactive 3D character experience built with Three.js. A dancing Humpty Dumpty you can control, built to sharpen my 3D rendering and real-time animation skills in the browser.
↗ github.com/MahendraRao/3d-humpty
An exploratory creative coding experiment, pushing the boundaries of what pure HTML and JavaScript can express. One of my most-starred personal projects.
↗ github.com/MahendraRao/consciousness
A GitHub-hosted AI wrapper for OpenClaw, built as a React + Vite frontend and a local Express + TypeScript API. Simplifies installation, system diagnostics, provider setup (OpenAI / Anthropic / Ollama / custom), and first-run onboarding for non-coders. Working MVP today; Electron packaging, OpenTelemetry, and Docker/K8s integration on the roadmap.
↗ github.com/MahendraRao/clawbridge
A CLI tool that interrogates your OpenTelemetry pipeline (collector health, trace ingestion rates, span drop rates, and SDK configuration) and surfaces actionable diagnostics. The runbook I always wished existed during enterprise escalations.
↗ Coming soon on GitHub
A production-ready GitHub Actions workflow template for deploying containerized apps to GCP Cloud Run, with built-in secret management, smoke tests, rollback on failure, and Slack notifications. Built during upskilling, now being polished for open source.
↗ Coming soon on GitHub
Designed and executed a structured onboarding program for enterprise customers adopting OpenTelemetry from scratch. Reduced time-to-value from weeks to days by building instrumentation guides, debug runbooks, and live troubleshooting playbooks for Go, Python, and Node.js stacks.
Built a tiered escalation framework and knowledge infrastructure at a company scaling at breakneck speed. Defined SLAs, trained L1/L2 agents, and created engineering feedback loops that led to a measurable reduction in repeat incidents across the product lifecycle.
Managed strategic success programs for Fortune 500 accounts on the Sprinklr CXM platform. Delivered impactful QBRs, drove feature adoption, and acted as the technical-business bridge during critical escalations — converting at-risk relationships into multi-year renewals.
// 06 — Writing
Lessons from instrumenting real production systems — the gotchas, the sampling edge cases, and why your traces are probably lying to you.
Read more →
How we redesigned the OTA software deployment pipeline at Ola Electric, and what most teams get wrong about release velocity vs. release safety.
Read more →
A field-tested guide to pod failures, resource starvation, and network black holes, written from a support engineer's perspective, not a textbook author's.
Read more →
// 07 — Contact
Open to DevOps Engineer, SRE, Cloud TSE, Solutions Engineer, and Platform Engineering roles — especially in observability, cloud infrastructure, or enterprise SaaS. Based in Bengaluru · Open to remote & hybrid.
nkneelkumar [at] gmail [dot] com