Detecting Stuck Sump Pumps: From 2 Hours to 90 Seconds

Warning: this is a low-effort post generated by Claude Code, as a summary of a coding session aimed at fine-tuning the sump pump monitoring. Don’t trust the code & numbers here to be perfectly valid.

The problem

I have two sump pumps in my basement. Things tend to go wrong now and then, and of course, in many ways:

  • float switch stuck (up or down)
  • impeller stuck
  • air leak in the pipes
  • electrical issue or whatever

How the pumps work

The setup is pretty simple:

  • Shelly Pro PM device monitors and controls both pumps
  • Cronjob powers each pump for 5 minutes every hour (more than enough, and if the impeller happens to be stuck, this reduces the risk of overheating)
  • Float switches in the sump trigger the pumps when water level is high
  • Normally pumps run 1-3 times per day for 3-5 minutes each

So each pump gets 24 short power windows per day, but only actually runs when its float switch is activated (high enough water level).
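The 5-minute power window could also be configured on the Shelly itself. A minimal sketch using the Gen2 Schedule API (switch id and timespec are illustrative; the actual cron setup may differ):

// Hourly schedule: power switch 0 on, with automatic off after 5 minutes
Shelly.call("Schedule.Create", {
  enable: true,
  timespec: "0 0 * * * *",                           // second 0, minute 0, every hour
  calls: [{
    method: "Switch.Set",
    params: { id: 0, on: true, toggle_after: 300 }   // 300s = 5 minutes
  }]
});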

Basic inactivity check

One idea could be to check if the pump hasn’t run in n hours. Let’s say n = 12:

// lastTimeActive = timestamp of the last power reading above the "active" threshold
// (a sketch of how it could be maintained follows below)
if (Date.now() - lastTimeActive > 12 * 3600 * 1000) {
  alert("Pump hasn't been active in 12 hours!");
}

A valid approach, but n has to be tuned to the calmest period of the year. This might be a case for exploring some predictive model…
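For completeness, here’s a minimal sketch of how lastTimeActive could be maintained and checked on the Shelly (the 50W "active" threshold and the hourly Timer.set check are assumptions borrowed from the configs further down):

let lastTimeActive = Date.now();

// Refresh the timestamp on every power reading above the "active" threshold
function onPowerEvent(event) {
  let power = event.delta.apower;
  if (typeof power !== "undefined" && power > 50) {
    lastTimeActive = Date.now();
  }
}

// Hourly check: alert if the pump has been idle for more than 12 hours
Timer.set(3600 * 1000, true, function () {
  if (Date.now() - lastTimeActive > 12 * 3600 * 1000) {
    alert("Pump hasn't been active in 12 hours!");
  }
});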

Excess activity check

The logic: if the pump runs for the entire power window (close to 5 minutes), something’s probably wrong. Normal runs are 2-3 minutes, then the pump stops when the sump is empty.

// Track cumulative active time in each power window
if (currentWindowActiveTime > 240000) { // 4 minutes
  consecutiveAlerts++;

  if (consecutiveAlerts >= 2) {
    alert("Pump ran >4 min in 2 consecutive windows - possible stuck impeller");
  }
}
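The snippet above assumes currentWindowActiveTime and consecutiveAlerts are tracked somewhere; a rough sketch of how that accumulation could work from power events (variable names and the event-driven window rollover are illustrative, not the actual script):

let windowStart = Date.now();
let currentWindowActiveTime = 0;
let lastActiveTimestamp = null;
let consecutiveAlerts = 0;

function onPowerEvent(event) {
  let now = Date.now();
  let power = event.delta.apower;
  if (typeof power === "undefined") return;

  // New hourly window: evaluate the previous one, then reset counters
  if (now - windowStart > 3600000) {
    if (currentWindowActiveTime > 240000) { // ran >4 minutes
      consecutiveAlerts++;
      if (consecutiveAlerts >= 2) {
        alert("Pump ran >4 min in 2 consecutive windows - possible stuck impeller");
      }
    } else {
      consecutiveAlerts = 0;
    }
    windowStart = now;
    currentWindowActiveTime = 0;
  }

  // Accumulate time spent above the 50W "active" threshold
  if (power > 50) {
    if (lastActiveTimestamp !== null) {
      currentWindowActiveTime += now - lastActiveTimestamp;
    }
    lastActiveTimestamp = now;
  } else {
    lastActiveTimestamp = null;
  }
}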

Configuration:

{
  device: "vip130",
  maxRunDuration: 480000,  // 8 minutes
  checkPeriod: 3600000,     // Check every hour
  pwThreshold: 50           // Consider "active" if >50W
}

That looks OK. Alerting only after 2 consecutive windows should reduce the risk of false positives.

Time to alert is about 2 hours, which is not too bad.

Grafana insights

I export power metrics to Grafana every minute, and I had enough historical data to find a past incident where the pump was clearly active but not behaving as usual (maybe a stuck impeller, or an air leak causing the flow not to start).

VIP130 Pump Power Draw

State             Power
Idle              <10W
Normal pumping    220-222W
Stuck impeller    149-152W

70W difference between normal and stuck.

This is a clear signal we can detect!

The other pump (Longlife) draws ~200W normally but I don’t have stuck event data for it yet. So I’ll implement power signature detection for VIP130 only and keep the duration-based detection as backup for both.

Third try: power signature detection

Now we can detect abnormal power directly. But I wanted to be conservative to avoid false positives from startup transients:

// Wait 30s for startup stabilization
// Then collect power samples for 60s
// If 3+ consecutive readings <170W → ALERT

let isActive = false;
let startTime = null;
let powerSamples = [];
let lowPowerReadings = 0;
let alertSent = false;

function onPowerEvent(event) {
  let power = event.delta.apower;
  if (typeof power === "undefined") return;

  // Pump just started drawing power: reset tracking state
  if (!isActive && power >= 50) {
    isActive = true;
    startTime = Date.now();
    powerSamples = [];
    lowPowerReadings = 0;
    alertSent = false;
    return;
  }

  // Pump stopped (back below the 50W active threshold): stop tracking
  if (isActive && power < 50) {
    isActive = false;
    return;
  }

  if (isActive) {
    // Ignore readings during the 30s startup stabilization period
    if (Date.now() - startTime < 30000) return;

    // Keep a rolling 60s window of power samples
    powerSamples.push({ timestamp: Date.now(), power: power });
    powerSamples = powerSamples.filter(s => Date.now() - s.timestamp < 60000);

    if (powerSamples.length < 3) return;

    let avgPower = powerSamples.reduce((sum, s) => sum + s.power, 0) / powerSamples.length;

    if (avgPower < 170) {
      lowPowerReadings++;
      if (lowPowerReadings >= 3 && !alertSent) {
        alertSent = true;
        alert("VIP130 stuck: " + avgPower.toFixed(0) + "W (expected 200-230W)");
      }
    } else {
      lowPowerReadings = 0;
    }
  }
}
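The handler gets its power events through the eventFilter in the config below. In a standalone Shelly script, the wiring could look roughly like this (an illustration, not the actual build):

// Standalone wiring sketch: forward switch:1 power updates to onPowerEvent
Shelly.addStatusHandler(function (event) {
  if (event.component === "switch:1" && typeof event.delta.apower !== "undefined") {
    onPowerEvent(event);
  }
});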

Conservative approach:

  • 60s sampling window for stable average
  • 3 consecutive low readings confirms it’s sustained
  • 170W threshold leaves a safe margin between stuck (~150W) and normal (~220W)

Configuration

export let pump1PowerSignatureConfig = {
  device: "vip130",
  component: "switch:1",

  // Empirical ranges from Grafana
  normalPowerRange: { min: 200, max: 230 },
  stuckPowerRange: { min: 140, max: 160 },

  // Detection parameters (conservative)
  pwThreshold: 50,                    // Active threshold
  stabilizationTime: 30000,           // 30s startup delay
  samplingDuration: 60000,            // 60s sample window
  alertThreshold: 170,                // Alert if avg <170W
  minConsecutiveLowReadings: 3,       // Require 3 consecutive

  eventFilter: function(event) {
    // Only power updates for switch:1 (presence check so 0W readings still pass)
    return event.component === "switch:1" && typeof event.delta.apower !== "undefined";
  }
};

How fast is this?

Timeline when pump gets stuck:

t=0s:    Pump starts (stuck, drawing ~150W)
t=30s:   Stabilization period ends, sampling begins
t=50s:   Sample 1: 151W (lowPowerReadings = 1)
t=70s:   Sample 2: 149W (lowPowerReadings = 2)
t=90s:   Sample 3: 152W (lowPowerReadings = 3)
t=90s:   🚨 ALERT

About 90 seconds from startup to alert.

Duration-based detection took 1-2 hours. This is much faster.

Why I kept all three detection methods

I didn’t replace the duration-based detection; I kept all three layers:

Layer 1: Power signature (~90s) - Direct measurement, high confidence, VIP130 only

Layer 2: Run duration (1-2hr) - Extended runtime in consecutive windows, both pumps

Layer 3: Run frequency (24hr) - Too many starts per day, catches different failure modes like a stuck float switch (a sketch follows below)

t=90s:   🚨 Power signature alert - I get notified, can investigate
t=2hr:   🚨 Duration alert - Confirms the power signature was right
t=24hr:  🚨 Frequency alert (if still broken) - Problem is ongoing

Fast detection plus backup confirmation from independent signals.
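Layer 3 has no code elsewhere in this post; a rough sketch of a start-counting check (the 10-starts-per-day threshold and variable names are assumptions):

let pumpRunning = false;
let startTimestamps = [];

function onPowerEvent(event) {
  let power = event.delta.apower;
  if (typeof power === "undefined") return;

  // A transition above the 50W threshold counts as one start
  if (!pumpRunning && power >= 50) {
    pumpRunning = true;
    startTimestamps.push(Date.now());
  } else if (pumpRunning && power < 50) {
    pumpRunning = false;
  }

  // Keep only starts from the last 24 hours
  startTimestamps = startTimestamps.filter(t => Date.now() - t < 24 * 3600 * 1000);

  if (startTimestamps.length > 10) {
    alert("Pump started " + startTimestamps.length + " times in 24h - possible stuck float switch");
  }
}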

What’s next

Use alerting on Grafana - The Shelly device could lose connectivity or stop working; Grafana alerts should detect that.

Predictive maintenance - Track gradual power decline (220W → 210W → 200W) to catch bearing wear before total failure.
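One possible shape for that check (the slow moving average and the 210W drift threshold are assumptions, not measured values):

// Sketch: slow-moving average of power during normal-looking runs,
// alert if it drifts below an assumed 210W
let avgRunPower = 220;  // seeded with the known-good value from Grafana

function onPowerEvent(event) {
  let power = event.delta.apower;
  if (typeof power === "undefined" || power < 170) return;  // ignore idle/stuck readings

  // Exponential moving average so day-to-day noise doesn't dominate
  avgRunPower = 0.99 * avgRunPower + 0.01 * power;

  if (avgRunPower < 210) {
    alert("VIP130 average run power drifted to " + avgRunPower.toFixed(0) + "W - possible wear");
  }
}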

Wrapping up

Looking at historical Grafana data during a stuck event revealed a 70W power difference I could use for detection. This got me from 2-hour detection down to about 90 seconds for some issues.

Key points:

  • Log everything, analyze later - empirical data beats guessing
  • Be conservative with thresholds - requiring 3 consecutive readings prevents false positives
  • Layer detection - fast + slow layers catch different failure modes
  • Event-driven when possible - saves resources, faster response

The build pipeline from part 1 made iterating on this quick - write strategy in ES6+, get type checking, build to minified ES5 in ~100ms.

Resources