Detecting Stuck Sump Pumps: From 2 Hours to Under 4 Minutes
Warning: this is a low-effort post generated by Claude Code, as a summary of a coding session aimed at fine-tuning the sump pump monitoring. Don’t trust the code and numbers here to be perfectly valid.
The problem
I have two sump pumps in my basement. Things go wrong now and then and, of course, in many ways:
- float switch stuck (up or down)
- impeller stuck
- air leak in the pipes
- electrical issue, or whatever else
How the pumps work
The setup is pretty simple:
- Shelly Pro PM device monitors and controls both pumps
- A cron job powers each pump for 5 minutes every hour (more than enough, and if the impeller happens to be stuck, the short window reduces the risk of overheating)
- Float switches in the sump trigger the pumps when water level is high
- Normally pumps run 1-3 times per day for 3-5 minutes each
So each pump gets 24 short power windows per day, but only actually runs when its float switch is activated (high enough water level).
Basic Inactivity check
One idea could be to check whether the pump hasn’t run in n hours. Let’s say n = 12:
if (Date.now() - lastTimeActive > 12 * 3600 * 1000) {
  alert("Pump hasn't been active in 12 hours!");
}
This is a valid approach, but the threshold has to be tuned for the calmest period of the year, when the pumps legitimately run least often. This might be a case for exploring some predictive model…
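To make the threshold easy to tune and test, the same check can be written as a small pure function that takes the current time as an argument. This is only a minimal sketch: `isInactive` and `maxIdleHours` are hypothetical names, not part of the actual script.

```javascript
// Hypothetical helper: returns true when the pump has been idle for too long.
// lastTimeActive and now are timestamps in milliseconds.
function isInactive(lastTimeActive, now, maxIdleHours) {
  var maxIdleMs = maxIdleHours * 3600 * 1000;
  return now - lastTimeActive > maxIdleMs;
}

// Usage: last run was 13 hours ago, threshold is 12 hours.
var HOUR_MS = 3600 * 1000;
console.log(isInactive(0, 13 * HOUR_MS, 12)); // true
console.log(isInactive(0, 1 * HOUR_MS, 12));  // false
```

Passing `now` explicitly means a different `maxIdleHours` can be tried against historical data without touching the device.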
Excess activity check
The logic: if the pump runs for the entire power window (close to 5 minutes), something’s probably wrong. Normal runs last 2-3 minutes, then the pump stops when the sump is empty.
// Track cumulative active time in each power window.
// This check is evaluated once, at the end of each window.
if (currentWindowActiveTime > 240000) { // 4 minutes
  consecutiveAlerts++;
  if (consecutiveAlerts >= 2) {
    alert("Pump ran >4 min in 2 consecutive windows - possible stuck impeller");
  }
} else {
  consecutiveAlerts = 0; // a normal window breaks the streak
}
Configuration:
{
  device: "vip130",
  maxRunDuration: 480000, // 8 minutes
  checkPeriod: 3600000,   // Check every hour
  pwThreshold: 50         // Consider "active" if >50W
}
That looks OK. Alerting only after 2 consecutive windows should reduce the risk of false positives.
Since the power windows are an hour apart, time to alert is up to 2 hours, which is not too bad.
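How `currentWindowActiveTime` gets computed is not shown above. One possible way, sketched here under the assumption that timestamped power readings arrive during each window, is to credit the time between two readings whenever the earlier one was above `pwThreshold`. The names `createWindowTracker` and `createExcessRunDetector` are hypothetical.

```javascript
// Sketch: accumulate "active" time inside one power window from timestamped
// power readings. A reading counts as active when it is above pwThreshold.
function createWindowTracker(pwThreshold) {
  var activeMs = 0;
  var lastTs = null;
  var lastWasActive = false;
  return {
    // Feed one reading; ts is a timestamp in milliseconds.
    addReading: function (power, ts) {
      // Credit the interval since the previous reading if the pump was active then.
      if (lastTs !== null && lastWasActive) {
        activeMs += ts - lastTs;
      }
      lastTs = ts;
      lastWasActive = power > pwThreshold;
    },
    getActiveMs: function () { return activeMs; }
  };
}

// Sketch: count consecutive windows whose active time exceeds maxActiveMs.
function createExcessRunDetector(maxActiveMs, requiredWindows) {
  var consecutive = 0;
  return {
    // Call once at the end of each power window; returns true when an alert is due.
    closeWindow: function (activeMs) {
      consecutive = activeMs > maxActiveMs ? consecutive + 1 : 0;
      return consecutive >= requiredWindows;
    }
  };
}

// Usage: one reading per minute over a 5-minute window, pump drawing 220W throughout.
var tracker = createWindowTracker(50);
for (var t = 0; t <= 300000; t += 60000) tracker.addReading(220, t);
console.log(tracker.getActiveMs()); // 300000

var detector = createExcessRunDetector(240000, 2);
console.log(detector.closeWindow(tracker.getActiveMs())); // false (first long window)
console.log(detector.closeWindow(tracker.getActiveMs())); // true (second in a row)
```

Keeping the per-window tracker separate from the consecutive-window counter makes it easy to reset the tracker for every window while the counter survives across windows.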
Grafana insights
I export power metrics to Grafana every minute, and had enough data to find past metrics from an incident where the pump was clearly active but not behaving as usual (maybe a stuck impeller, or an air leak preventing the flow from starting).
VIP130 Pump Power Draw
| State | Power |
|---|---|
| Idle | <10W |
| Normal pumping | 220-222W |
| Stuck impeller | 149-152W |
70W difference between normal and stuck.
This is a clear signal we can detect!
The other pump (Longlife) draws ~200W normally, but I don’t have stuck-event data for it yet. So I’ll implement power signature detection for VIP130 only and keep the duration-based detection as a backup for both.
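As an illustration of how the table above could be turned into code, here is a minimal sketch that maps a single power reading to a state. `VIP130_RANGES` and `classifyPower` are hypothetical names, and readings that fall between the observed ranges are deliberately reported as "unknown" rather than guessed.

```javascript
// Power ranges for VIP130 in watts, taken from the table above.
var VIP130_RANGES = {
  idleMax: 10,
  stuck: { min: 149, max: 152 },
  normal: { min: 220, max: 222 }
};

// Map one power reading (W) to "idle", "stuck", "normal" or "unknown".
function classifyPower(power, ranges) {
  if (power < ranges.idleMax) return "idle";
  if (power >= ranges.stuck.min && power <= ranges.stuck.max) return "stuck";
  if (power >= ranges.normal.min && power <= ranges.normal.max) return "normal";
  return "unknown";
}

console.log(classifyPower(5, VIP130_RANGES));   // idle
console.log(classifyPower(151, VIP130_RANGES)); // stuck
console.log(classifyPower(221, VIP130_RANGES)); // normal
console.log(classifyPower(180, VIP130_RANGES)); // unknown
```

The observed ranges are very narrow, so a real detector would probably need wider margins or a single threshold somewhere in the 70W gap.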
Third try: power signature detection
Now we can detect abnormal power directly. But I wanted to be conservative to avoid false positives from startup transients:
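// Wait 30s for startup stabilization
// Then average power over consecutive 60s windows
// If 3+ consecutive window averages <170W → ALERT
let isActive = false;
let startTime = null;
let windowStart = null;
let powerSamples = [];
let lowPowerReadings = 0;

function onPowerEvent(event) {
  let power = event.delta.apower;
  let now = Date.now();

  // Pump stopped (or idle): forget the current run
  if (power < 50) {
    isActive = false;
    return;
  }

  // Pump just started: reset state, first window opens after stabilization
  if (!isActive) {
    isActive = true;
    startTime = now;
    windowStart = startTime + 30000;
    powerSamples = [];
    lowPowerReadings = 0;
    return;
  }

  // Ignore readings taken during the startup transient
  if (now < windowStart) return;

  powerSamples.push(power);

  // Keep collecting until the 60s window is complete. The window is evaluated
  // on the first event at or after its end, so this assumes power events keep
  // arriving while the pump runs.
  if (now - windowStart < 60000) return;

  // One averaged reading per window
  let avgPower = powerSamples.reduce((sum, p) => sum + p, 0) / powerSamples.length;
  powerSamples = [];
  windowStart = now;

  if (avgPower < 170) {
    lowPowerReadings++;
    if (lowPowerReadings >= 3) {
      alert("VIP130 stuck: " + avgPower.toFixed(0) + "W (expected 200-230W)");
    }
  } else {
    lowPowerReadings = 0;
  }
}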
Conservative approach:
- 60s sampling window for a stable average
- 3 consecutive low readings confirm it’s sustained
- 170W threshold is a safe margin between stuck (150W) and normal (220W)
Configuration
export let pump1PowerSignatureConfig = {
  device: "vip130",
  component: "switch:1",
  // Empirical ranges from Grafana
  normalPowerRange: { min: 200, max: 230 },
  stuckPowerRange: { min: 140, max: 160 },
  // Detection parameters (conservative)
  pwThreshold: 50,              // Active threshold (W)
  stabilizationTime: 30000,     // 30s startup delay
  samplingDuration: 60000,      // 60s sample window
  alertThreshold: 170,          // Alert if avg <170W
  minConsecutiveLowReadings: 3, // Require 3 consecutive
  // Accept only power updates for this switch. The typeof check keeps 0W
  // readings (pump stopped), which a plain truthiness test would drop.
  eventFilter: function(event) {
    return event.component === "switch:1" &&
      !!event.delta && typeof event.delta.apower === "number";
  }
};
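To show how these parameters could drive the detection, here is a sketch of a detector built from a config of the same shape. It is not the actual on-device code: `createPowerSignatureDetector` is a hypothetical name, timestamps are passed in explicitly so the logic can be exercised off-device, and it assumes one averaged reading per sampling window.

```javascript
// Hypothetical factory: builds a detector from a config shaped like
// pump1PowerSignatureConfig. Time is passed in explicitly (ms).
function createPowerSignatureDetector(config) {
  var isActive = false;
  var windowStart = 0;
  var samples = [];
  var lowReadings = 0;

  // Returns the low average (W) when an alert is due, otherwise null.
  return function onReading(power, now) {
    if (power < config.pwThreshold) { // pump idle or stopped
      isActive = false;
      return null;
    }
    if (!isActive) { // pump just started
      isActive = true;
      windowStart = now + config.stabilizationTime;
      samples = [];
      lowReadings = 0;
      return null;
    }
    if (now < windowStart) return null; // still in the startup transient
    samples.push(power);
    if (now - windowStart < config.samplingDuration) return null;

    // Window complete: reduce it to a single averaged reading.
    var sum = 0;
    for (var i = 0; i < samples.length; i++) sum += samples[i];
    var avg = sum / samples.length;
    samples = [];
    windowStart = now;

    if (avg >= config.alertThreshold) {
      lowReadings = 0;
      return null;
    }
    lowReadings++;
    return lowReadings >= config.minConsecutiveLowReadings ? avg : null;
  };
}

// Usage: simulate a stuck pump drawing 150W, one reading every 10 seconds.
var detectorConfig = {
  pwThreshold: 50,
  stabilizationTime: 30000,
  samplingDuration: 60000,
  alertThreshold: 170,
  minConsecutiveLowReadings: 3
};
var detect = createPowerSignatureDetector(detectorConfig);
var firstAlertAt = null;
for (var t = 0; t <= 300000 && firstAlertAt === null; t += 10000) {
  if (detect(150, t) !== null) firstAlertAt = t;
}
console.log(firstAlertAt); // 210000
```

With synthetic readings every 10 seconds, the first alert lands at 210000 ms, which fits inside the 5-minute power window.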
How fast is this?
Timeline when pump gets stuck:
t=0s: Pump starts
t=90s: Sample 1: 151W (lowPowerReadings = 1)
t=150s: Sample 2: 149W (lowPowerReadings = 2)
t=210s: Sample 3: 152W (lowPowerReadings = 3)
t=210s: 🚨 ALERT
210 seconds (3.5 minutes) from startup to alert.
Duration-based detection took 1-2 hours. This is much faster.
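The arithmetic behind the timeline can be written down explicitly. This is a small sketch assuming one averaged reading per sampling window, as in the timeline above; `timeToAlertMs` is a hypothetical helper.

```javascript
// Earliest possible alert time for a pump that is stuck from the moment it starts:
// the stabilization delay plus one full sampling window per required low reading.
function timeToAlertMs(stabilizationTime, samplingDuration, minConsecutiveLowReadings) {
  return stabilizationTime + samplingDuration * minConsecutiveLowReadings;
}

console.log(timeToAlertMs(30000, 60000, 3) / 1000); // 210
console.log(timeToAlertMs(30000, 60000, 2) / 1000); // 150
```

Requiring only 2 consecutive low readings would save a minute, at the cost of a less conservative check.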
Why I kept all three detection methods
I didn’t replace duration-based detection, I kept all three layers:
Layer 1: Power signature (210s) - Direct measurement, high confidence, VIP130 only
Layer 2: Run duration (1-2hr) - Extended runtime in consecutive windows, both pumps
Layer 3: Run frequency (24hr) - Too many starts per day, catches different failure modes like a stuck float switch
When the VIP130 gets stuck, the alerts arrive in this order:
t=210s: 🚨 Power signature alert - I get notified, can investigate
t=2hr: 🚨 Duration alert - Confirms the power signature was right
t=24hr: 🚨 Frequency alert (if still broken) - Problem is ongoing
Fast detection plus backup confirmation from independent signals.
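The run-frequency layer is not shown in code in this post. As a rough sketch of what it might look like, the detector below counts pump starts in a rolling 24-hour window. `createRunFrequencyDetector` is a hypothetical name and `maxStartsPerDay` is an assumed threshold, not a value from the actual setup.

```javascript
// Hypothetical layer 3: count pump starts in a rolling 24-hour window.
function createRunFrequencyDetector(maxStartsPerDay) {
  var DAY_MS = 24 * 3600 * 1000;
  var starts = [];
  return {
    // Record a pump start at time ts (ms); returns true when the count is excessive.
    recordStart: function (ts) {
      starts.push(ts);
      // Drop starts that are 24 hours old or older.
      starts = starts.filter(function (s) { return ts - s < DAY_MS; });
      return starts.length > maxStartsPerDay;
    }
  };
}

// Usage: a float switch stuck "up" makes the pump start in every hourly window.
// The threshold of 6 starts per day is an assumption for illustration only.
var freq = createRunFrequencyDetector(6);
var alertHour = null;
for (var h = 0; h < 24 && alertHour === null; h++) {
  if (freq.recordStart(h * 3600 * 1000)) alertHour = h;
}
console.log(alertHour); // 6
```

A rolling window like this can fire earlier than the 24 hours quoted above when the threshold is exceeded sooner. A fixed once-a-day count would behave like the 24hr figure.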
What’s next
Use alerting in Grafana - the Shelly device could lose connectivity or stop working, and Grafana alerts should detect that.
Predictive maintenance - Track gradual power decline (220W → 210W → 200W) to catch bearing wear before total failure.
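One simple way to quantify such a decline, sketched here as an idea rather than something already implemented, is a least-squares slope over equally spaced averages (for example one average power value per day). `trendSlope` is a hypothetical helper, and any alert threshold on the slope would still need to be chosen from real data.

```javascript
// Least-squares slope of values over x = 0, 1, 2, ... (e.g. one average per day).
// Returns watts per step; a negative result means the power draw is declining.
function trendSlope(values) {
  var n = values.length;
  if (n < 2) return 0; // not enough points for a trend
  var meanX = (n - 1) / 2;
  var meanY = 0;
  for (var i = 0; i < n; i++) meanY += values[i];
  meanY /= n;
  var num = 0;
  var den = 0;
  for (var j = 0; j < n; j++) {
    num += (j - meanX) * (values[j] - meanY);
    den += (j - meanX) * (j - meanX);
  }
  return num / den;
}

console.log(trendSlope([220, 210, 200])); // -10
console.log(trendSlope([221, 221, 221])); // 0
```

Fitting a slope over several points is less sensitive to a single noisy day than comparing only the first and last values.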
Wrapping up
Looking at historical Grafana data during a stuck event revealed a 70W power difference I could use for detection. This got me from 2-hour detection to about 3.5 minutes for some issues.
Key points:
- Log everything, analyze later - empirical data beats guessing
- Be conservative with thresholds - requiring 3 consecutive readings helps prevent false positives
- Layer detection - fast + slow layers catch different failure modes
- Event-driven when possible - saves resources, faster response
The build pipeline from part 1 made iterating on this quick - write strategy in ES6+, get type checking, build to minified ES5 in ~100ms.
Resources
- Part 1: Modern Shelly Development - Build pipeline setup
- Part 3: Architecture Patterns - Strategy pattern on embedded devices
- Shelly Pro PM docs