ERCOT: Texas was ‘seconds and minutes’ away from catastrophic months-long blackouts

Authored by kwtx.com and submitted by 711jm

(TEXAS TRIBUNE) - Texas’ power grid was “seconds and minutes” away from a catastrophic failure that could have left Texans in the dark for months, officials with the entity that operates the grid said Thursday.

As millions of customers throughout the state begin to have power restored after days of massive blackouts, officials with the Electric Reliability Council of Texas, or ERCOT, which operates the power grid that covers most of the state, said Texas was dangerously close to a worst-case scenario: uncontrolled blackouts across the state.

The quick decision that grid operators made in the early hours of Monday morning to begin what were intended to be rolling blackouts, but which lasted days for millions of Texans, occurred because operators were seeing warning signs that massive amounts of energy supply were dropping off the grid.

As natural gas fired plants, utility scale wind power and coal plants tripped offline due to the extreme cold brought by the winter storm, the amount of power supplied to the grid to be distributed across the state fell rapidly. At the same time, demand was increasing as consumers and businesses turned up the heat and stayed inside to avoid the weather.

“It needed to be addressed immediately,” said Bill Magness, president of ERCOT. “It was seconds and minutes [from possible failure] given the amount of generation that was coming off the system.”

Grid operators had to act quickly to cut the amount of power distributed, Magness said, because if they had waited, “then what happens in that next minute might be that three more [power generation] units come offline, and then you’re sunk.”

Magness said on Wednesday that if operators had not acted in that moment, the state could have suffered blackouts that “could have occurred for months,” and left Texas in an “indeterminately long” crisis.

The worst case scenario: Demand for power overwhelms the supply of power generation available on the grid, causing equipment to catch fire, substations to blow, power lines to go down.

If the grid had gone totally offline, the physical damage to power infrastructure from overwhelming the grid could have taken months to repair, said Bernadette Johnson, senior vice president of power and renewables at Enverus, an oil and gas software and information company headquartered in Austin.

“As chaotic as it was, the whole grid could’ve been in blackout,” she said. “ERCOT is getting a lot of heat, but the fact that it wasn’t worse is because of those grid operators.”

If that had occurred, then even as power generators recovered from the cold, ERCOT would have been unable to quickly reconnect them back to the grid, Johnson said.

Grid operators would have needed to slowly and carefully bring generators and customers back online, all the while taking care not to cause more damage to the grid. It’s a delicate process, Johnson explained, because each part of the puzzle (the generators producing power, the transmission lines that move the power, and the customers that use it) must be carefully managed.

“It has to balance constantly,” she said. “Once a grid goes down, it’s hard to bring it back online. If you bring on too many customers, then you have another outage.”

ERCOT officials have repeatedly said that the winter storm that swept the state caught power generators off guard. The storm far exceeded what ERCOT projected in the fall to prepare for winter.

“The operators who took those actions to prevent a catastrophic blackout, and much worse damage to our system, that was, I would say, the most difficult decision that had to be made throughout this whole event,” Magness said.

Nine grid operators who make these sorts of decisions are working at any given time, said Leslie Sopko, a spokesperson for ERCOT.

“At the end of the day, our operators are highly-trained and have the authority to make decisions that protect the reliability of the electric system,” she said in a statement.

ERCOT made “significant progress” overnight Wednesday to restore customer power to many Texans, and remaining power outages are likely due to ice storm damage to the distribution system. Some areas that were taken offline will also need to be restored manually, according to ERCOT.

ERCOT warned that emergency conditions remain, and that “some level of rotating outages” may be necessary over the coming days to keep the grid stable.

Copyright 2021 Texas Tribune. All rights reserved.

redditmudder on February 18th, 2021 at 23:34 UTC »

FWIW: Here was my realtime analysis as this was unfolding early Monday morning.

Post Mortem (FEB19 1AM): When I first wrote this post - and the following seven updates - we had zero public information about what was going on with ERCOT. As such, I was drawing on my knowledge as an Electrical Engineer with more than 24 hours' experience regarding ERCOT, proposing educated guesses as to what was happening to ERCOT in real time. Hindsight might reveal that some of the proposals I made are not 100% correct, but so far the underlying root cause I theorized in the heat of the moment appears to be mostly (95%?) correct. It will take months to get a complete hundred+ page report.

..................................

Let's clarify why ITE (the grid's time error, i.e. how far a clock that counts grid cycles has drifted from true time) is such a good indicator of system status. Let's treat the grid like a car, since most of us are familiar with how a gas pedal works. Note that traditional power plants are effectively really large car engines, as they fundamentally operate in the same way: if you want to produce more power, you push down on the throttle. If you want to slow down, you let up. Under normal conditions, the grid isn't running at maximum effort, which is equivalent to pushing the pedal down maybe 30% of the way to maintain 60 mph on flat ground. However, as our car approaches a hill, we need more power, so we instinctively push the pedal down some more (maybe 40%), and we maintain our 60 mph speed.

But what happens if you're a dork like me that drives a car with 67 horsepower on Texas' highways? In this case, to cruise along the flat road at 60 mph I might need to hold the pedal down 90%. As I approach the same hill, maybe I need to put the pedal all the way to 100%, which just barely maintains the 60 mph setpoint. At this point, I'm still (barely) going 60 mph, and so the grid is still working well (but teetering on disaster). Also at this point, ERCOT would have just issued* their first level conservation notice (out of three total levels). *OK, ERCOT actually has a small buffer, but it's not nearly enough to account for realtime system changes (e.g. a plant going offline, the wind stopping, sun going behind clouds, etc).

Now what happens if the hill gets any steeper? Well, I start slowing down (and ERCOT rapidly escalates through level 2 and 3). At this point I am redlining my car continuously for as long as the hill remains steep. So what happens is I start to slow down... 59 mph... 58 mph... 57 mph... etc. Bringing this back to power plants, they're easily able to maintain 60 Hz unless they're overloaded, which means ITE is a very strong indicator of overload conditions on the grid. 59 Hz is a MAJOR deviation, similar to driving 10 mph on the interstate.

At this point the car analogy starts to fall apart, so let's add the power plant situation back to the mix. In our AC transmission system, every single AC generating device (e.g. power plant, wind turbine, etc) must operate at the same frequency. Any deviation from the overall grid frequency is ultimately turned into heat somewhere (power line, transformer, another power plant, the errant power plant... but certainly somewhere). I could bore you to death by explaining reactance, but that involves imaginary numbers and might interest 1% of the audience asking "why is ITE so important?"

So let's skip the math and just summarize it like this: If a single generating source gets appreciably out of phase, something inside of it will absorb energy (heat). Since we're dealing with an intertwined system that produces enough energy to boil the water out of a swimming pool in a few seconds, it becomes obvious that you can quickly absorb enough energy to explode. This is because we transmit power in AC systems via sine waves... if your sine wave isn't exactly in sync with the grid's sine wave, then the area under the curve (essentially the phase difference between each generator) becomes heat (hand wave to the EEs out there). Heat is bad, so ERCOT agrees that everyone on their grid will operate at 60 Hz. As I mentioned above, it is trivially easy to maintain that frequency (hand wavy here, but this is just a fact so believe it). To illustrate how easy it is to maintain: ERCOT specifies that the frequency should remain between 59.98 and 60.02 Hz under normal operating conditions... we exceeded that parameter by fifty times as I was initially writing the linked post.
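To put a number on "fifty times": here is a minimal Python sketch that compares a measured frequency against the 59.98 to 60.02 Hz band quoted above. The function name is made up for illustration, and the 59.0 Hz sample is simply the figure used elsewhere in this post, not a value pulled from ERCOT data.

```python
NOMINAL_HZ = 60.0       # agreed-upon grid frequency
NORMAL_BAND_HZ = 0.02   # normal operation: 59.98 to 60.02 Hz (from the text above)

def deviation_multiple(freq_hz: float) -> float:
    """Return how many 'normal bands' the frequency sits away from nominal."""
    return abs(freq_hz - NOMINAL_HZ) / NORMAL_BAND_HZ

print(deviation_multiple(59.99))  # inside the normal band: about 0.5
print(deviation_multiple(59.0))   # about 50, the "fifty times" figure
```

Anything above 1.0 is outside the normal operating band under this simplified definition.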

So what's so special about 60 Hz? Honestly, nothing, except that it's the agreed upon standard. Why is the speed limit 60 mph? So because of the above, we know that if you are generating on ERCOT, you have agreed to output at 60 Hz, in-phase with everyone else. We also know that if you are not generating in phase (and at 60 Hz), then there is a 100% chance the phase mismatch will inject energy into your generating equipment, causing it to absorb heat (i.e. explode). Therefore, it is of critical importance that if your generator isn't able to "keep up" with ERCOT, then you need to disconnect it from the grid (so it doesn't blow up). If you don't disconnect, then under best case conditions the other generator can actually sink power into your generator, causing it to speed up and lose control. In this case, one large power plant is using its energy to spin the motor in another one... which isn't useful because the actual consumers (you and me) aren't getting power, because the power plants are busy powering themselves (or exploding).

Under normal conditions this won't actually happen because as soon as a power plant realizes it's out of sync, it will automatically disconnect from the grid. The problem then is that now the power plant is no longer providing power to anyone... ERCOT's capacity gets smaller, which places more load on the remaining generating equipment. Since the remaining equipment was already operating at 100%, it wasn't able to push the pedal any harder - or if it could, it would overload the local transmission lines/equipment - which caused more plants to drop out of phase, and thus disconnect (so they wouldn't explode). This cascaded for a while, until the grid briefly dropped down to 59 Hz... this isn't a "1 Hz out of 60" issue... but rather a "1 Hz out of 0.02 (60.00 - 59.98)." Until consumer demand decreases, this is a downward death spiral. To overcome this positive feedback loop (i.e. nosedive at mach2) ERCOT artificially decreased consumer demand by mandating the utility providers shed load (i.e. turning off your power).
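The death spiral can be sketched as a toy simulation. This is a deliberately crude model with made-up plant sizes: it assumes that whenever demand exceeds the online capacity, one plant (arbitrarily, the smallest) trips offline to protect itself, and it ignores everything real operators care about (frequency dynamics, transmission limits, reserves). `simulate_cascade` and `shed_mw` are hypothetical names for illustration only.

```python
def simulate_cascade(capacities_mw, demand_mw, shed_mw=0.0):
    """Toy cascade: return (plants_still_online, plants_tripped).

    Assumption: while demand exceeds total online capacity, the smallest
    online plant disconnects, which shrinks capacity further.
    """
    online = sorted(capacities_mw)
    demand = demand_mw - shed_mw   # load shedding lowers the demand seen by the grid
    tripped = 0
    while online and demand > sum(online):
        online.pop(0)              # smallest plant trips first (arbitrary choice)
        tripped += 1
    return len(online), tripped

plants = [100, 200, 300, 400]      # made-up capacities in MW, 1000 MW total

print(simulate_cascade(plants, demand_mw=1050))               # (0, 4)
print(simulate_cascade(plants, demand_mw=1050, shed_mw=100))  # (4, 0)
```

In this toy setup a 5% overload takes every plant down, because each trip makes the overload worse for the survivors, while shedding 100 MW up front keeps all four online.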

..................................

So the next question I'm seeing a lot is "why do we need to make this time up if we're at 60 Hz now." The answer here is that we don't (for grid reliability). The only reason we need to make up the accumulated error is so that those systems that rely on cycle counting (e.g. clocks) remain accurate. This is trivial for our ITE discussion and we will ignore it.

The key point with ITE is that the total accumulated error (42.6 seconds as of this writing) doesn't matter at all for system stability... only the rate of change matters. Under normal operating conditions, the rate of change is essentially always zero (e.g. 0.001 Hz/min)... any minor discrepancy is easily corrected by adjusting the frequency slightly above or below 60 Hz. However, if the rate of change is -1 second per minute (i.e. the grid is operating at 59 Hz), that is a clear indication that the demand is greater than the supply... the pedal is at 100% and the car can't provide any more power... which causes all the other cars to slow down, too (unless they exit off the highway, or crash into you in a ball of fire).
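The link between frequency and ITE drift is simple arithmetic: a clock that counts cycles assumes 60 cycles per second, so at frequency f it gains or loses (f - 60)/60 seconds for every real second. A small Python sketch of that relationship (the function name is made up for illustration, and deliberate time-error corrections are ignored):

```python
NOMINAL_HZ = 60.0

def time_error_change(freq_hz: float, duration_s: float) -> float:
    """Seconds gained (+) or lost (-) by a cycle-counting clock over duration_s."""
    return duration_s * (freq_hz - NOMINAL_HZ) / NOMINAL_HZ

print(time_error_change(59.0, 60.0))    # -1.0: one second lost per minute at 59 Hz
print(time_error_change(59.0, 600.0))   # -10.0: ten seconds lost in ten minutes
print(time_error_change(60.0, 3600.0))  # 0.0: no drift at exactly 60 Hz
```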

So we don't need to correct the total accumulated error, per se, except that it's required by law (aside: there is a growing industry trend to abandon this requirement). Only the rate of change matters... ITE changing by even a fraction of a second over any given minute is an anomaly... whereas on Monday morning ITE was changing one full second every minute for a good ten minute period, and dropped 30 seconds in the following 6 hours, and then dropped to nearly a full minute over the past several days. Think about that: While you've been huddled in a corner in your flooded apartment for the past two days, the grid only dropped 30 seconds, whereas it dropped 10 seconds in ten minutes early Monday morning... that's how close the grid came to failing early Monday morning.
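The arithmetic behind these numbers can be run in reverse to get the average frequency implied by a given ITE change. A hedged sketch, taking the figures above at face value (10 seconds in ten minutes, 30 seconds in six hours) and assuming ITE simply accumulates (f - 60)/60 seconds per second with no deliberate correction in between; the function name is hypothetical:

```python
NOMINAL_HZ = 60.0

def implied_average_hz(time_error_change_s: float, duration_s: float) -> float:
    """Average grid frequency implied by a time-error change over duration_s."""
    return NOMINAL_HZ * (1.0 + time_error_change_s / duration_s)

early_monday = implied_average_hz(-10.0, 10 * 60)      # 10 s lost in 10 minutes
next_six_hours = implied_average_hz(-30.0, 6 * 3600)   # 30 s lost in 6 hours

print(f"{early_monday:.3f} Hz")    # 59.000 Hz
print(f"{next_six_hours:.3f} Hz")  # 59.917 Hz
```

Under those assumptions the six-hour stretch averaged roughly 0.08 Hz low, versus a full 1 Hz low during the worst ten minutes.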

Thank you for coming to my TED Talk.

Edit: This slightly NSFW picture summarizes ERCOT this week.

SirKeyboardCommando on February 18th, 2021 at 23:21 UTC »

Every week I'm supposed to do a spot check at the power plant, and the one year I blow it off, this happens.

hatsarenotfood on February 18th, 2021 at 23:01 UTC »

There was a post in /r/Austin as this was happening that observed the near disaster.