Jan 2 2012

Brittle railroads

(re-post from Google+ — seriously, that’s where I mostly ramble now, but someone suggested I should make this mutter more permanent, so it goes here)

I think I stumbled on what I dislike about “railroaded” game campaigns and, as usual, it’s by way of an analogy.

First, railroading is obviously a continuum. It’s a kind of failure in scenario design but it’s not a make-or-break failure. As with any design defect (I’m thinking of other design contexts, like hardware or software, and there’s the analogy pointed out for those of you who don’t like solving my little puzzles) it’s not necessarily catastrophic in itself but rather makes the follow-on work (the implementation and the maintenance) harder, and that’s the problem.

In system design we’d call this a problem of brittleness: increasing railroading makes the design more brittle. That is, it’s less resistant to unanticipated events. It’s harder to change a small part without impacting other aspects of the design. So if a railroad (and this is funny to me since I design real railroads sometimes) is a brittle design, maybe the reasons for brittleness in system design are similar?

Usually it’s a problem with coupling. You see coupling errors (they aren’t really errors, but from my perspective they break things so I flag them as errors) in software all the time and often they are a result of Bad Laziness (distinct from Good Laziness): part of the problem is hard to solve so the designer makes it someone else’s problem and now an inappropriate subsystem has to do work that impacts the appropriate subsystem. Now changing one black box affects the functionality of another in ways not covered by the interface spec. This is brittle: I can’t change the design of one component without investigating other components. In a big system this becomes a whack-a-mole game worth millions of dollars and thousands of ergs of customer good-will. Brittle is bad.
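If you want the toy version of that coupling failure in code, here’s a sketch (all the class names and numbers are invented for illustration):

```python
# One module quietly depending on another's internals, not its interface.

class Inventory:
    def __init__(self):
        self._items = {"widget": 4}   # private storage detail

    def count(self, name):
        """The published interface: how many of `name` do we have?"""
        return self._items.get(name, 0)


class BrittleBilling:
    def total(self, inv):
        # Reaches past the interface into private state. Change how
        # Inventory stores things and this breaks, silently.
        return sum(inv._items.values()) * 10


class RobustBilling:
    def total(self, inv, names):
        # Uses only the published interface; Inventory's internals
        # can change freely without affecting this code.
        return sum(inv.count(n) for n in names) * 10


inv = Inventory()
print(BrittleBilling().total(inv))             # 40
print(RobustBilling().total(inv, ["widget"]))  # 40
```

Both give the same answer today. Rename `_items` tomorrow and only the brittle one breaks — and you won’t find out until you go looking in a component you thought you didn’t have to investigate.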

Coupling is the problem in railroading, too. Events later depend on the outcome of events earlier in a way that is inappropriate. In system design we’d solve this in a way that’s useless to a scenario designer though: in a game we want more flexibility with less dependence whereas in a system I’d just lock down the functionality and the interfaces and analyze for coupling, moving functions and features as necessary to recover the design. In a scenario this might be boring as hell. It might not even be possible. I’m not even sure where the analogy goes if you head down that route.

The coupling in scenario design happens when a planned scene can only happen if a previous scene resolves in the way predicted by the designer. This strikes me as a red flag right out of the gate: the scenario design depends on reliable prediction of the future. You need a fair amount of magical thinking to believe that will work. When trying to plan for the future (this actually relates to safety design methods by the way) you can try to enumerate all possible outcomes and address each, of course. This will not work. It’s a novice’s first guess at solving this kind of problem and it fails because you will miss something.

What you can do, though, is categorize the kinds of future events rather than plan in detail, and create plans that are similarly categorized. If the villain is thwarted in this scene then we need some kind of new threat. If the players decide their characters are interested in another direction of exploration then we need something there to explore. This leads to general solutions: I need a map that extends in all directions. I don’t need to know in detail what’s everywhere, though, I just need some tools to slow pace until I can go home and plan the next session (here’s where I fall in love with random encounters, by the way — they don’t need to fit the plot because everyone already knows they are random and, frankly, if the players cleverly find a way in which the encounter is consistent with the plot, well, yoink, I am totally using that). If they are disinterested in the objective I thought was interesting, then I need a few ways to make it interesting (your tool here is the character sheet: what did they say was interesting?) and see if those work. If not, listen and delay!

Anyway, I admit that’s just rambling and not an argument. Railroading is brittle. That’s why it sucks. Not sure if any of that other stuff follows.


Mar 16 2011

Safety is suddenly a lot more interesting

Due to events in Japan, which I don’t need to remind you about, safety as an engineering exercise is a lot more interesting than it was a couple of weeks ago.

Safety calculations have many different forms and many proponents for each, but by far the most common is a simple “expected return” calculation. That is, you estimate the chance of each hazard and multiply by the cost of it happening, and using basic probability come up with an expected cost per year in dollars. Well no one actually uses dollars except the insurance companies, but it’s worth pointing out that the insurance companies are the best at this. No one else is really willing to take the PR hit of saying a life is worth X dollars, though.
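A quick sketch of that calculation in Python — the hazards and dollar figures here are made up purely for illustration:

```python
# Simple "expected return" safety calculation: for each hazard,
# multiply its annual probability by its cost, then sum.

def expected_annual_cost(hazards):
    """hazards: list of (probability_per_year, cost_in_dollars)."""
    return sum(p * cost for p, cost in hazards)

# Invented numbers: a frequent cheap incident and a rare expensive one.
hazards = [
    (0.01, 50_000),        # minor operational incident
    (0.0001, 10_000_000),  # major failure
]

print(expected_annual_cost(hazards))  # ~1500 dollars/year
```

Note that the rare catastrophic hazard and the routine one can contribute comparable amounts to the total — which is exactly why this form of the calculation hides what the next paragraphs are about.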

Anyway, there are others that say that this is inadequate and that remote but catastrophic events need a different weighting than what I’ll call “operational events” because their ramifications are deeper. What we are seeing in Japan and in particular at the Fukushima plant is that they are totally right. And this affects reactor design choices deeply.

The Boiling Water Reactor (BWR) that Fukushima uses is designed to operate at fairly low temperatures (the steam is generated at only around 250 degrees Celsius) and therefore at relatively low pressures. This makes the operational risk fairly low because only low pressures are involved. And that means you can make the pressure containers weaker because, well, there are not very high pressures involved. And that’s very inexpensive. And that’s very attractive. And safe!

In a “black swan” event, however, like a magnitude 9.0 earthquake and accompanying tsunami, operational pressures are irrelevant. Things get shaken and smashed and the internal temperatures (and consequently pressures) are no longer related to the intended operational conditions. They are now only related to the possible configurations allowed by the laws of physics. Now, these are happily constrained by other BWR design elements like the kind of fuel and so on, but nonetheless they are rather more extreme than the operational safety mitigations protect against. And so now that weaker pressure vessel is looking pretty crap. But, hey, black swan events are in the one-in-a-million range of probabilities! So the ER math works out. It’s worth the risk.

Well, maybe not. Now, I am not going to suggest that a Pressurized Water Reactor is necessarily a better bet — it increases operational risks and they are your day-to-day worry. But maybe a BWR with containment designed for parameters closer to the physical maximum would be a better bet? Well, hindsight is crystalline, of course.

Here’s the calculation that would be better, though, than the chance and cost of a reactor failing versus the cost of making it and the profit generated by it: the cost of a brand or a company failing while an already desperate population gets an extra dose of desperation. I know that sounds mercenary, but I want to find a way to make a fiscal argument so that it’s heard well because industry soon forgets ethical arguments. But cash flow they do not forget.

So, ignoring the safety concerns for a single plant, let’s look at what General Electric actually puts at risk by adopting too simple a calculation (and I am not saying that’s what they did — they may well have done one more like what I propose and it still turned out to be worth their money to make things the way they did). GE does not run a single-plant risk at all, you see.

Rather, their risk includes the risk that any plant with their name on it fails in a public and terrible manner at any time. Okay so right away we can see that operational safety is super important, because now the time frame is reactor-years and not just years. So now a black swan of 1 in 1,000,000 per year is 1 in 1,000,000 per year per reactor. The corporate black eye potential of a thousand reactors is now 1 in 1,000 per year. Yikes! Now that is a gamble that sucks!
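That fleet arithmetic, as a few lines of Python:

```python
# A 1-in-a-million-per-reactor-year event, across a fleet of 1000
# reactors: what's the chance it happens somewhere, this year?

p_per_reactor_year = 1e-6
n_reactors = 1000

# Exact probability that at least one reactor has the event in a year.
p_fleet = 1 - (1 - p_per_reactor_year) ** n_reactors
print(p_fleet)  # ~0.001, i.e. roughly 1 in 1,000 per year
```

For small probabilities the exact figure and the naive sum (n × p) are nearly identical, which is why the in-text shortcut of “a thousand reactors turns one-in-a-million into one-in-a-thousand” is fine here.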

Maybe. If we’re talking about GE’s perspective, then we can only really count the cost to them. What’s the cost of a GE reactor failing? And what’s the cost of protecting against a black swan event? Worst case for GE is it goes out of business, totally, dissolving all assets to pay fines and suits. Wikipedia says that’s about $48 billion. So your worst case is 1 in a million every year per reactor to lose $48 billion. Absolute worst case (and we note that in black swan space we are waving our hands pretty hard by definition). So how much safety is it worth building in, over and above the basics and without cost to the customer (because you can’t sell him YOUR safety, trust me), per reactor?

It depends on how many reactors you make — the chance of getting blindsided by the black swan increases every time you sell a reactor. It might actually make sense to stop making them at some point, just because you can elevate your corporate risk beyond acceptable levels: with so many risks on the table at once you have elevated the impossible into distinctly possible — even likely — space. If a way-overspecced pressure vessel costs GE an extra ten million dollars, that’s ten billion dollars on a thousand reactors! Spending ten billion to buy down a one-in-a-thousand event is like pricing that event at ten trillion dollars — against a $48 million annual ER for not doing it. So even taking the whole corporate net assets as risked, and accounting for a thousand reactors at once, it’s hard to see it making dollars sense using an ER.

Of course, over 100 years, that’s 10 billion versus 4.8 billion. Still not worth it! If it was only a million bucks? Maybe worth it. Maybe. In dollars.
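For the curious, here’s that back-of-envelope comparison in Python, using the post’s figures (ten million extra per reactor, a thousand reactors, a $48 billion worst case at one-in-a-thousand per fleet-year):

```python
# Certain up-front cost of over-specced vessels vs the expected loss
# of the corporate worst case, over a century.

extra_cost_per_reactor = 10e6
n_reactors = 1000
certain_cost = extra_cost_per_reactor * n_reactors   # $10B, paid up front

worst_case_loss = 48e9    # GE's total assets (the Wikipedia figure)
p_fleet_per_year = 1e-3   # black swan, somewhere in the fleet, per year
years = 100
expected_loss = worst_case_loss * p_fleet_per_year * years  # $4.8B

print(certain_cost > expected_loss)  # True: the ER says don't build it
```

Which is the whole point: even with the most generous assumptions, a straight ER calculation tells the manufacturer the extra safety isn’t worth the money.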

What are the follow-on costs though?

In a black swan event it’s safe to say that the cost of the failure will not just be the immediate cost of the disaster. It’s not a hundred lives lost directly attributable to the accident itself. It’s also a few hundred thousand people without power at a time when they could really use some power — because they are affected by the same event! It’s the psychological effect of having to worry about radiation poisoning at the same time as your town has been reduced to disorganized lumber. These cannot be dollar values and so they are almost certainly not the manufacturer’s concern, but they must be the customer’s concern, surely. And I think this is at the heart of this calculation — when considering the most remote kinds of event, we are typically considering natural catastrophes that affect the system under analysis. And that means that the failure, in addition to having direct safety implications, also compounds the damage that is going on around the failure. It is a force multiplier on how much everything sucks and does not occur in isolation, practically by definition.

An ER does not capture this. At all. It is completely unequipped for it. So I think in future we are going to see a lot more attention to more complex modeling methods that answer questions like:

How much does this make the causal disaster worse?

How do we handle impacts that are flat-out untenable (infinite dollar value)?

How do we determine when an impact is completely untenable?

At what point in the probability of an event do we have to assume a surrounding disaster that we might be making worse? Is p=0.0001 implying it? p=0.00001? Something else (I think this is right)?1

These things are not easy and not inexpensive and generally companies are not motivated to solve them unless they can make money on it. That is, after all, the only real metric that we use to judge companies. So that means customers must demand that these cases be addressed even though there is a reasonable expectation that it will never happen to them, and shoulder part of the cost. But I think we will see that and see creative solutions — there’s plenty of room to explore impact mitigation as well as likelihood mitigation.2 For a couple of years there will even be a lot of motivation while the memory of these events is fresh.

The curse of the black swan, however, is that the intervals between are often longer than this memory. And so no matter what we learn today, the odds are good that we will have to learn it again.


  1. One of the things that you often see in a safety analysis is a hazard based on equipment failure, and that failure is mitigated by requiring multiple components to fail simultaneously in operation, which is a multiplied (independent) probability. A disaster, however, makes them dependent and I think that’s not modeled for the most part. If you posit a natural disaster, you can practically assume multiple simultaneous component failures and that means no matter how low you can make operational p, it is never lower than the disaster p. And that means you have to mitigate impact to get an adequately low value — p bottoms out at p(tsunami).
  2. It’s worth pointing out that choosing a different power source can be seen as an impact mitigation (certainly if you install a million wind turbines, you have mitigated perfectly against core meltdowns). It’s also worth noting, however, that we heard almost no news about how many were killed when the natural gas processing plant exploded during the tsunami. By that I mean that the actual impact may not change in ways we hope it will if we do the calculation in earnest. But it might.
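To put footnote 1 in numbers (all of them invented for illustration):

```python
# Two redundant components, each failing with p = 1e-4 per demand,
# look like p^2 = 1e-8 together -- IF the failures are independent.
# A disaster (say p(tsunami) = 1e-5 per year) fails both at once,
# so the real figure bottoms out near p(tsunami).

p_component = 1e-4
p_tsunami = 1e-5

p_independent = p_component ** 2       # 1e-8: the rosy, independent figure
p_actual = p_independent + p_tsunami   # dominated by the common cause

print(p_independent < p_tsunami)  # True: redundancy can't beat p(tsunami)
print(p_actual)                   # ~1e-5, no matter how many components
```

However many redundant components you stack up, the bottom line never drops below the disaster probability — the only lever left is impact mitigation.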

Mar 7 2011

Sharpening your doors

I recently ran into a mass of traffic on the Traveller Mailing List that revolved around making airlock doors sharp so that you can cut things in half with them. This sort of thing is why it’s a good thing that there’s a mismatch between the mailing address I reply with and the one I subscribed with (and can’t recall): I can’t reply to these things.

In the past I’ve mentioned that some of my duties revolve around safety. Others have to do with security, which is related. The sharp-airlock-problem is a happy coincidence of both. It also underscores the value of a chart I once found in a paper called “Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes” by Ord, Hillerbrand, and Sandberg. Here’s the chart:

Here are the basics of that paper. Often in safety we need to calculate very small numbers, because we want to make things that have a very small chance of being unsafe. That’s P(X) — the chance that something bad will happen. So we write complex arguments (A) that detail exactly why a given system has such a very tiny chance (P(X)) of going wrong and such a huge chance (P(~X)) of being just fine. What Ord &c. point out, however, is that when P(X) is really small, it bears recalling that the calculation is really the chance of X given that the argument is right — we use the notation P(X|A) for “the chance of X given that A is true”, and the chance that the argument itself is right is P(A). So what they point out is that, given that on average 1 paper in 1000 is retracted from prominent medical journals for being wrong, P(~A) is actually a very big deal. And we really can’t say anything interesting about P(X) if the argument is wrong. Saying there is one chance in a trillion of disaster is fine, but if there’s one chance in a thousand that you’re wrong, then it’s not very compelling. That big grey rectangle stands for “not very compelling”.
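Here’s that point as arithmetic, with a made-up prior for what happens when the argument fails:

```python
# Total probability: P(X) = P(X|A)P(A) + P(X|~A)P(~A).
# The safety case claims a tiny P(X|A), but the argument itself
# is only right with probability P(A).

p_x_given_a = 1e-12      # what the safety case claims
p_not_a = 1e-3           # chance the argument itself is wrong
p_x_given_not_a = 0.01   # invented prior: our guess if the argument fails

p_x = p_x_given_a * (1 - p_not_a) + p_x_given_not_a * p_not_a
print(p_x)  # ~1e-5: the argument's error rate dominates the claim
```

The one-in-a-trillion claim contributes essentially nothing to the answer; the floor is set by how often arguments like this turn out to be wrong.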

So anyway, sharp airlock doors, right. So the argument goes something like this: there’s a chance that a bear, a marine, or a giant squid (all examples from the mailing list, as I recall it, so don’t yell at me) will try to get into your space ship. It would be handy, given this chance, to make the airlock doors sharp so you can slam them shut on your pursuers. The math is presumably that P(squid) is quite high and so the expense is warranted.

Let’s think just a little harder, however. First, let’s agree that if everything in your spaceship goes to hell — the power fails, the hydraulics rupture, and generally everything goes south — you want everything that relies on these systems to remain safe. Now, this is a space ship, so most of the time what is outside it is, well, space. Consequently, the failure mode for the airlock doors has to be “closed”. That means that they should be constructed in such a way that if there is some kind of failure, it causes the doors to close. That probably means that the airlock doors are under some kind of constant tension (a big spring maybe) and that power and other services are used to open them rather than close them.

Okay so being as we’re talking about safety in dire circumstances, we can also see that we probably don’t want this door to close real hard or real fast — you’re just not very safe if the airlock doors carefully close to preserve the air but a) are closed so tight you can never get out or b) close on you as you try to enter to safety, cutting you in half. So, okay, safe failure, means a closing door that won’t kill you.

Already we can see that sharpening the doors might be a bad idea, but let’s continue anyway.

So the argument then is that P(squid) is more likely than P(electrical fault). That is, you are more likely to get attacked by a giant squid than to have a problem that cuts power (or other service) to the door. And keep in mind that both cases have the caveat “while you and/or the squid are trying to enter or exit” since if that’s not the case then just shut the door already and ignore the squid. More correctly, P(attempted entrance by some enemy that we are happy to kill) > P(fault in any service relating to the airlock door).

Okay, that might even be the case. You might run in very dangerous places. And you might also argue that, being as airlock doors are doubled, a pressure failure only happens if both doors fail. If we are pre-supposing no single point of failure elsewhere in the vessel (not a bad assumption) then the chance of both doors failing is the chance of one door failing squared, which is much smaller even than before. So P(squid) might reasonably be pretty high indeed.

Let’s also keep in mind that P(door) is tested every time you walk through the door, though. Now we’re not talking about the chance of pressure loss but rather the chance of you getting cut in half by the door, so we don’t get to square that (you don’t need to get cut in half by both doors). So P(door) is now the chance that you will be killed or maimed every time you use the door. P(squid) is starting to look like a safer bet as you eye that guillotine edge every time you pass by on your way to the service bay for coffee or back into the ship from an EVA. Couple that with the fact that you don’t want that guillotine to operate only when there’s a system failure. You want a big red button at the captain’s console that closes those on any marauding squids. So now one of your failure cases is “captain closes the lock on your sorry ass” as well as simple door failure.

And you don’t want to override that, or even safeguard it much, because of the apparently very high P(squid).
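For fun, here’s the door math in Python — every number is invented, but the shape of the comparison is the point:

```python
# Pressure safety gets to square the single-door failure rate (both
# doors must fail). The guillotine hazard is tested on every single
# transit, and does not.

p_door_fails = 1e-4                  # single door, per year (invented)
p_pressure_loss = p_door_fails ** 2  # both doors must fail: 1e-8

p_cut_per_transit = 1e-6   # sharp-door mishap, per use (invented)
transits_per_year = 2000   # coffee runs and EVAs add up
p_cut_per_year = 1 - (1 - p_cut_per_transit) ** transits_per_year

print(p_pressure_loss)  # 1e-08
print(p_cut_per_year)   # ~0.002: five orders of magnitude worse
```

Even with a per-use mishap chance far lower than the per-year door-failure rate, sheer repetition makes the guillotine the dominant hazard — before you even add “captain closes the lock on your sorry ass” to the failure cases.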

— BMurray

Jan 10 2011

Safety and the Inversion of Folk Logic

Hurray, Brad is going to talk about his field of expertise instead of game design! Well, this is supposed to be a blog about technical things that interest me and games are just a branch of that (yes, games are technical — a technology — and I can blather about that another time if you like) so I’m not averse to going fairly far afield. And who knows, it might be the case that if I ramble long enough I somehow come back around to games anyway.

I was walking from the train station to work this morning and encountered four interesting cases of really crappy risk analysis — three real and one hypothetical. One was accompanied by an epithet that told me exactly why humans are so bad at risk analysis and, at the same time, why safety design is such a counter-intuitive process. It has to do with the fact that humans think in terms of acceptable risk. In a way, safety design looks from the other side of the glass.

Consider standing at the train platform. There’s a 50cm-wide yellow stripe right at the lip of the platform before it falls vertically to the guideway proper, which is where the train is going to be. I have seen children (and older) stand in the yellow zone and, as the train zooms in, tell their parents it’s perfectly safe, presumably using their survival as evidence. This is logic we expect of children, of course, which is to say, flawed. Deeply flawed.

An evidential argument for safety (I didn’t die that time, or even, no one has died yet) is inadequate. I mean, it’s adequate for you but it’s not adequate for design. You see, that yellow bar does not (again, by design) say, “If you stand here you risk injury or death.” I know, you think it does, and the sign says that, but that’s not how it’s designed and so you are misled into thinking it’s too conservative somehow. You’ve stood in the yellow a hundred or a thousand times and never once been killed.

Rather what it says is, “If you stand on your side of the yellow zone and not in it or, obviously, in the guideway on the other side, then you are as safe as we can make you, which is pretty bloody safe.” That is, technologically, we don’t really know the risk of standing in the yellow zone because it depends a lot on freak configurations of the train, your own stability, and in most cases of actual fatality, whether or not you are wearing a backpack.1 So we don’t try to calculate that. Instead we find a space where, barring some bizarre circumstance, you are certainly safe. Then we mislabel it so you can deride it in front of your parents or friends.

Here are some other examples drawn from my morning walk. You will notice a recurring theme that is both hilarious and insane and perfectly common. I’ll try to remember to point it out at the end.

The traffic signal that indicates it is okay to walk sometimes displays an orange hand instead of a white or green walking guy. This hand does not mean, “you will be killed if you cross now”, or even “you can reasonably expect cars to be passing through your path now”. It means, “You are no longer granted safe passage.” That is, it’s the default case and not a special case. The special case is the green guy, which reads, “Okay, it’s your turn now, and crossing at this time and place is as safe as we can make it.” Any time the green guy is not present, it’s a bad idea to cross. I watched a woman in a dreadful hurry cross on the orange hand this morning (and ours has a countdown on it which, even if you read safety warnings backwards, can reasonably be read as how many seconds until you are totally dead) with the counter to fatality at 4 seconds. She was dressed darkly and small. She fell (also: running in heels, and not running very well) in the middle of the road with two seconds to spare, basically disappearing from sight for many drivers. She was not killed. It was still stupid on several levels.

A crowded sidewalk is a crappy place to ride your bike at high speed. You aren’t especially in danger, but the sort of sociopathy that lets bike riders think this is okay is completely beyond me. You are violating a core premise of the safety design (there won’t be any high speed vehicles in this space, ever) and making what should be a certainly safe space no better than the road. Yes, you did not injure or kill anyone. Well done. Fuck you.

I’ve never seen anyone blow through a train crossing with the bar down, but I think people don’t do it mostly out of an aversion to destroying things like the bar or scratching their vehicle. Or maybe they just avoid violating custom or even law. But I did hear a driver loudly proclaim that there was tons of time between the bar coming down and the train going by. He could totally have made it! The bar does not say, “It is certainly unsafe to proceed”. Rather when the bar is up, the message is, “Don’t worry, it’s safe now.” The bar down says, “We can’t guarantee anything.”  There’s a reason why level crossings in Texas often have webcams that the public can view and it’s not a pleasant one. Texas is one of the best places to get killed by a train you think you’re probably safe from. Yay freedom!

People are not stupid. They are badly equipped to manage risk, though, and certainly others have spoken more authoritatively than I can about that. What you can do is recognize that you are bad at managing risk and work within that envelope. Then the risk you manage is, judging by the hurry out there, being late for an appointment. Here’s how I manage that risk: I set the alarm 15 minutes early, and then I don’t run for anything but sport.


  1. Backpacks are an awesome way to piss people off and also get yourself killed. I’m pleased to see a decline in their popularity after so many years of seeing them everywhere. Here’s the problem: a stuffed backpack is an extra 20-50cm of space protruding from your body that is completely outside the limits of your proprioception. You have no instinctive knowledge of where that thing is. That’s why you’re always banging it into people (and you are, even if you don’t think you are, and you don’t think you are for the same reason) and occasionally hanging it over the yellow zone and into the guideway.

Aug 19 2010


I actually know something about safety. I work in a safety-critical industry (automated transport) and deal with it every day. I deal with it as a matter of process and know ways in which safety can be astronomically improved when looking at a system that has not been designed for safety. Now, in our industry, safety means “no one gets injured or dead unless the best possible outcome requires it, and in that case it is limited at the expense of all other factors”. So, for example, sometimes the only thing you can do is stop as fast as possible, and that may injure someone. So you only do that when the alternative is worse. As you might have guessed, there is some probability math in there.

One joy of studying safety (and I’ll stress that it’s not my expertise — we have a department that only analyzes for safety, but we all have to know something about it or we’d never release anything) is that you can apply it to science-fiction stuff and get cool results, like when I did a (flawed, it turns out) simple safety case for Traveller-style anti-gravity systems on starships. This made some unexpected subsystems necessary and several of them would make cool hooks for a game.

This is not about that. This got stuck in my head while walking to work. I have gone on (and on) about it in person to some of you, so please forgive me. Here I go again.

Adversarial activity benefits from unpredictability. When we behave unpredictably, it is more difficult for an adversary to find us, to reach us, and to harm us. Unless one or the other is vastly superior in some essential category, behaving unpredictably (within the margins that protect your strengths, so not just random but random and still taking advantage of being really fast and wanting to get further away) is good.

Cooperative activity benefits from predictability. There are edge-cases, like brainstorming, where you want some creative randomness in order to open new avenues for investigation, but generally when acting cooperatively things are best served when predictability is increased. In creative endeavours this is a weak statement, partially because there are adversarial elements to the process but also because it’s exploratory. In safety-related contexts, though, it is an absolutely hard rule. Safety requires conservatism and part of that is a demand for the best predictability you can get.

Traffic is a cooperative, safety-critical system.

There are funny-but-true ways to say it’s adversarial. These are bullshit. Take it from me, a pedestrian. Adversarial driving is bullshit. It will kill me or some other pedestrian. This is not on.

So, obviously, as traffic (and in traffic I include everyone in the system — pedestrians, workers, emergency crews, commuters, cyclists, transit, whatever) is necessarily cooperative, it benefits from predictability. So how do we get predictability? Easy, with a process that everyone follows!

Yes, traffic law. Here’s where I wanted to go: no matter how stupid or inconvenient a traffic law seems to you, obeying it increases predictability and therefore safety. Disobeying it — again, no matter what it is! — decreases predictability and therefore safety. There is zero mileage in saying you disobey a traffic law because it’s stupid. It might be. It might cost you minutes a day. I do not fucking care, because when you behave unpredictably by disobeying traffic law, the odds are much higher that you will get a pedestrian killed than pretty much anyone else.

And this means you, too, cyclists. And you also, pedestrian. Buzzing stop signs or walking on a red-hand signal create unpredictability and increase the likelihood that someone will get hurt. And the worst offenders are the highest on the list of likely victims: pedestrians, cyclists, and then motorists.

So while it may feel uncool to obey the law, and it may save you seconds or even minutes, and it may seem like a dumb law, please embrace it. Decide to be proud of following this one set of procedures, no matter how iconoclastic you want to be. In fact, obeying traffic laws is kind of against-the-grain now anyway, a kind of punk straight-edge fuck you to the slackers. Make it yours, be proud of it, and then do it. It’s important.


Dec 2 2009

Making play work

I once did a safety analysis of artificial gravity systems in Traveller spacecraft.

I was tempted to stop there, actually. That’s kind of an article in itself — it’s turgid with meaning and ramifications and questions without even elaborating. Because what I did there (though in the context of play and therefore not nearly as rigorously or detailed as I would at work) was my job, but with a particular kind of science-fiction technology in a particular game setting instead of with my more usual target technology.

This was a great exercise for me. It never actually saw play, but it added a lot of quiet verisimilitude to a game or two — it gave me acronyms to throw around for NPC dialogue that were grounded in a context. It gave me a host of scenarios to explore as play (and really, studying failure modes of technology is practically the definition of plotting a good science-fiction story) and it was fun to do. I guess it helps that I like my job.

It also implied things about technology that I love. For example, there’s a credible argument that in a thousand years we will still use big relays that go THUNK for some things. Here at work we’ve been trying to get rid of them for years, but they remain an incredibly cheap and incredibly reliable way to handle safety-critical switching. There might be something new on the horizon, but beating that much cheap and that much functional is pretty hard.

Anyway, the exercise delivered on three axes: it was fun in itself, it informed play in a way I found fun at the table, and it was useful in the workplace as way to abstract a problem out of its context and think about it from a new angle. So I try to do it when I can.

Another place I get to do work-hobby is in typesetting. I write a lot at work — probably two- to five-thousand words a day. I also build a lot of diagrams, sometimes having to invent new symbology. And so I am often faced with new problems in typesetting to deliver complex material in a useful fashion and that lets me build game-publisher constructions in the context of learning more about my own work. Recent efforts in finding an electronic format that cross-correlates well with print have been fruitful, for example. I have several electronic layouts now that explore the issue from different angles using my work criteria as requirements but my game context as text. Am I playing at work or working at play? It’s a good life, at any rate.

I used to do this in my ungaming period (we call it the Dark Ages around home) as well — I was doing a lot of coding at work and would experiment with new languages and ideas by building gaming tools or IRC robots or something. A lot of code got built and a lot got learned and again I was working-at-play and playing-at-work.

A lot of people don’t do this because their work is not playful. By playful I don’t intend to imply frivolous (see my safety analysis above — the work is as far from frivolous as is possible; lives literally depend on it being right) but rather diverting. Enjoyable. Entertaining. And here’s where I want to link this to our recent Trouble with Lulu — it seems likely to me that this lack of play is part of what alienates people from their work to the degree that they choose to become cogs rather than humans in the machine that hires them. But if solving that problem were play, it would have been done better and faster.

There are some highly professional cogs too — not cogs in the sense that they are automatable but cogs in the sense that they elect to be automata at work. I’ve met a lot of dentists like this and, increasingly, computer programmers. They don’t love their work and they don’t engage it playfully and eagerly. They may do it well (though my experience is that they don’t) but mostly they do it adequately. They selected the career fundamentally because it seemed likely to deliver a job with good pay. They get no joy from being at work and they cannot imagine getting joy from work. And consequently they generally look to maximise what does motivate them at work — pay. These people sometimes do a lot of overtime, paradoxically, trading the leisure they do love for even more pay.

Worst of all are people who must be cogs because a human cog is cheaper to employ than a real cog. These people are de-humanised. That makes them easy to dismiss, but it is them I want to address.

This is what automation (in a broad sense) is all about: de-cogging humans. Because a person that is not a cog is free to be at play, and it is at play that our best thinking happens. So in our office, for example, we have a simple rule for everyone from receptionist to, well, the top: if you do the same thing over and over again, find a way to automate it. Use your skills or call someone who has them, but turn that repetition into a program that does it the same way every time. Play goes up and error rates go down. The guy who loves hacking little scripts does so, and the guy who hates converting Primavera to Excel the way his boss likes it can now click GO and get it done.

And this is where our future must aim: re-humanising everyone. It’s not something we can plan for completely — it’s not a blueprint for yet another Utopia — but it is a goal worth pursuing at every turn. There is good solid work for humans all over our artificial strata of status, but there is also awful, stupid, automatable work that makes some people have to see themselves as un-human, at least for the work day. We should make a place where everyone gets to be human all the time.

I keep smelling whiffs of Marx and Engels. Hrm, mostly Marcuse, now that I think about it. Recall that criticisms of capitalism are separate from the failed blueprints to fix it. Also recall why human rights are important. It’s that first adjective.


Nov 23 2009

Safe software sucks

I haven’t talked a lot about the recent downtime that VSCA has had with Lulu — currently Diaspora is not available because of a fault in their system. It’s been down for six full days and there is no sign that it will be corrected soon. There’s no sign it won’t either. Basically there are no signs.

So this is an interesting automation failure — a highly automated system stops functioning and offers zero information. That’s bad. But it’s also familiar to me and in the context of familiarity it’s good. Sort of. Half of it is good and half is bad. All of it is bad for Lulu because of a purpose mismatch.

I work in a safety-critical software development environment. I, personally, don’t write safety-critical software but I do review it and analyze it and research ways to make it more functional and more safe. So I know something about it.

I know, for example, that an essential feature of any safety-critical design (hardware or software or both) is “fail-safety”. That is, the idea that if a component fails, it does so in such a way that the result is safe. This is usually accomplished by the equipment constantly doing work to assert the permissive state, so that when the work stops, the equipment drops to a known safe state. An example of this is the “track circuit” system in fixed-block rail (an antiquated but functional and very cheap system) — basically a current is run through a rail and a relay is connected. As long as current is detected, the relay is closed (and it’s a gravity-open or spring-open relay — constantly asserting “occupied” unless powered closed). When a train comes by, current travels between the two rails through the train, short-circuiting the system, opening the relay, and flagging the block as “occupied”. So if power fails, the fail-open relay asserts “occupied” whether or not a train is there, because it’s safer to assume one is there than not. If the magnet on the relay fails, the relay fails (thanks, gravity and/or spring) open, flagging “occupied”. If a metal bar falls across the rails, it shorts the circuit and the block is flagged “occupied”. Basically we do work to hold the permissive state, and anything that interrupts that work (a failure) flags the region as occupied and closed, which is safe.
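The track circuit logic above is simple enough to sketch in a few lines of code. This is just my illustrative model — the names and the three failure inputs are mine, not anything from a real signalling system — but it shows the essential property: the “clear” state requires everything to be working, and any failure at all collapses to the safe “occupied” state.

```python
from enum import Enum


class Block(Enum):
    OCCUPIED = "occupied"  # the safe assumption
    CLEAR = "clear"        # requires constant, successful work to assert


def block_state(power_ok: bool, relay_magnet_ok: bool, rail_shorted: bool) -> Block:
    """Toy model of a fixed-block track circuit.

    The relay is held closed ("clear") only while we actively do work:
    power is on, the relay magnet functions, and nothing (train, fallen
    metal bar) shorts the rails. Any interruption drops the relay open,
    which reads as "occupied".
    """
    energized = power_ok and relay_magnet_ok and not rail_shorted
    return Block.CLEAR if energized else Block.OCCUPIED


# Every failure mode collapses to the safe state:
assert block_state(True, True, False) is Block.CLEAR      # normal empty block
assert block_state(True, True, True) is Block.OCCUPIED    # train shorts the rails
assert block_state(False, True, False) is Block.OCCUPIED  # power failure
assert block_state(True, False, False) is Block.OCCUPIED  # relay magnet failure
```

Note that only one combination of inputs ever yields “clear” — that asymmetry is the whole trick.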

So that’s pretty cool — it’s a remarkably simple principle that can make very complex systems certainly safe. It has a side-effect, though: it’s very brittle. Because so many failures are treated safely, and because the safe state is almost always a shutdown in operation, transient failures cause the system to halt, which is very inconvenient. Worse, marginal failures that are not unsafe often must be treated the same way (or are so coarsely detected that they fall under the same category as any other failure) and so you can have perfectly safe situations causing an outage.

This, I think, is what happened to us at Lulu. I’m not saying that they are safety-critical, but they have used a safety-critical design pattern inappropriately or perhaps without attending to the rest of the design pattern: the bit that says “this will have these effects on service and you need to ensure this other thing or you’re screwed”. They appear to have a mechanism that I will call “hold and latch” on error.

This means that when they detect a certain category of error (in this case a printer’s failure to print — and this is not stupid because they contract printers so they can’t just solve it instantly themselves) they hold the process (de-list the item so it cannot be further ordered when there’s a known error in the production pipeline, thus avoiding pile-ups) and latch it (disallow automatic recovery so that it must be verified solved by a human before it can proceed). In rail, this process is used when the guideway detects an intrusion near a platform, because this usually means a human has fallen into the track area. When this happens, a hold occurs (any trains nearby apply emergency brakes and no train motion is allowed in the region) and it is latched (it can only be cleared when authorized personnel have visually inspected the region and reported it clear).
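The hold-and-latch pattern described above can be sketched as a tiny state machine. Again, this is my own sketch of the pattern — the class and method names are hypothetical, not anything from Lulu’s actual system — but it makes the failure mode concrete: the latch deliberately does not clear itself when the fault goes away.

```python
class HoldAndLatch:
    """Sketch of a hold-and-latch error mechanism.

    On a detected fault the item is held (de-listed, so it cannot be
    ordered) and latched (automatic recovery is disallowed; only an
    explicit human sign-off restores service).
    """

    def __init__(self) -> None:
        self.latched = False
        self.available = True

    def fault_detected(self) -> None:
        self.available = False  # hold: take the item out of service now
        self.latched = True     # latch: remember the fault until a human clears it

    def fault_condition_gone(self) -> None:
        # Deliberately does nothing: the latch stays set even when the
        # underlying fault clears. Service resumes only via clear_latch().
        pass

    def clear_latch(self) -> None:
        # Represents the human verification step; only then does service resume.
        self.latched = False
        self.available = True
```

The design is only as good as the guarantee that someone actually calls `clear_latch()` promptly — which is exactly the caveat I think got missed.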

The problem with Lulu is obvious (to me anyway): they have latched it but have no effective way to determine whether or not the hold condition has been cleared. Their printer is not talking back adequately (and if they are supposed to unlatch it, automatically or otherwise, they are not) or is not being timely in clearing the latch, or has communicated but Lulu proper is not clearing the latch and restarting service. This is the wrong pattern for Lulu because Lulu can afford transient errors: lives are not at stake here. A substantial queue of work can be managed, and relatively cheaply. Less expensive, certainly, than failing to sell product, which is, presumably, how they make their money. So treating the queue as a safety case is not helpful here, but it’s a seductive methodology if you are very very tightly focused on a single cost. Especially if your focus is so tight that you have not attended to the caveat: you need to clear your state rapidly and correctly or everything stops working.