Safe software sucks

I haven’t talked a lot about the recent downtime that VSCA has had with Lulu — currently Diaspora is not available because of a fault in their system. It’s been down for six full days and there is no sign that it will be corrected soon. There’s no sign it won’t either. Basically there are no signs.

So this is an interesting automation failure — a highly automated system stops functioning and offers zero information. That’s bad. But it’s also familiar to me and in the context of familiarity it’s good. Sort of. Half of it is good and half is bad. All of it is bad for Lulu because of a purpose mismatch.

I work in a safety-critical software development environment. I, personally, don’t write safety-critical software but I do review it and analyze it and research ways to make it more functional and more safe. So I know something about it.

I know, for example, that an essential feature of any safety-critical design (hardware or software or both) is “fail-safety”: the idea that if a component fails, it does so in such a way that the result is safe. This is usually accomplished by having the equipment constantly assert the tricky (unsafe) state, so that when it stops asserting, the equipment drops to a known safe state. An example of this is the “track circuit” system in fixed-block rail (an antiquated but functional and very cheap system). Basically a current is run through the rails and a relay is connected. As long as current is detected, the relay is held closed (and it’s a gravity-open or spring-open relay, so it constantly asserts “occupied” unless powered closed). When a train comes by, current travels between the two rails through the train, short-circuiting the system, opening the relay, and flagging the block as “occupied”. So if power fails, the fail-open relay asserts “occupied” whether or not a train is there, because it’s safer to assume one is there than not. If the magnet on the relay fails, the relay fails open (thanks, gravity and/or spring), flagging “occupied”. If a metal bar falls across the rails, it shorts them and the block is flagged “occupied”. Basically we do work to hold the unsafe state, and anything that interrupts that work (a failure) flags the region as closed, which is safe.
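The track circuit logic above can be sketched in a few lines of Python. This is a toy model, not real signalling code: the class, its field names, and the "clear"/"occupied" strings are all made up for illustration. The point it tries to show is that "clear" is the only state that has to be actively earned, and every failure falls through to "occupied".

```python
from dataclasses import dataclass


@dataclass
class TrackCircuit:
    """Toy model of one fixed-block track circuit (all names hypothetical)."""
    power_ok: bool = True         # supply is feeding current into the rails
    coil_ok: bool = True          # relay electromagnet is working
    rails_shorted: bool = False   # a train axle (or a metal bar) bridges the rails

    def relay_closed(self) -> bool:
        # The relay is only held closed while current reaches a working coil.
        # Gravity or a spring opens it as soon as any of these stops being true.
        return self.power_ok and self.coil_ok and not self.rails_shorted

    def block_state(self) -> str:
        # "clear" has to be actively asserted; every other case reads "occupied".
        return "clear" if self.relay_closed() else "occupied"


healthy = TrackCircuit()
train_present = TrackCircuit(rails_shorted=True)
power_failure = TrackCircuit(power_ok=False)
dead_magnet = TrackCircuit(coil_ok=False)
```

Only `healthy` reports the block as clear. The other three report it as occupied, and the model cannot tell a real train from a power failure, which is exactly the coarseness that makes the approach both safe and brittle.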

So that’s pretty cool: it’s a remarkably simple principle that can make very complex systems certainly safe. It has a side-effect, though: it’s very brittle. Because so many failures are treated safely, and because the safe state is almost always a shutdown in operation, transient failures cause the system to halt, which is very inconvenient. Worse, marginal failures that are not unsafe often must be treated the same way (or are so coarsely detected that they fall under the same category as any other failure), and so you can have perfectly safe situations causing an outage.

This, I think, is what happened to us at Lulu. I’m not saying that they are safety-critical, but they have used a safety-critical design pattern inappropriately or perhaps without attending to the rest of the design pattern: the bit that says “this will have these effects on service and you need to ensure this other thing or you’re screwed”. They appear to have a mechanism that I will call “hold and latch” on error.

This means that when they detect a certain category of error (in this case a printer’s failure to print — and this is not stupid because they contract printers so they can’t just solve it instantly themselves) they hold the process (de-list the item so it cannot be further ordered when there’s a known error in the production pipeline, thus avoiding pile-ups) and latch it (disallow automatic recovery so that it must be verified solved by a human before it can proceed). In rail, this process is used when the guideway detects an intrusion near a platform, because this usually means a human has fallen into the track area. When this happens, a hold occurs (any trains nearby apply emergency brakes and no train motion is allowed in the region) and it is latched (it can only be cleared when authorized personnel have visually inspected the region and reported it clear).
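A minimal sketch of the “hold and latch” idea described above, assuming a single fault flag and a single latch. The class and method names are invented for illustration and do not describe Lulu's system or any real rail control product.

```python
class HoldAndLatch:
    """Toy hold-and-latch guard (all names are illustrative assumptions)."""

    def __init__(self) -> None:
        self.fault_present = False
        self.latched = False

    def detect_fault(self) -> None:
        # Hold: stop the process. Latch: remember that a human must release it.
        self.fault_present = True
        self.latched = True

    def fault_cleared(self) -> None:
        # The underlying condition going away does NOT release the latch.
        self.fault_present = False

    def manual_reset(self, inspected_clear: bool) -> bool:
        # Only a human report of "clear", with no fault still detected, unlatches.
        if inspected_clear and not self.fault_present:
            self.latched = False
        return not self.latched

    def service_allowed(self) -> bool:
        return not self.latched


guard = HoldAndLatch()
guard.detect_fault()
guard.fault_cleared()
still_held = not guard.service_allowed()   # the fault is gone, the latch is not
released = guard.manual_reset(inspected_clear=True)
```

Note that nothing in the object ever calls `manual_reset` on its own. If no human shows up to do it, service stays down indefinitely, however long ago the original fault went away.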

The problem with Lulu is obvious (to me anyway): they have latched it but have no effective way to determine whether or not the hold condition has been cleared. Either their printer is not talking back adequately (and if the printer is supposed to unlatch it, automatically or otherwise, it is not doing so), or it is not being timely in clearing the latch, or it has communicated but Lulu proper is not clearing the latch and restarting service. This is a problem for Lulu because it can afford transient errors: lives are not at stake here. A substantial queue of work can be managed, and relatively cheaply. More cheaply, certainly, than failing to sell product, which is, presumably, how they make their money. So treating the queue as a safety case is not helpful here, but it’s a seductive methodology if you are very, very tightly focused on a single cost. Especially if your focus is so tight that you have not attended to the caveat: you need to clear your state rapidly and correctly or everything stops working.
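For contrast, here is a rough sketch of the queue-based alternative suggested above: keep accepting orders while the printer is failing and work through the backlog once it recovers. Everything here (the `OrderPipeline` class, its methods, the order IDs) is a hypothetical illustration, not a description of how Lulu actually works.

```python
from collections import deque


class OrderPipeline:
    """Toy pipeline that queues work during a printer fault instead of de-listing."""

    def __init__(self) -> None:
        self.printer_ok = True
        self.backlog = deque()   # orders accepted but not yet printed
        self.printed = []        # orders that made it through

    def submit(self, order_id: str) -> None:
        # Orders are always accepted; a fault only delays them.
        self.backlog.append(order_id)
        self._drain()

    def printer_failed(self) -> None:
        self.printer_ok = False

    def printer_recovered(self) -> None:
        # Recovery is automatic: no latch, no human sign-off required.
        self.printer_ok = True
        self._drain()

    def _drain(self) -> None:
        while self.printer_ok and self.backlog:
            self.printed.append(self.backlog.popleft())


pipeline = OrderPipeline()
pipeline.submit("order-1")
pipeline.printer_failed()
pipeline.submit("order-2")
pipeline.submit("order-3")
queued_during_fault = list(pipeline.backlog)
pipeline.printer_recovered()
```

The cost of a fault here is a growing backlog rather than lost sales, which is a reasonable trade when nobody's life depends on the outcome.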

–BMurray

Posted by halfjack   @   23 November 2009

10 Comments

Comments
Nov 23, 2009
10:40
#1 Fred Hicks :

Crap. :P

Have you tried simply setting up a second instance of the product, to bridge the gap? Or is that simply unacceptable to you?

Nov 23, 2009
10:47
#2 halfjack :

If I see no motion soon I have two methods to escalate and that’s the second. Both carry the risk that if the data doesn’t actually change, there’s no particular reason to think that the error will go away, so I’m conserving energy here.

The first is to add a new revision with trivially modified data to the existing Diaspora project. This retains the old project ID, which is important because it’s encoded in the URL to buy the product, and that URL exists in several places, few under my control.

The second is as you describe, though again, unless I anticipate the nature of the print error and correct it with a change in data, there’s no reason to expect a new result.

Nov 23, 2009
15:27
#3 Roger :

Disclaimer: I have no information about Lulu that would suggest to me the following scenario applies to them, but I’ve seen it in action in similar organizations.

So. Let’s say you’re a manager and you’re running this publishing function. Your bosses get together and decide that responsiveness is an important metric, so they’ll be measuring how long it takes for your group to ship a product once it’s been ordered. They give you some target to meet. Start missing that target, and you’re looking at lost bonuses and maybe a lost job.

Then something bad happens. It’s out of your control, but no one really cares about that. There’s a problem and there’s no way you can meet your targets. What do you do?

One thing you could do is stop accepting orders entirely. No product ships and no targets are missed. You may have just saved your job.

Of course, no one is happy with that solution. Not you, not your bosses, not the clients. But it’s an obvious response to the reward system that has been implemented. And the behaviour won’t change unless the system changes.

Nov 23, 2009
15:33
#4 halfjack :

Roger, you’ve actually anticipated a post that’s half-written in my head — why bad metrics are worse than no metrics. What you offer by example above is a stellar instance of measuring the wrong thing. Here’s the rule I use: if you measure something and tell employees that if that measurement improves then they will be happier, that measurement will improve.

There is no assurance that what you actually want to happen will indeed happen. But the numbers you asked for will steadily improve.

Nov 24, 2009
11:59
#5 Roger :

Must be something in that Western Canadian air.

Nov 24, 2009
17:27
#6 walkerp :

I sense Lulu is one of those companies with very bad vertical communications. The debacle with the foreign postage and now this just has that smell. Are they publicly traded? Anybody seen a management org chart?

Dec 7, 2009
10:09
#7 Tim Gray :

This is probably after the fact now. But my experience of Lulu, and what I’ve heard from others, is that the staff tend to be tech geeks with a whizzy service. When everything “just works” it’s great. When something goes wrong they don’t have the communication and customer service skills to provide a good customer experience.

Dec 7, 2009
10:37
#8 halfjack :

Tim, I’m not convinced that the tech geeks at Lulu are involved in customer support at all any more, and that might actually be a new problem for them: the way they have tightly corralled contact with their support personnel (and the number of “hand-offs” in this particular instance) suggests to me that they are sub-contracting support with all that that entails. The identification with the actual corporate interests has been set another level below even technical interest.
