Essay on fault-tolerant systems (part 2) – Bad organizations can kill your products


As I mentioned in part 1, large-complex SW can only be done in time by having tons of people concurrently working on it, which ends-up forming an organization of tasks and people. This is not exclusive to SW, but to any industrial product.

I’m pretty sure non-SW visitors will like this post as well, as there’s value that applies to all organizations regardless of their niche.

If your organization is very stable, you can skip this post and jump directly to part 3 (still under development).

Bad organizations can kill your products

“When an organization is unhealthy, no amount of heroism or technical expertise will make up for the politics and confusion that take root.” – Patrick Lencioni

The main organizational killer of any industrial product, such as enterprise software, is lack-of-stability. If your engineers are firefighting issues all the time, there’s never time to do the “right-thing right”.

If you overburden your employees with urgent tasks all the time, and you overwork them consistently, employees start getting stressed about not being able to meet deadlines, not making any progress and not even being able to do routine-things outside of work: engineers, like anyone else, have a personal life and have to pay bills, buy groceries, walk the dog, take the kids out, go to school meetings, sleep, eat, go to the doctor, workout, meet with friends, call their parents, buy shoes, etc.

If you remove all capacity from your employee to even do routine personal stuff, the stress levels will raise and the employee will become tired, sloppy, aggressive and, eventually, will quit.

If your organization have severe stability issues, there’s no point on moving forward with deeper technical changes, as any attempt to make complex refactors to fix the situation will be abandoned as soon as the next hot issue arrives.

You must first fix your organizational issues to stabilize the work-environment. After that, you can stabilize or improve your SW.

Symptoms of bad organizations

You must avoid at all costs enabling, endorsing or condoning a working atmosphere with sustained levels of stress or an unhealthy culture, since they will cause stability issues.

It doesn’t matter if your company advertises lots of “ethics and values” if they exist only on their brochure: the true ethics and values of a company are the ones the management-chain execute every day.

Here are some tips to help you recognize bad-org common symptoms and, hopefully, make something about them:

  1. High attrition levels.
    1. Attrition cause the remaining employees to get even more stressed due the workload being split among the survivors.
    2. Also, there’s always a domino effect: once an employee quits and spreads the word about greener fields, other employees start looking for jobs too. Other employees get anxious and start fearing that the job will not be stable since several persons are leaving the company, which makes them feel an urge to run before the company decides to shutdown the operation.
    3. Replacing an employee can take months, which makes you waste time reading resumes, performing interviews, negotiating salaries, etc.
    4. According to recent studies, it takes up to 9 months of salary just to replace a valuable team member (taking in account all implied costs such as impact on business, recruiters fees, hours invested in interviewing, etc.).
      1. If you split “9 months worth of salary” among 36, you can easily realize the employee could have gotten a 25% raise for 3 years for the exact same money the company would lose by letting the employee leave. Simple math.
    5. Attrition by itself can be caused by multiple reasons such as stress, long-working hours, bad salary, no promotions (or promoting the bad employees), seniors undermining juniors, etc.
      1. Some organizations decide to self-deny that they have high-attrition levels by comparing them to “the rest of the teams” or “to other similar industries” (as if attrition was a good thing if everyone is suffering it too) but, if not properly reversed, attrition is a top morale-killer.
  2. Treat everything as Priority 1, everyday.
    1. If everything in your organization is labeled or treated as “Priority 1”, it means the management just does not care about prioritization at the expense of burning out the teams.
    2. Even if tasks don’t get explicitly labeled as “Priority 1”, some phrases could be used in-lieu-of. Examples:
      1. “Expedite this task ASAP, but these other 15 tasks have to also be done before lunch no matter what”.
      2. “This project has high visibility. If you don’t close these 10 new tasks by today EOD, then upper-management will come and get it fixed for us”.
    3. When things are being correctly prioritized (i.e., priority 1, 2, 3…N), employees can focus on a single task at a time.
    4. However, when management decides to treat all incoming emails, tickets and petitions with the same equal level of importance, only the next-most-visible-urgent-task will be flushed (which is rarely the next most important task).
  3. Long learning-curves for newcomers (regardless of their total years of experience).
    1. If newcomers take a very long time to share the load with the team, the other employees still have to deal with the workload during all the learning phase.
    2. The manager must define an effective on-board process that helps newcomers to adapt quicker, before stress kills the remaining employees that support the operation.
  4. Having hard-dependencies on other teams or services to make your work.
    1. If your team cannot effectively do their work without having hard dependencies on other people, probably you should consider restructuring the teams or defining a black & white charter of responsibilities per team.
  5. Struggling if one single team member resigns.
    1. If it takes one single employee that quits to take your team down to its knees, you either have strong dependencies on individuals (which is a bad thing for an organization), lack of staff (also a bad thing, specially if you can afford more hiring), or you’ve just let your most valuable employee leave (probably for reasons such as “lack of recognition” or “no salary adjustment”).
  6. Lack of experienced people in the team and organization.
    1. Don’t be cheap with your staff: Provide them high-quality training, offer competitive salaries and challenge them with interesting technical projects.
    2. With that, employees will stick around you for longer, which means you’ll be able to utilize their experience in your products better.
    3. Also, don’t hesitate hiring senior engineers. Their experience will help to reduce the issues as time goes.
  7. Lots of meetings (as in “every another hour” or worst) or tons-of-emails (as in “if you don’t constantly check your email every 10-15 mins, something really bad could happen”).
    1. One thing is being collaborative, and other thing is needing to constantly communicate to avoid mayhem.
    2. If cross-team coordination must be constantly reviewed and discussed, then it means no process is being followed and that everything is being resolved on the fly.
    3. If everyone just knew exactly what to do, little communication would be needed.

Have you ever heard developers saying that their company has “poor management” or “too many internal politics” or is “very slow and bureaucratic“? Well, they literally refer all the symptoms described above (and, yes, those only 3 phrases are good-enough from a developer’s perspective to describe the entire situation).

Best practices to fix a broken organization

Set stabilization as a top-priority

Some organizations choose to pretend there’s nothing broken with them and keep assigning operations tasks on the top of more urgent issues.

  1. For example, they keep accepting new projects, instead of stabilizing the current ones.
  2. Whenever lack-of-stability is damaging your team, all work should be focused to reach a stability-level. For example, close all critical SW bugs, stop attending to non-important meetings, stop accepting new work, get more reliable HW, cancel parked projects, reprioritize projects by impact, etc.
  3. Remember that, even if you are making some changes to your SW and HW during this effort, this is still considered an organizational, not technical, task, since the goal is to stabilize the team.
  4. Changes doesn’t have to be perfect, but they have to dramatically reduce support and time-waste so that the team can have a deep breath and regain control.

Reject impossible deadlines

Impossible deadlines yield to progress-slowdown and kills creativity, which makes late-projects even later. Individuals starts taking bad decisions just to meet the deadline:

  1. Technical staff will start reducing quality in SW, skipping unit tests, coding just for the happy-path scenarios, hard-coding business logic, using libraries without appropriate license/legal/technical review, pushing changes directly to production, etc.
  2. Managers will start imposing terrible decisions, such as reducing or eliminating test phases, requesting to skip code-reviews, demanding to patch instead of fix, ignoring input from expert technical staff, ignoring security concerns, ignoring compliance policies, etc.

Avoid multitasking or assigning multiple projects at the same time

Individuals should focus only on one task of one project at a time. Context-switch often costs a lot more than allowing developers to flush one task at a time.

  1. Developers need to have a lot of “temporary data” in their heads while doing their work, from variables names to complex structures, environments they’re pointing to, design ideas, frequent file paths with log outputs, etc. If developers are frequently interrupted with other tasks, they must dump it all in order to load a completely new set of similar data for the new tasks, which can take hours just to regain full throttle.
  2. Context-switch still occurs with “quick tasks”, “5-minutes meetings” or “tasks in-between tasks”, such as being asked to “quickly review a document and send an OK-email while your totally-unrelated-code still compiles”. All those will still cause the developer to lose an hour just to recover full throttle.

Hire the correct staff for your team

  1. Experienced engineers make a difference.
    1. Have at least a couple of senior engineers and architects in your team, asides to other experienced members.
      1. Architects should have opportunity to validate design ideas with someone just as experienced, which helps catching design-faults on time.
      2. Architects must rely heavily on senior engineers to implement core-components. This practice both frees the architects from heavy tasks (allowing them to focus on more components design), and nurtures a learning-path for senior engineers to absorb knowledge from architects in an organic way.
      3. Senior Engineers will eventually be prepared to take other complex projects on their own, which is a gain to the company.
    2. Because of this staff redundancy, the system development can continue as usual (with minimum or zero impact) in case of medical leave, vacations or other absence situation, while still having experienced members looking out the conceptual integrity of the system.
  2. Hire professional QA members with system-test experience. They know better how to test system aggregations and partitions between components than regular developers.
    1. With dedicated QA members, tests can be started as soon as the code becomes present on the continuous-integration builds, which provides faster turnaround rates between developers and testers (a major gain on large-systems development).
    2. Just like with developers, there are many specializations for testers. Some testers specialize on web systems, others on APIs, others on infrastructure, etc. Get the right QA staff for your projects.
  3. Assign a dedicated Project Manager (or program manager, scrum master, etc.) to the development team.
    1. The project manager should remove blocking issues from the rest of the team members, so team members can focus on their tasks as opposed on metawork.
    2. Sounds obvious, but each metawork task causes context-switch, which has been discussed above.
    3. Project Manager should also be in frequent contact with other important organizations, such as Legal or IT-Security, so that all compliance questions from the team can be expedited ASAP.

“Culture eats strategy for breakfast.” – Peter Drucker

Once you have reached a comfortable level of stability that makes work-days predictable, you can start applying SW patterns from part 3 (still under development).


In a job I had, the entire company used to be managed by individuals, from the lowest through the highest position, who loved to constrain the work environment in a way that SW developers would consume all their time on everything but SW development. They hired highly skilled developers and paid them as such, but they would then overload the developers with non-development tasks. Developers who were determined on keeping sharp their programming skills would eventually need to leave the company.

Every time a developer vented that their calendar was overbooked and presentation slides, spreadsheets and email client had become their only working tools, I used to tell them “It looks like you fell into the metawork pit”. Some people would catch the term really quickly, but others would stare at me with a face like this:


So, what’s metawork then?

Before writing this entry, I googled a bit and found that others were using “metawork” as a term as well. Just like metadata is “data about the data”, metawork is “work about the work”; some said “metawork is anything you need to do in order to do your work”, which is a pretty over-scoped definition. As a developer, your product is useless if not distributed to your customers so, under that subjective definition, one could easily say that “coding is metawork for a developer, since it’s a prerequisite for performing the actual work that matters: deploying the code to production”, which is pretty lame.

When I coined the “metawork” term, I intended a much more specific definition:

Metawork is each worthless task you do at work that’s guaranteed not to improve your product, career and personal growth.

Metawork is garbage, waste, dilapidation, loss.

Let’s consider the next example of a user-story life cycle (for the sake of keeping this post clean, let’s NOT start a discussion of Waterfall vs Agile, this is just for demonstration purposes):




Can you tell which of the tasks above is metawork? Answer is: None of them, and all of them.


If you paid attention to my definition, you have noticed the “career” part. That means that what’s metawork for some, might not be for others.

For example, for SW developers:

  • Analyze is part of their job description and, the more they do this, the better their skills become. Thus, this is not metawork.
  • Code is a must. Period.
  • Test is on a grey area. For example, if it’s a developer that’s targeting his/her career to become a formal tester, then it’s definitely not metawork. If it’s because the manager is cheap and wants to avoid hiring a tester/QA member, then it’s metawork.
  • Deploy is another grey area. For example, does your company have positions specialized in deploying? Do they have a formal deploy process where only some specific people are entitled to deploy (for example, QA Leader or Product Owner)? Are deploys automated? If it’s a task that takes a lot of time and preparation, it might be metawork.
  • Support would usually be metawork, since it will only keep consuming the developer’s time. One exception would be if the developer is trying to deal with customers’ interactions as part of their career development plan; another exception would be if the code has an error in production and a bugfix is required.

Now let’s analyze the same example from a QA member perspective:

  • Analyze is part of their job description, as they check if incoming changes would break existent requirements, or if the test cases must be updated and such. Not metawork.
  • Code is on a grey area, as some testers automate their tasks, which is a good thing, but coding the final product is metawork.
  • Test is a must. Period.
  • Deploy is on a grey area: Unless only QA members are entitled to deploy after stamping their quality seal on the build, it would be definitely metawork.
  • Support would definitely be metawork.

It’s very easy to get lost in the metawork pit. Companies should pay more attention on how the money they invest in developers’ salaries is being used (or any employee whatsoever).

Furthermore, when companies force developers to metawork by doing “all possible tasks of a SW development project”, they are actually serializing all work, all the time, meaning that they are deploying less and less frequently, which seriously damages the time-to-market capabilities of any business.

Other examples of metawork [for a developer] are:

  • Making the developers to admin their own servers (especially if there are dozens of them; also seasoned developers are typically more expensive than mid-range sysadmins, for example).
  • Requesting developers to justify their own existence (i.e., asking them to create complex business presentations filled with hard data about why they are actually useful to the company and why they shouldn’t be fired, and make that type of presentations one of their weekly deliverables).
  • Supporting a business critical operation while also expecting SW projects to be done simultaneously (context-switch will never permit making efficiently neither job).
  • Perform each job candidate search, phone screening, interviews, hiring, training and such of each position for the team (specially if a team is contractor-based and contracts expire only few months later, keeping this cycle forever).
  • Designate a developer to be also scrum-master/project-manager for a team (being forced to resolve everyone else’s impediments, but also expected to produce code on time).
  • Demand the developer to code hundreds of workarounds to overcome to OS/HW limitations, just to save a few bucks of getting newer servers or upgrading some licenses.


You also mentioned something about “product” and “personal grow”, right?

Just think about this: each time a person is hired by a company, they are supposed to be hired for a specific position, for an specific team, for an specific purpose; usually, the hiring contract specifies many of these things.

As a developer, you would usually sign the contract and switch to a new job because the product they work on grabbed your attention and because you think it will help you growing in your career and as a person, right? RIGHT?

But that doesn’t mean that anything outside of coding has to be considered metawork. There are plenty of things at work a developer could do to improve the quality of the product, or to become a better person while still doing work-related tasks.

Sometimes, a developer needs to do a couple of things in order to stabilize a project before making big code changes. Remediating part of the tech debt is a typical example of this; for example, a developer might find no source control management system was being used and that all code was being stored in a floppy drive  CD-ROM   DVD   USB stick  cloud folder (ouch, I’m getting old(er)), so he might decide it is required to create a private GitHub repo and upload everything in there, then grant access to the team members, document the “source control check-in process”and train the team on this, before actually doing any new code.

  • If you think about this example, the developer is fostering the quality improvement of the final product by enforcing formal source control processes, which allows better handling of code branches, increasing concurrency between developers, sharing code interfaces at sooner stages, easing TDD processes between testers and developers, etc.
  • Also, the developer is not keeping this knowledge only to himself, since he is documenting the benefits of this approach, as well as providing training to all the team. He wants everybody to learn about this and help them growing in their careers, thus making him grow as a person.

Who knows, one day someone would refer to that developer as “the only person who actually cared about teaching others in the team” and maybe, maybe one day he will get a nice job offer from one of his former coworkers who appreciated his labor.

That, my dear readers, cannot be metawork. Personal satisfaction it’s part of the work, and it’s part of the fun.