Thoughts on Craftsmanship & A Philosophy of Software Design Notes

Dec 31, 2022

Reading notes and thoughts on the book by John Ousterhout

Two skills I relearned on the Big Data Platform team are leadership and craftsmanship. We have discussed leadership in many previous posts; in this post, we shift to another critical skill I completely relearned: software craftsmanship.

Most companies emphasize craftsmanship as a key part of their engineering culture for building solid and sustainable software; however, the definition has been vague, and the route to craftsmanship is also unclear. On the other hand, almost every company loves “coding machines”, especially on application development teams that must build features as fast as possible to meet product timelines. Ironically, craftsmanship is often ignored in recognition: we celebrate the launch of a new feature, but rarely mention the quality or complexity of the new product (monitoring, performance, etc.). In the book, people who write a lot of code without building quality software are called “tactical tornadoes”. Super acute.

Feature Development

Nonetheless, it is not easy to feel the pain of missing craftsmanship on a product team. The APIs you design are mostly used only by yourself or your immediate product team (~5 engineers); the critical APIs of a product are actually its “product APIs”, the UX you present to end users. You can still make breaking changes to your code-level APIs since few people use them anyway, but you cannot change end users’ expectations. As for feature code, our analysis from a few years back showed a 50–60% decay rate year over year, meaning at least half of your code (much more for non-infra code) is gone by the next year. I suspect this is one of the key reasons why moving fast is more beneficial to the organization than craftsmanship in many feature development teams. This is also reflected in promotion discussions: craftsmanship evidence is the hardest to collect in product teams because striking the balance between craftsmanship and speedy execution is extraordinarily challenging.

Infrastructure Development

This is no longer the case in big data and AI infrastructure land, though. The APIs you build will be used by other software engineers, and they are “smart” as well. If you don’t build robust and well-thought-out APIs, the experience deteriorates incredibly fast, and migrations for even “a single line of change” can take multiple engineer-quarters. A few examples:

Data access schema and format migration: just 1 line of change! We have been using Hadoop as the OS for our big data system and Avro as our raw storage format for about a decade. The team exposed hdfs:// and the default AvroReader to customers. As we planned to move to Azure for compute and to new data formats like ORC or Parquet, the cost of moving everything to a new API has been extremely high. From the infra team’s perspective, it is just 1 or 2 lines of change from AvroReader to some GenericReader and Writer, which abstract away the underlying hdfs:// and Avro behind a hybrid-cloud-compatible interface like genericfs:// and GenericReader(path: String), but the cost can be huge for the company. The reason: our AI models are sensitive to the data; if there is any data loss or corruption, the models will break and impact the business. As a result, every single AI model (including those still in development or, worse, those built by engineers who have left the company) needs to be re-validated. The good news is that, because of the push for cloud, we still made the migration happen after 3 years (a lot of that time was spent in negotiation and alignment).
Hadoop upgrade: all compatible! We upgraded the fleet from Hadoop 2.7 to Hadoop 2.10 to leverage a few critical features for scalability and cloud compatibility. Unlike the previous example, we didn’t expect any API changes, and it was supposed to be a transparent upgrade. Well, it still took us 3 quarters to finish the upgrade end to end. There were 2 main reasons. (1) We leaked some of the internal APIs to a few critical customers and thereby sabotaged our own system. We often say that if a system works for 90% of the customers, that is good; in reality that is rarely the case, because it is the remaining 10% of customers that matter and cause most of the issues. They are normally high-priority flows that require special consideration and often collaborate with infra teams on deeper “optimizations”. These optimizations often come at the cost of tech debt and hacks. In our case, some internal APIs or jars had been hard-coded into many critical flows to work around previous site issues or perf issues, and we had to sweep out all these mines one by one. (2) We lacked a usable distributed system benchmark kit. We have a few siloed benchmark suites that work, but only 1–2 engineers know how to operate them. Unfortunately, during the upgrade, those few engineers coincidentally all went on long vacations, and we lost a good month figuring out how to benchmark our system while waiting for them to come back.
Client library upgrade: just need to bump the version! Yes, from the infra team’s perspective, users will continuously bump their versions themselves, or, as long as they keep building their systems, newer versions of the libraries will be pulled in. However, you’ll be surprised by the number of products that are in maintenance mode but critical. There are many critical ETL flows that have been running for years and have outlasted the engineers on the team; nobody touches them, and more and more pipelines get built on top of them. They are basically stuck with the old libraries. So even if you deprecate a version of a library, the existing flows and services still use it. Service-oriented architecture and thin clients just make much more sense in hindsight.
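The first example above comes down to what the interface exposes. A minimal sketch of the idea, assuming a hypothetical backend registry (register_backend, _fake_backend, and the in-memory records are invented for illustration; only the genericfs:// scheme and the GenericReader(path) shape come from the text): callers depend on one reader and one path, and the storage system and file format stay behind it.

```python
from typing import Callable, Dict, Iterator
from urllib.parse import urlparse

# Hypothetical registry: maps a URI scheme to a function that yields records.
# Real backends (HDFS + Avro, cloud storage + Parquet) would plug in here.
Backend = Callable[[str], Iterator[dict]]
_BACKENDS: Dict[str, Backend] = {}


def register_backend(scheme: str, backend: Backend) -> None:
    """Infra-side hook: decide which storage/format serves a scheme."""
    _BACKENDS[scheme] = backend


class GenericReader:
    """Reads records from a dataset path without exposing storage or format."""

    def __init__(self, path: str) -> None:
        scheme = urlparse(path).scheme
        if scheme not in _BACKENDS:
            raise ValueError(f"no backend registered for scheme {scheme!r}")
        self._records = _BACKENDS[scheme](path)

    def __iter__(self) -> Iterator[dict]:
        return iter(self._records)


# Stand-in backend for illustration only; it fakes two records in memory.
def _fake_backend(path: str) -> Iterator[dict]:
    yield {"path": path, "row": 0}
    yield {"path": path, "row": 1}


register_backend("genericfs", _fake_backend)

# Customer code: this line stays the same when the backend is swapped.
rows = list(GenericReader("genericfs://tracking/events"))
```

With this shape, a storage or format migration is a change to the registered backend, not to every caller. It does not remove the need to re-validate downstream models; it only keeps the code change on the infra side.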

Now that we understand the cost of API changes and upgrades, we also understand the importance of API design: if you screw up your APIs, you are screwed and will be buried in tech debt. You will spend more time patching your systems and migrating your customers than developing new features or improving your systems, and your team will deteriorate in the meantime because nobody is interested in endless migrations.

The Book

There are many tactical design pattern books, but I rarely see one that summarizes the philosophy in one or two lines. “A Philosophy of Software Design” has satisfied my appetite: the core of software design is “simplicity” and “reduce complexity”. I simply can’t agree more with this. The story behind this lesson is the same as the one from the previous blog post:

We built a system named EMS for embedding deployment. It was already a fairly complex system, and it had worked so far. As we moved our modeling architecture from linear models to deep learning models, we started to deploy large embedding layers inside our recommendation models. Memory consumption at serving time was high for large embedding layers for two reasons: the vocabulary size (basically a map from a meaningful string to an id) and the embedding size (billions of parameters). To work around the memory issue, our initial proposal was to “extend” EMS to strip the embeddings out of the model and ship them to a key-value store named Venice (open sourced). The end system diagram before & after:

After training, we split the TensorFlow model into two portions: the MLP layers (dense layers) and the embedding layers. Instead of shipping the whole model to a Model Storage system, we run bulk inference on the embedding layers to generate embeddings and push those embeddings (key-values) to a remote k-v store, so the thin model no longer carries the heavy embedding layers. At serving time, the inference service queries the remote k-v store for the values and feeds them into the higher-level model inference logic.

This seems like a smart design, but the complexity is beyond manageable, especially with hourly pushes, where the whole development cycle runs every few hours. One key challenge in distributed systems is ensuring consistency. This system spans offline (offline scoring) and online (online inference), and one versioned model is split and stored across multiple systems (dense layers in Model Storage, embedding layers in the K-V store). Note that the embeddings here are not features shared across different models, but embeddings unique to each individual model. Building consistency across so many systems and engineering teams demands heavy coordination and delicate interface designs, which is more complicated than the original problem itself.

In the end, we spent months in a war room trying to make it work and then gave up; the system was just too fragile to handle any sophisticated use case. We stepped back and revisited the problem statement (vocabulary size and embeddings too big), and pivoted to feature hashing and quantization to solve the problem without adding further complexity to the pipeline (arguably this added complexity to the modeling side, but iteration velocity is probably 10x better, so overall complexity is still reduced significantly).
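For readers unfamiliar with the two techniques, here is a generic sketch of each, not the code from the system described above. The function names, the choice of MD5 as the hash, and the symmetric int8 scheme are all assumptions made for illustration. Feature hashing replaces the string-to-id vocabulary with a hash into a fixed number of buckets, and quantization stores each embedding value in fewer bits.

```python
import hashlib
from typing import List, Tuple


def hash_bucket(token: str, num_buckets: int) -> int:
    """Map a string to a row index without storing a vocabulary.

    Different tokens may collide in the same bucket; that is the
    accuracy-for-memory trade-off of feature hashing.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def quantize_int8(vector: List[float]) -> Tuple[List[int], float]:
    """Symmetric linear quantization: floats -> int8-range values plus one scale."""
    max_abs = max(abs(v) for v in vector)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(v / scale) for v in vector], scale


def dequantize(values: List[int], scale: float) -> List[float]:
    """Approximate reconstruction of the original floats."""
    return [q * scale for q in values]


row = hash_bucket("member:123", num_buckets=1_000_000)
quantized, scale = quantize_int8([0.5, -1.0, 0.25])
restored = dequantize(quantized, scale)
```

Both changes live entirely inside the model, which is why they avoid the cross-system consistency problem: nothing new has to be kept in sync between offline and online systems.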

Besides the high-level design philosophy, the book also has a “provocative” opinion on class design and a few criticisms of some Java APIs. We are often obsessed with the creed that a method or a class should not be longer than X lines, and that we should break longer ones into multiple modules or files. The author argues that the most important thing is never the number of lines but the cognitive overhead of your code.

Excerpts

Chapter 1 Introduction — It’s All About Complexity

Simpler designs allow us to build larger and more powerful systems before complexity becomes overwhelming.

The first approach is to eliminate complexity by making code simpler and more obvious… the second approach to complexity is to encapsulate it, so that programmers can work on a system without being exposed to all of its complexity at once. This approach is called modular design.

For much of the history of programming, design was concentrated at the beginning of a project…waterfall model… the entire system is designed at once, during the design phase. The design is frozen at the end of this phase. Unfortunately, the waterfall model rarely works well for software. Most software development projects today use an incremental approach such as agile development, in which the initial design focuses on a small subset of the overall functionality.

Chapter 2 The Nature of Complexity

Complexity is anything related to the structure of a software system that makes it hard to understand and modify the system.

Your job as a developer is not just to create code that you can work with easily, but to create code that others can also work with easily.

Symptoms of complexity

Change amplification.
Cognitive load. Sometimes an approach that requires more lines of code is actually simpler because it reduces cognitive load.
Unknown unknowns.

Complexity is caused by two things:

Dependencies.
Obscurity. Obscurity occurs when important information is not obvious. Inconsistency is also a major contributor to obscurity: if the same variable is used for different purposes, it won’t be obvious to a developer which of these purposes a particular variable serves. The best way to reduce obscurity is by simplifying the system design.

Chapter 3 Working Code Isn’t Enough (Strategic vs. Tactical Programming)

Many organizations encourage a tactical mindset, focused on getting features working as quickly as possible. However, if you want a good design, you must take a more strategic approach where you invest time to produce clean designs and fix problems.

Almost every software development organization has at least one developer who takes tactical programming to the extreme: a tactical tornado…who pumps out code far faster than others but works in a totally tactical fashion. In some organizations, management treats tactical tornadoes as heroes. However, tactical tornadoes leave behind a wake of destruction.

You should not think of “working code” as your primary goal, though of course your code must work. Your primary goal must be to produce a great design, which also happens to work. This is strategic programming.

The payoff for good (or bad) design comes pretty quickly, so there is a good chance that the tactical approach won’t even speed up your first product release.

The best way to lower development costs is to hire great engineers. However, the best engineers care deeply about good design. If your code base is a wreck, word will get out, and this will make it harder for you to recruit.

Chapter 4 Modules Should Be Deep

One of the benefits of a clearly specified interface is that it indicates exactly what developers need to know in order to use the associated module. This helps to eliminate the unknown unknowns problem described in Section 2.

An abstraction is a simplified view of an entity, which omits unimportant details.

The more unimportant details that are omitted from an abstraction, the better… an abstraction that omits important details is a false abstraction: it might appear simple, but in reality it isn’t.

The best modules are deep: they have a lot of functionality hidden behind a simple interface.

Deep modules such as Unix I/O and garbage collectors provide powerful abstractions because they are easy to use, yet they hide significant implementation complexity.

Red Flag: Shallow Module
A shallow module is one whose interface is complicated relative to the functionality it provides. Shallow modules don’t help much in the battle against complexity, because the benefit they provide is negated by the cost of learning and using their interfaces.

Interfaces should be designed to make the common case as simple as possible.
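To make "deep" concrete, here is a small sketch of a module with a two-method interface that hides several design decisions. SettingsStore and everything inside it are hypothetical; the point is the ratio between what the caller must know and what the module does.

```python
import json
import os
import tempfile
from typing import Any, Dict, Optional


class SettingsStore:
    """Persistent key-value settings. Callers see only get() and set().

    Hidden behind the interface: the JSON encoding, lazy loading, and
    atomic replacement of the file on every write.
    """

    def __init__(self, path: str) -> None:
        self._path = path
        self._cache: Optional[Dict[str, Any]] = None

    def get(self, key: str, default: Any = None) -> Any:
        return self._load().get(key, default)

    def set(self, key: str, value: Any) -> None:
        data = self._load()
        data[key] = value
        # Write to a temp file, then rename, so readers never see a partial file.
        directory = os.path.dirname(os.path.abspath(self._path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
        os.replace(tmp_path, self._path)

    def _load(self) -> Dict[str, Any]:
        # A missing file is treated as "no settings yet", not as an error.
        if self._cache is None:
            try:
                with open(self._path) as f:
                    self._cache = json.load(f)
            except FileNotFoundError:
                self._cache = {}
        return self._cache
```

A shallow version of the same module would expose load(), parse(), serialize(), and flush() and leave the caller to sequence them correctly, which moves the complexity into every call site.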

Chapter 5 Information Hiding (and Leakage)

The most important technique for achieving deep modules is information hiding. The basic idea is that each module should encapsulate a few pieces of knowledge, which represent design decisions. The knowledge is embedded in the module’s implementation but does not appear in its interface, so it is not visible to other modules.

Information leakage occurs when a design decision is reflected in multiple modules. This creates a dependency between the modules: any change to that design decision will require changes to all of the involved modules.

Red Flag: Information Leakage
Information leakage occurs when the same knowledge is used in multiple places, such as two different classes that both understand the format of a particular type of file.

When designing modules, focus on the knowledge that’s needed to perform each task, not the order in which tasks occur.

It is important to avoid exposing internal data structures as much as possible…interfaces should be designed to make the common case as simple as possible.
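The file-format red flag can be illustrated with a short sketch. The Event type, the tab-separated layout, and the helper names are hypothetical. The knowledge "how an event looks on disk" lives in one class, and both the writer and the reader go through it.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    user_id: int
    action: str


class EventCodec:
    """The only place that knows an event is stored as 'user_id<TAB>action'."""

    @staticmethod
    def encode(event: Event) -> str:
        return f"{event.user_id}\t{event.action}"

    @staticmethod
    def decode(line: str) -> Event:
        user_id, action = line.rstrip("\n").split("\t", 1)
        return Event(int(user_id), action)


def write_events(events: Iterable[Event], lines_out: List[str]) -> None:
    # The writer depends on EventCodec, not on the tab-separated layout.
    lines_out.extend(EventCodec.encode(e) for e in events)


def read_events(lines: Iterable[str]) -> List[Event]:
    # The reader also goes through EventCodec, so a format change touches one class.
    return [EventCodec.decode(line) for line in lines]
```

If the writer and reader each split and joined on tabs themselves, the format would be a design decision reflected in two modules, which is the leakage the chapter warns about.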

Chapter 6 General-Purpose Modules are Deeper

The sweet spot is to implement new modules in a somewhat general-purpose fashion. The word somewhat is important: don’t get carried away and build something so general-purpose that it is difficult to use for your current needs.

Questions to ask yourself:

What is the simplest interface that will cover my current needs?
In how many situations will this method be used?
Is this API easy to use for my current needs?

Chapter 7 Different Layer, Different Abstraction

Red Flag: Pass-Through Method
A pass-through method is one that does nothing except pass its arguments to another method, usually with the same API as the pass-through method. This typically indicates that there is not a clean division of responsibility between the classes.

Pass-through variables: use a context object. However, a context is far from an ideal solution. The best way to avoid problems is for variables in a context to be immutable.

Chapter 8 Pull Complexity Downwards

It is more important for a module to have a simple interface than a simple implementation.

Configuration parameters have become very popular in systems today; some systems have hundreds of them…you should avoid configuration parameters as much as possible. Before exporting a configuration parameter, ask yourself: “will users be able to determine a better value than we can determine here?” When you do create configuration parameters, see if you can compute reasonable defaults automatically, so users will only need to provide values under exceptional conditions.
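A small sketch of "compute reasonable defaults automatically". The function name and the cap of 32 are assumptions chosen for illustration. The module works out the value itself and accepts an override only for the exceptional case.

```python
import os
from typing import Optional


def choose_worker_count(configured: Optional[int] = None) -> int:
    """Return the worker pool size, computing a default when none is given."""
    if configured is not None:
        if configured < 1:
            raise ValueError("worker count must be at least 1")
        return configured
    # os.cpu_count() may return None when the count cannot be determined.
    cpus = os.cpu_count() or 1
    # Illustrative heuristic: scale with the machine, but cap the pool size.
    return min(32, cpus + 4)


default_workers = choose_worker_count()     # most callers: no parameter at all
pinned_workers = choose_worker_count(8)     # exceptional case: explicit override
```

Most callers never pass the parameter, so the complexity of picking a sensible value is pulled down into the module instead of being pushed onto every user.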

Chapter 9 Better Together Or Better Apart?

Separate general-purpose and special-purpose code.

Red Flag: Repetition
If the same piece of code appears over and over again, that’s a red flag that you haven’t found the right abstractions.
Red Flag: Special-General Mixture
This red flag occurs when a general-purpose mechanism also contains code specialized for a particular use of that mechanism. This makes the mechanism more complicated and creates information leakage between the mechanism and the particular use case: future modifications to the use case are likely to require changes to the underlying mechanism as well.
Red Flag: Conjoined Methods
It should be possible to understand each method independently. If you can’t understand the implementation of one method without also understanding the implementation of another, that’s a red flag. This red flag can occur in other contexts as well: if two pieces of code are physically separated, but each can only be understood by looking at the other, that is a red flag.

Splitting and joining methods. Length by itself is rarely a good reason for splitting up a method. Each method should do ONE THING and DO IT COMPLETELY.

Chapter 10 Define Errors Out of Existence

The key overall lesson from this chapter is to reduce the number of places where exceptions must be handled; in many cases the semantics of operations can be modified so that the normal behavior handles all situations and there is no exceptional condition to report.

Code that hasn’t been executed doesn’t work!

The error-ful approach may catch some bugs, but it also increases complexity, which results in other bugs. Overall, the best way to reduce bugs is to make software simpler.

Exception masking — an exceptional condition is detected and handled at a low level in the system, so that higher levels of software need not be aware of the condition.
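A sketch of what "modifying the semantics" can look like, using two small hypothetical functions. In each one the operation is defined so that the awkward input is part of normal behavior, leaving callers nothing to catch.

```python
from typing import Dict


def unset(variables: Dict[str, str], name: str) -> None:
    """Ensures that `name` does not exist afterwards. Never raises.

    Defined this way, "the variable was already missing" is not an error:
    the postcondition holds either way.
    """
    variables.pop(name, None)


def substring(text: str, begin: int, end: int) -> str:
    """Returns the characters of `text` with index >= begin and < end.

    Out-of-range indexes are clamped, so there is no exception to handle.
    """
    begin = max(0, begin)
    end = min(len(text), end)
    return text[begin:end] if begin < end else ""


env = {"PATH": "/usr/bin"}
unset(env, "PATH")
unset(env, "PATH")                  # second call is a no-op, not an error
head = substring("hello", -3, 2)    # "he"
tail = substring("hello", 3, 100)   # "lo"
```

The error-ful alternatives (raise KeyError on a missing variable, raise on an out-of-range index) force every caller to add handling code that is rarely executed and therefore rarely tested.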

Chapter 11 Design it Twice!

Eventually, everyone reaches a point where your first ideas are no longer good enough; if you want to get really great results, you have to consider a second possibility, or perhaps a third, no matter how smart you are.

Chapter 12 Why Write Comments? The Four Excuses

The process of writing comments, if done correctly, will actually improve a system’s design. Conversely, a good software design loses much of its value if it is poorly documented.

Myths

Good code is self-documenting. If users must read the code of a method in order to use it, then there is no abstraction.
I don’t have time to write comments. Good comments make a huge difference in the maintainability of software, so the effort spent on them will pay for itself quickly.
Comments get out of date and become misleading. Keeping docs up-to-date does not require an enormous effort.
All the comments I have seen are worthless. Writing solid documentation is not that hard, once you know how.

Benefits of well-written comments — the overall idea behind comments is to capture information that was in the mind of the designer but couldn’t be represented in the code.

Chapter 13 Comments Should Describe Things that Aren’t Obvious from the Code

Developers should be able to understand the abstraction provided by a module without reading any code other than its externally visible declarations.

After you have written a comment, ask yourself the following question: could someone who has never seen the code write the comment just by looking at the code next to the comment? If the answer is yes, then the comment doesn’t make the code any easier to understand.

Red Flag: Comment Repeats Code
If the information in a comment is already obvious from the code next to the comment, then the comment isn’t helpful. One example of this is when the comment uses the same words that make up the name of the thing it is describing.

Lower-level comments add precision. Comments augment the code by providing information at a different level of detail. Precision is most useful when commenting variable declarations.

Higher-level comments enhance intuition. Ask yourself: What is this code trying to do? What is the simplest thing you can say that explains everything in the code? What is the most important thing about this code?

If interface comments must also describe the implementation, then the class or method is shallow.

Red Flag: Implementation Documentation Contaminates Interface
This red flag occurs when interface documentation, such as that for a method, describes implementation details that aren’t needed in order to use the thing being documented.

Implementation comments: what and why, not how!

The goal of comments is to ensure that the structure and behavior of the system is obvious to readers, so they can quickly find the information they need and make modifications to the system with confidence that they will work.
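The different kinds of comments from this chapter can be seen side by side in a small sketch. RateLimiter is a hypothetical class written only to carry the comments. The class and method docstrings are interface comments (what a caller needs, no implementation), the comments on the fields add precision (units, ranges, boundary conditions), and the one comment inside the method says why, not how.

```python
from typing import Optional


class RateLimiter:
    """Decides whether a caller may proceed, allowing short bursts.

    Callers only need allow(); how permits are tracked is not part of
    the interface.
    """

    def __init__(self, capacity: int, refill_per_second: float) -> None:
        # Maximum number of permits that can accumulate; also the largest
        # burst that is allowed through at once. Expected to be >= 1.
        self._capacity = capacity
        # Permits added per second of elapsed time. May be fractional.
        self._refill_per_second = refill_per_second
        # Permits currently available, in the range [0, capacity].
        self._permits = float(capacity)
        # Timestamp (seconds) of the last call to allow(); None before the first.
        self._last_seen: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Returns True and consumes one permit if one is available at `now`.

        `now` is a timestamp in seconds and must not decrease between calls.
        """
        # Credit permits for the time that passed, so idle callers regain
        # their burst allowance instead of staying throttled.
        if self._last_seen is not None:
            elapsed = now - self._last_seen
            refilled = self._permits + elapsed * self._refill_per_second
            self._permits = min(self._capacity, refilled)
        self._last_seen = now
        if self._permits >= 1.0:
            self._permits -= 1.0
            return True
        return False
```

By contrast, a comment such as "# set last seen to now" above `self._last_seen = now` would repeat the code and fail the test of whether someone could write the comment just by looking at the line next to it.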

Chapter 14 Choosing Names

Good names have two properties: precision and consistency.

Red Flag: Vague Name
If a variable or method name is broad enough to refer to many different things, then it doesn’t convey much information to the developer and the underlying entity is more likely to be misused.
Red Flag: Hard to Pick Name
If it’s hard to find a simple name for a variable or method that creates a clear image of the underlying object, that’s a hint that the underlying object may not have a clean design.

Chapter 15 Write the Comments First — Use Comments As Part of The Design Process

Write the comments first

For a new class, write the class interface comment.
Write interface comment and signature for the most important public methods, but leave method bodies empty.
Iterate over the comments till the basic structure feels about right.
Write declarations and comments for the most important class instance variables in the class.
Finally, fill in the bodies of the methods, adding implementation comments as needed.
While writing method bodies, discover the need for additional methods and instance variables.
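The first four steps above might leave a file that looks like the following sketch. JobQueue and its methods are hypothetical. The comments and signatures are in place and can be reviewed as a design before any body is written; the placeholders are filled in during the last two steps.

```python
from typing import List, Optional


class JobQueue:
    """FIFO queue of job names waiting to run.

    Jobs come out in the order they were submitted. The queue holds at most
    `max_pending` jobs; it never drops a job silently.
    """

    def __init__(self, max_pending: int) -> None:
        # Upper bound on len(self._pending); submit() refuses jobs beyond it.
        self._max_pending = max_pending
        # Jobs not yet handed out, oldest first.
        self._pending: List[str] = []

    def submit(self, job: str) -> bool:
        """Adds `job` to the back of the queue.

        Returns False, leaving the queue unchanged, if the queue is full.
        """
        raise NotImplementedError  # body comes last, once the comments settle

    def next_job(self) -> Optional[str]:
        """Removes and returns the oldest job, or None if the queue is empty."""
        raise NotImplementedError  # body comes last, once the comments settle
```

If the interface comments are hard to write at this stage (for example, if it is unclear what submit() should do when the queue is full), that is a design question surfacing early, before any implementation has to be thrown away.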

Chapter 16 Modifying Existing Code

Stay strategic!

Ideally, when you have finished with each change, the system will have the structure it would have had if you had designed it from the start with that change in mind. To achieve this goal, you must resist the temptation to make a quick fix. Instead, think about whether the current system design is still the best one, in light of the desired change.
If you’re not making the design better, you are probably making it worse.
The best way to ensure that comments get updated is to position them close to the code they describe.
If information is already documented someplace outside your program, don’t repeat the documentation inside the program; just reference the external documentation.

Chapter 17 Consistency

Examples of consistency

Names.
Coding style.
Interfaces.
Design patterns.
Invariants.

Ensure consistency

Document.
Enforce.
When in Rome.. do as the Romans do. Don’t change existing conventions. Having a “better idea” is not a sufficient excuse to introduce inconsistencies. Your new idea may indeed be better, but the value of consistency over inconsistency is almost always greater than the value of one approach over another.

Chapter 18 Code Should be Obvious

Software should be designed for ease of reading, not ease of writing.

Chapter 19 Software Trends

One of the most important elements of agile development is the notion that development should be incremental and iterative. The increments of development should be abstractions, not features.

I am not a fan of test-driven development. The problem with test-driven development is that it focuses attention on getting specific features working, rather than finding the best solution.

Chapter 20 Designing for Performance

It’s tempting to rush off and start making performance tweaks, based on your intuitions about what is slow. DON’T DO THIS! Programmers’ intuitions about performance are unreliable.

Design the code around the critical path.

Chapter 21 Conclusion

If good design is an important goal for you, then the ideas in this book should make programming more fun. Design is a fascinating puzzle: how can a particular problem be solved with the simplest possible structure? It’s fun to explore different approaches, and it’s a great feeling to discover a solution that is both simple and powerful. A clean, simple, and obvious design is a beautiful thing.

The reward for being a good designer is that you get to spend a larger fraction of your time in the design phase, which is fun. Poor designers spend most of their time chasing bugs in complicated and brittle code. If you improve your design skills, not only will you produce higher quality software more quickly, but the software development process will be more enjoyable.

Summary of Design Principles

Here are the most important software design principles discussed in this book:

Complexity is incremental: you have to sweat the small stuff (see p. 11).
Working code isn’t enough (see p. 14).
Make continual small investments to improve system design (see p. 15).
Modules should be deep (see p. 22).
Interfaces should be designed to make the most common usage as simple as possible (see p. 27).
It’s more important for a module to have a simple interface than a simple implementation (see pp. 55, 71).
General-purpose modules are deeper (see p. 39).
Separate general-purpose and special-purpose code (see p. 62).
Different layers should have different abstractions (see p. 45).
Pull complexity downward (see p. 55).
Define errors (and special cases) out of existence (see p. 79).
Design it twice (see p. 91).
Comments should describe things that are not obvious from the code (see p. 101).
Software should be designed for ease of reading, not ease of writing (see p. 149).
The increments of software development should be abstractions, not features (see p. 154).

Summary of Red Flags

Here are a few of the most important red flags discussed in this book. The presence of any of these symptoms in a system suggests that there is a problem with the system design:

Shallow Module: the interface for a class or method isn’t much simpler than its implementation (see pp. 25, 110).
Information Leakage: a design decision is reflected in multiple modules (see p. 31).
Temporal Decomposition: the code structure is based on the order in which operations are executed, not on information hiding (see p. 32).
Overexposure: An API forces callers to be aware of rarely used features in order to use commonly used features (see p. 36).
Pass-Through Method: a method does almost nothing except pass its arguments to another method with a similar signature (see p. 46).
Repetition: a nontrivial piece of code is repeated over and over (see p. 62).
Special-General Mixture: special-purpose code is not cleanly separated from general purpose code (see p. 65).
Conjoined Methods: two methods have so many dependencies that it’s hard to understand the implementation of one without understanding the implementation of the other (see p. 72).
Comment Repeats Code: all of the information in a comment is immediately obvious from the code next to the comment (see p. 104).
Implementation Documentation Contaminates Interface: an interface comment describes implementation details not needed by users of the thing being documented (see p. 114).
Vague Name: the name of a variable or method is so imprecise that it doesn’t convey much useful information (see p. 123).
Hard to Pick Name: it is difficult to come up with a precise and intuitive name for an entity (see p. 131).
Hard to Describe: in order to be complete, the documentation for a variable or method must be long (see p. 148).
Nonobvious Code: the behavior or meaning of a piece of code cannot be understood