On Lean ML

An industrial perspective

May 08, 2024

It’s hard to overvalue the field of Industrial Engineering. A mechanical engineer steps back for a moment from his machine and can’t help but notice the “machine” that is the industrial process itself: the mode and flow of production. This machine is what you’ve been mistaking for “society” all your life: it’s how you consume, how you work, and what you are. In a sense, the industrial engineer is the greatest of all engineers since his machine is the greatest of all machines. The recent development of “MLOps” is just a slow (and often shortsighted) realization of this within our field.

The Machine

When thinking of the industrial revolution, it’s common to imagine huge factories of assembly-line workers producing identical products at a massive scale. What you’re imagining, however, is only a specific manner of production originating in 1920s American car manufacturing. The industrial revolution itself, at its core, is about the use of the machine: instead of burning calories to power your hands, you burn coal to power a tool. Cars in the 1890s, for example, were produced by a small, highly skilled workforce that had the expertise to do everything from design to machine operations [1]. No large factories, no assembly lines, no mass production. The production of the car was still a “craft”, and each produced car was different. Indeed, the concept of “mass production” is independent of machine use; it answers a broader question: how should people make things? 1920s American-style mass production is only one of the answers, and one that a large part of the world’s industry no longer adheres to, because other (better) answers have since been developed.

The production of ML

We don’t produce cars; we produce Machine Learning solutions. The key insight here, however, is that this difference is superficial. Much like a car, ML solutions consist of many different parts that must be fit together: from data, to models, to frontends. Much like the early production of cars, the early production of ML was undertaken by highly skilled experts who had to design, build, and assemble the entire product themselves as a craft. Finally, much like the development of mass production in car manufacturing, ML is now in a transition phase away from a craft era to a more scalable form of production, along with its division of labor: Data Engineers, Data Scientists, Machine Learning Engineers, Analytics Engineers, etc. How did this division of labor work out for the mass-producing car manufacturers? A very telling passage from a famous 1990s MIT study [2]:

“In this new system […] the newly emerging professional engineers had a direct climb up the career ladder. Unlike the skilled craftsman, however, their career paths didn’t lead toward ownership of a business. Nor did they lie within a single company, as Ford probably hoped. Rather, they would advance within their profession – from young engineer-trainee to senior engineer, who, by now possessing the entire body of knowledge of the profession, was in charge of coordinating engineers at lower levels. Reaching the pinnacle of the engineering profession often meant hopping from company to company over the course of one’s working life. As time went on and engineering branched into more and more subspecialties, these engineering professionals found they had more and more to say to their subspecialists and less and less to say to engineers with other expertise. As cars and trucks became ever more complicated, this minute division of labor within engineering would result in massive dysfunctions.”

Our dysfunctions

MLOps initially developed as an extension of DevOps to deal with the inherently non-deterministic nature of two things: data, and probabilistic models. For whatever reason, much MLOps talk these days gets caught up in already established practices and tools of software engineering: encapsulating your code and its dependencies into a Docker image, designing APIs to allow your ML model to interact with other software systems, automating the build and deployment processes through CI/CD pipelines, orchestrating these containerized applications across clusters of (virtual) machines using Kubernetes. Step back for a moment from your assembly line, little cog, and notice how trivial this all must seem. These matters should serve only as a starting point for the question: how should ML solutions be produced? Answering this question means asking what the different roles, responsibilities, processes, and workflows should be within your team and organization. Which parts of the production flow add which kind of value? Where are the bottlenecks? This is all language-, framework-, platform-, and infrastructure-agnostic.

Mass production of ML is vulnerable to the same dysfunctions as Ford’s mass production. You wonder if a startup would opt for separate data engineers, analytics engineers, ML engineers, platform engineers, data scientists, research scientists and data analysts, not to mention the project-management roles. You wonder if entire separate teams of engineers would be maintaining analytics platforms or overseeing model deployment and monitoring, all the while unfamiliar with both the data and the models. And most of all, you wonder how much the average client of the startup, interested only in the final product, would be willing to pay to keep this industrial machine up and running. In short, you wonder if this process can be made a bit more lean.

The environmental conditions of Lean

There is really only one way to understand the core concepts of Lean Manufacturing (and not be dragged into useless frameworks such as “Lean Six Sigma”), and that is by looking at the conditions under which it was forced to develop. Lean Manufacturing originates in post-WWII Japan and offers a third answer (next to crafts- and mass-production) as to how things should be made.

Japan was not suitable for mass production [3]:

- A lack of the capital needed to acquire the necessary large and specialized machines (not only due to the war, but also due to subsequent American credit restrictions aimed at reducing inflation [4]).
- Strong labor unions and no large immigrant workforce. This made the low-paying menial jobs of assembly-line workers in a mass production setting very unattractive.
- The domestic market was still small, and in need of a large variety of products. Inflexible mass production is suited to the exact opposite.
- Even if Japanese mass production were organized, it would not be able to compete with entrenched Western companies that would want to capture market share in Japan and defend their own markets from Japanese exports.

Every “principle” of Lean Manufacturing can be traced back to the need to adapt to these challenges. Where mass production uses expensive single-purpose machines designed by legions of engineers and operated by low-skilled workers, Lean uses highly flexible machines operated by multi-skilled workers who can often perform many tasks that would otherwise have required “specialists” in the mass production setting. Where mass production can only produce one highly standardized product, Lean can produce a large variety of products (made possible by the flexibility of the machines and the skills of their operators). Where mass production can’t afford to stop the line due to its obsession with volume, the smaller batches of Lean naturally lead to errors being caught earlier, and to an obsession with continuous improvement (referred to as “kaizen” in Japanese). In a sense, Lean combines the flexibility of craft production with the volume of mass production.

The initial visits of Taiichi Ohno (Toyota) to mass production factories in America (filled with all sorts of engineers and “specialists” to complete the full division of labor, relegating the assembly-line workers to nothing more than a necessary minimum of non-automatable labor) are the most telling of all:

“Ohno, who visited Detroit repeatedly just after the war, thought this whole system was rife with muda, the Japanese term for waste that encompasses wasted effort, materials, and time. He reasoned that none of the specialists beyond the assembly worker was actually adding any value to the car. What's more, Ohno thought that assembly workers could probably do most of the functions of the specialists and do them much better because of their direct acquaintance with conditions of the line...”

Lean ML

Lean ML is the application of the principles of lean manufacturing to the production of ML solutions:

Small teams of multi-skilled workers, using flexible yet automated tools to produce a large variety of high-quality ML solutions at a high volume.

Where to start? Lean starts with “value-stream mapping”. That is, map out the flow of ML production in your team, from initial concept to deployment, including processes, tools, and roles, each with concrete inputs and outputs. This is the end-to-end ML project lifecycle; this is the “machine” you should be obsessed with. Value-stream mapping seeks to identify which parts of this machine actually add value for the customer and which don’t, which parts are accelerators and which are bottlenecks. Of every part you can ask yourself: would the customer be willing to pay for this?
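To make the idea of a value-stream map a little more tangible, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Stage` fields, the stage names, the roles, and the hours are made up, and a real map would come from observing your own team. It only shows the bookkeeping: total lead time, the share of that time the customer would pay for, the slowest stage, and the number of handoffs between roles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step in the ML production flow (all fields are illustrative)."""
    name: str
    owner: str          # role responsible for the step
    inputs: tuple       # concrete artifacts consumed
    outputs: tuple      # concrete artifacts produced
    work_hours: float   # time spent actually working on the step
    wait_hours: float   # time the work sits idle (queues, handoffs)
    adds_value: bool    # would the customer be willing to pay for this?


def summarize(stages):
    """Return lead time, value-added ratio, bottleneck stage, and handoff count."""
    lead_time = sum(s.work_hours + s.wait_hours for s in stages)
    value_hours = sum(s.work_hours for s in stages if s.adds_value)
    bottleneck = max(stages, key=lambda s: s.work_hours + s.wait_hours)
    # A handoff happens whenever consecutive stages have different owners.
    handoffs = sum(1 for a, b in zip(stages, stages[1:]) if a.owner != b.owner)
    return {
        "lead_time_hours": lead_time,
        "value_added_ratio": value_hours / lead_time if lead_time else 0.0,
        "bottleneck": bottleneck.name,
        "handoffs": handoffs,
    }


# Hypothetical numbers for a hypothetical team.
stream = [
    Stage("ingest data", "data engineer", ("raw events",), ("feature table",), 8, 16, True),
    Stage("train model", "data scientist", ("feature table",), ("model artifact",), 16, 4, True),
    Stage("approval meeting", "project manager", ("model artifact",), ("sign-off",), 2, 40, False),
    Stage("deploy", "ml engineer", ("model artifact", "sign-off"), ("endpoint",), 4, 10, True),
]

summary = summarize(stream)
print(summary)
```

In this invented example most of the lead time is waiting, and every stage boundary is also a handoff between roles, which is exactly the kind of pattern a value-stream map is meant to surface.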

A good example of an AI lifecycle can be seen in De Silva and Alahakoon (2022) [5], where the lifecycle is divided into three phases: design, develop, deploy.

This can serve as a good starting point, but not much more than that. The blue blocks in their diagram are not actually saying anything; they are just naming, ordering, and grouping generic stages and concepts. We want to know what these processes will actually look like in our team. An easy example: how do you structurally monitor models in production (point 19 in the diagram)? What type of monitoring would the customer value, and which would be wasteful? If some monitoring alert goes off, what is the team’s process to deal with that? Whose responsibility is it to fix whatever went wrong with the model? Does this depend on the type of alert? Which tools are used in checking what’s wrong? A more difficult example: how can the “experimental” phase of model development be conducted in a transparent and reproducible manner? In asking yourself these questions, you will quickly find out how overspecialization can become a huge complicator in these matters, and how multi-skilled workers can greatly streamline the flow.
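As one possible way to turn the monitoring questions above into something concrete, the sketch below computes a Population Stability Index (PSI) between a reference sample and a live sample of one numeric feature, and attaches an owner and a first action to the resulting alert. The `RUNBOOK` table, the owner name, and the 0.2 threshold (a common rule of thumb, not a law) are all assumptions for illustration; which checks are worth running is precisely what the value-stream map should decide.

```python
import math


def psi(reference, live, bins=10):
    """Population Stability Index between two samples of a numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp so live values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical routing table: alert type -> (owner, first action).
RUNBOOK = {
    "input_drift": ("model owner", "compare live features against the training snapshot"),
}


def check_drift(reference, live, threshold=0.2):
    """Return an alert dict if PSI exceeds the threshold, else None."""
    score = psi(reference, live)
    if score <= threshold:
        return None
    owner, action = RUNBOOK["input_drift"]
    return {"type": "input_drift", "psi": score, "owner": owner, "action": action}


reference = [i / 100 for i in range(100)]      # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]  # live data squeezed into [0.5, 1)

print(check_drift(reference, reference))  # no drift against itself
print(check_drift(reference, shifted))    # drift alert with owner and action
```

The interesting part is not the statistic but the routing table: writing down who acts on which alert, and how, is where the process questions get answered.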

The “multi-skilled” worker

The multi-skilled worker is rooted in the valuing of human capital, a characteristic of Lean (again, only a consequence of its environment: strong unions and no immigrant workforce). The assembly-line worker is not seen as a disposable variable cost, but as the core contributor of value, and thus worthy of constant investment, no different from the continuous improvement of the machines and processes themselves.

What are the skills of the multi-skilled worker in ML and Data Science? This is a very straightforward question: Mathematics and Computer Science. These fields, of course, constantly overlap. For example, why is a join operation more difficult to distribute across a cluster of nodes than a filter operation? This comes up as a question about a SQL query or about distributed compute, but is actually a mathematical question related to an operation’s communication complexity [6]. Why will your neural net train faster when using CUDA kernels? This is again a mix of hardware design (GPUs) and the inherent parallelizability of linear algebra. Indeed, without this intersection of mathematics and computer science we would be what we once used to be, and now pretend we are not: statisticians.
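The join-versus-filter question can be illustrated with a toy model of a cluster, where each "node" is just a Python list of rows. This is a sketch under simplifying assumptions, not a formal communication-complexity argument: the function names are made up, both tables are assumed to be spread over the same number of nodes, and the join shown is a plain hash-partitioned (shuffle) join. A filter needs only the row in front of it, so no rows move; a join needs matching keys to meet on the same node, so rows have to travel.

```python
def distributed_filter(partitions, predicate):
    """Each node filters its own partition; no rows cross the network."""
    filtered = [[row for row in part if predicate(row)] for part in partitions]
    return filtered, 0  # zero rows moved


def shuffle_by_key(partitions, key, n_nodes):
    """Repartition rows so equal keys land on the same node; count rows that move."""
    out = [[] for _ in range(n_nodes)]
    moved = 0
    for node, part in enumerate(partitions):
        for row in part:
            target = hash(row[key]) % n_nodes
            if target != node:
                moved += 1
            out[target].append(row)
    return out, moved


def distributed_join(left, right, key):
    """Hash-partitioned inner join; assumes both sides use the same number of nodes."""
    n_nodes = len(left)
    left_s, moved_left = shuffle_by_key(left, key, n_nodes)
    right_s, moved_right = shuffle_by_key(right, key, n_nodes)
    joined = []
    for lpart, rpart in zip(left_s, right_s):
        index = {}
        for r in rpart:
            index.setdefault(r[key], []).append(r)
        # After the shuffle, each node can join its local rows independently.
        joined.append([{**l, **r} for l in lpart for r in index.get(l[key], [])])
    return joined, moved_left + moved_right


# Two hypothetical tables spread over two nodes.
orders = [
    [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 5}],
    [{"user_id": 3, "amount": 7}, {"user_id": 1, "amount": 2}],
]
users = [
    [{"user_id": 3, "name": "c"}],
    [{"user_id": 1, "name": "a"}, {"user_id": 2, "name": "b"}],
]

_, moved_by_filter = distributed_filter(orders, lambda r: r["amount"] > 5)
joined, moved_by_join = distributed_join(orders, users, "user_id")
print("rows moved by filter:", moved_by_filter)
print("rows moved by join:", moved_by_join)
```

Real engines have more tricks (broadcasting a small table, for instance), but the asymmetry between the two operations is the point.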

Notice how the choice of production paradigm not only changes quality, variety, and volume of the product, but also that of labor. It changes the type of work we perform, how we develop ourselves and what our career ladder might look like. It is, in short, much more worthy of study than any single tool.

Appendix (digressions)

Although somewhat off-topic, it’s good to emphasize the weight of the industrial revolution. For the past 150 years or so, world affairs have mostly been a game-theoretic expression of industrial competition. This topic merits its own article, but here I shall give a quick elaboration. We can use a publicly available (and I have to say impressively compiled) dataset of estimates of CO2 emissions since 1750 [7] to serve as a useful proxy for industrial activity over time. The quintessential example to look at is the industrialization of Germany in the late 19th century.

On the eve of WW1, in 1914, German industrial production seems to overtake that of the UK (where the industrial revolution once began), in large part due to Germany’s significantly larger population. This is a clear pattern of conflict throughout the 20th century: the industrialization of nations with a larger population is seen as a threat, often leading to some form of conflict. For example, this in turn also summarizes the German attitude, even before 1914, toward the imminent industrialization of the gigantic Russian population [8] (which indeed ended up being Germany’s undoing in the long run). A fun-to-read (but not to take too seriously) paper on game-theoretic models of the “balance of power” is by Niou and Ordeshook (1987) [9], where the authors specifically model differential growth rates among countries (players).

Sources

[1] Beare, D. (2018). Panhard & Levassor: Pioneers in Automobile Excellence. Amberley Publishing.

[2] Womack, J. P., Jones, D. T., & Roos, D. (2007). The Machine That Changed the World: The Story of Lean Production--Toyota's Secret Weapon in the Global Car Wars That Is Now Revolutionizing World Industry. Simon and Schuster.

[3] Ohno, T. (2019). Toyota Production System: Beyond Large-Scale Production. Productivity Press.

[4] Dower, J. W. (2000). Embracing Defeat: Japan in the Wake of World War II. W. W. Norton & Company.

[5] De Silva, D., & Alahakoon, D. (2022). An artificial intelligence life cycle: From conception to production. Patterns, 3(6).

[6] Yao, A. C. C. (1979, April). Some complexity questions related to distributive computing (preliminary report). In Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing (pp. 209-213).

[7] Ritchie, H., & Roser, M. (2020). CO₂ emissions. Published online at OurWorldInData.org. Retrieved from: https://ourworldindata.org/co2-emissions

[8] McMeekin, S. (2011). The Russian Origins of the First World War. Harvard University Press.

[9] Niou, E. M., & Ordeshook, P. C. (1987). Preventive War and the Balance of Power: A Game-Theoretic Approach. Journal of Conflict Resolution, 31(3), 387-419.

Abbaan’s Compression
