How to know when you’ve gone too far
We’ve all been there: we’re working on a project and it just isn’t quite coming together. The model we thought was going to get us there isn’t making the cut. So we get tempted to add little bits of complexity: “what if I customize this function just a little bit for this project?”
Other times, we’re starting up a project and brainstorming a path forward. We’re putting big ideas on the board. They’re all at the cutting edge of data science. Everyone is excited, but there’s that nagging feeling that maybe things have gone too far.
As data scientists, we love to be on the cutting edge. And the most successful data scientists I know love to tinker and improve — they want to be the cutting edge. That’s incredibly powerful. But there are hidden costs to tinkering, to always being on the cutting edge.
The best solutions weigh the benefits of being cutting-edge against the costs to reliability, scalability, interpretability and resource availability. The best solutions in data science take a considered approach to complexity at every decision point.
Neither of the situations above is inherently bad. In fact, they usually come from a place of passion to do the best work that we can. But that passion requires careful self-monitoring. The issue is that we tend to keep adding complexity and, as data scientists, we often ignore the cost of doing so. Why?
Because the benefit of complexity is almost always self-evident: we’re adding the module, the code, the sub-model for a reason that is right in front of us. It’s so real we can see, feel, taste and touch it. The cost, however, is often concealed, if only just a little bit. At best it’s our second thought; at worst, we ignore it entirely. So, as we build out a model, we end up tempted to layer complexity on top of complexity, cost on top of cost. When we give in, our model becomes an unmanageable mess.
The road to hell is paved with good intentions. Right?
It’s not an unfamiliar problem in other disciplines. Engineers have dealt with analogous problems for decades. In World War II, the US-designed Sherman tank used highly interchangeable parts. If one tank was knocked out, parts could be taken from another [1]. The Sherman was a relatively simple machine; if it was damaged in battle, it could be quickly repaired in the field [2].
On the other hand, Germany designed the Tiger. The Tiger was powerful, precise and complicated [2]. It was a tank as finely tuned as a Swiss watch. Allied forces deeply feared a fully operational Tiger. The problem was, how do you keep a Tiger running? Precision engineering means customization, of both parts and knowledge. If you want to fix a Tiger, you need Tiger-specific parts. If you want to fix a Tiger, you need Tiger-specific expertise.
So, Allied forces produced Shermans at scale and kept them in service continuously. The Tiger, on the other hand, cost six times as much as a Sherman, and battle damage often meant it had to be abandoned in the field [2].
The Sherman is the data science model built with little customization, its developer continually monitoring the costs of complexity. Its parts are recognizable, interpretable and fixable by our colleagues. It is built efficiently and gets the job done with the appropriate level of accuracy.
The Tiger is that data science model, so often built with great passion, whose developer was blind to the costs of complexity. It is an incredible machine, when it works. Its precision means it will break. When it breaks, only a very small group of people will be capable of fixing it, and they will do so at great effort.
But complexity isn’t bad in and of itself. As data scientists, our jobs entail some level of complexity. Complexity is a spectrum, and we must monitor where we are on that spectrum.
To keep complexity in check, I start a project by setting a complexity limit. This decision is guided by three primary questions:
1) “What is the minimum required accuracy of the model? What is the minimum complexity that can be used to achieve that?” We can’t be blind to the requirements of the project and therefore need to accept that the model will have some degree of complexity. This helps to set a starting point: “I know I need at least a medium level of complexity on this project.”
2) “Will this project require some sort of scaling?” If the project is going to scale, then I will be more stringent about the complexity limitations. Sacrificing reliability will be considered a very steep cost.
3) “What resources are available for the project?” The resources at my disposal further influence how complex I will let the project become. If I have a lot of time, systems or people available for the project, then I will be more willing to add complexity.
Then, in the design and development phases, I monitor my position along the complexity spectrum. Each time I add a new component or customization to a model I ask myself “did I just add: a simplification? no complexity? a little complexity? a lot of complexity?” This allows me to update my position on the complexity spectrum continuously.
By monitoring that position, I force myself to start making trade-offs. I start to think about how much room I have left on the spectrum and whether each piece of complexity is worth its costs. “Do I really need to add this customization? What are its costs? What are the alternatives?” The key here is that all complexity has a cost; the only question is the degree to which it is hidden.
So, the specific costs of complexity I monitor for are:
1) Reliability: “Is this change going to make the model fragile? Does it mean that a small change to the inputs could break it? Or that a small tweak to the code could cause it to crash?”
2) Scalability: “Does this change mean the model is going to become very task specific? Does it mean it will be hard to automate and apply to many different datasets?”
3) Interpretability: “Is this change something my colleagues will understand easily? Enough that they will be able to correct any issues without my help? Will I be able to understand the change when I come back to it in 1, 3 or 6 months?”
4) Resource requirements: “How much effort is required to make this change? Is it going to take a few minutes? Days? Weeks?”
This means that at each decision point I can ask myself, “Am I willing to move that far along my complexity spectrum, sacrificing that much, for the benefit I am seeing?” I’ve forced myself to bring the hidden consequences to light, so now I can honestly assess the situation.
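In practice, this bookkeeping can be as lightweight as a running log. Here is a minimal sketch of the kind of “complexity budget” the process above implies; the ComplexityBudget class, the scoring scale and the acceptance rule are just placeholders I’m assuming for illustration, not a formal tool. The only point is that every proposed change gets its hidden costs written down before it is accepted.

```python
# A toy "complexity budget" tracker. The 0-10 budget, the 0-3 cost scores and
# the acceptance rule below are illustrative assumptions; adjust to taste.
from dataclasses import dataclass, field

@dataclass
class ComplexityBudget:
    limit: int = 10                      # complexity limit set at project start
    spent: int = 0                       # current position on the spectrum
    log: list = field(default_factory=list)

    def propose(self, change: str, added_complexity: int,
                reliability: int, scalability: int,
                interpretability: int, resources: int) -> bool:
        """Record a proposed change and decide whether it fits the budget.

        Each cost argument is a rough 0-3 score: 0 = no cost, 3 = steep cost.
        """
        total_cost = reliability + scalability + interpretability + resources
        accept = (self.spent + added_complexity <= self.limit) and total_cost <= 6
        self.log.append((change, added_complexity, total_cost, accept))
        if accept:
            self.spent += added_complexity
        return accept

budget = ComplexityBudget(limit=6)       # a "medium complexity" project
budget.propose("custom loss function", added_complexity=2,
               reliability=1, scalability=2, interpretability=2, resources=1)
```

Even if the numbers are crude, writing them down forces the trade-off out into the open instead of leaving the cost as an afterthought.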
You can imagine this playing out in WWII era tank design rooms. It’s hard to think that the Sherman designers were any less passionate about their work than the Tiger designers. They probably would have loved to custom design each component to make it the best it possibly could be for their specific application.
“I know we want to use a common chassis, but if I just make this metal a couple of inches thicker the tank will be much stronger.” That thought must have crossed someone’s mind. But more importantly, the thought “but hold on, this little piece of complexity is going to come at a huge cost to the scalability and repairability of my design” must also have forced its way into their mind.
And isn’t that the good intention that so often paves the road to hell for us data scientists? “I know we want to use this function my team member built, but if I just make this regression non-linear my predictions will be much more accurate.” We all have those thoughts, constantly. They aren’t bad thoughts; they’re great ones. They are the embodiment of the curiosity that makes for a great data scientist.
But it is what happens next that matters. Do you plow forward, layering complexity after complexity, cost after cost until you wind up in data science hell? Or do you give yourself pause to find the hidden costs, weigh them against the benefits and only then move forward?
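For the non-linear regression temptation above, that pause can be as simple as measuring the gain before committing. Below is a rough sketch using a synthetic dataset and two scikit-learn models as stand-ins; the dataset, the model choices and the 2% improvement bar are assumptions for illustration, not a prescription.

```python
# Measure the accuracy benefit of a more complex model before paying its cost.
# The synthetic data, the two models and the 0.02 improvement bar are placeholders.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

simple = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()
boosted = cross_val_score(GradientBoostingRegressor(random_state=0), X, y,
                          cv=5, scoring="r2").mean()

gain = boosted - simple
print(f"linear R^2={simple:.3f}, boosted R^2={boosted:.3f}, gain={gain:.3f}")

# Only pay the complexity cost if the measured benefit clears the bar we set.
if gain < 0.02:
    print("Gain does not justify the extra complexity: keep the Sherman.")
```

If the measured gain doesn’t clear the bar you set up front, the simple model wins by default, and the hidden costs never get the chance to pile up.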
Get in touch
Contact me on LinkedIn for more perspective on the data science field!
[1] Jesse Beckett, The Sherman Tank — Beast or Bust?
[2] Seth Marshall, The Tiger and the Sherman: A Critical Look