How to Avoid Firefighting: Data Engineer Edition
Firefighting is a metaphor for when Data Engineers, or IT people in general, are called in to fix breaking changes. So when you are informed that there is a system-breaking bug, you need to put out the fire by fixing said bug.
Frankly, it's unavoidable and to be expected. There are so many caveats to consider, and given the nature of our work, where we are often placed on projects and learn the tech stack on the fly, it's something we do regularly.
This is especially true when we develop ETL pipelines for complex data that have close interdependencies with other entities. However, we can take steps to make our lives easier. It is akin to cars that need regular preventive maintenance to avoid nasty situations on the road.
If you take these steps every time you start or modify your pipelines, you can save yourself a lot of trouble later on. Putting out metaphorical fires is highly stressful, given that you will be pressured to diagnose the problem, come up with a solution, and implement that solution in a very short amount of time.
You might be sick of hearing things like, "Oh no, something's broken. We need this thing fixed or else so-and-so is going to stop working as expected. Can we get this fixed ASAP? What is the estimated amount of time you need?"
Oftentimes managers will ask how much time you need to fix the problem. You'll give your estimate, completely unsure because you haven't even taken a look at the situation yet, and then they'll say something like, "Okay, that's a bit too long. The client needs this ASAP. Can you have it done sooner?"
Then, once you've put out that fire, another fire spawns. In fact, several other fires may spawn, and the cycle continues. It's an endless cycle that sucks you in and bleeds you dry. However, it doesn't always have to be like that.
Step One: Make Sure That The Client Requirements Are Crystal Clear
This is a common pitfall. Usually, clients aren't sure of important aspects of their data such as the structure, logic, and interdependencies. This is normal because these business people are paying us tech people to make sense of their spaghetti data in the first place.
However, you need to have a very thorough discussion with them during the data exploration phase. Describe to them what their data is and how it affects other parts of the system, and give them a suggested approach to fixing it.
Make sure to catch inconsistencies in the structure, formatting, and values. These are the things that will make your pipeline wonky down the line. If data cleaning is within scope, that's very good: you will have the time to clean up their data and spot potential data issues early on in the project. However, if data cleaning is out of scope (and it often is), you need to spot the data issues as early as you can so that the client can take action and provide you with the correct expected data, or at the very least, rules for how to deal with the problems.
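To make this concrete, here is a minimal exploration pass in Python with pandas. The file and column names (client_extract.csv, order_date) are hypothetical stand-ins for whatever the client hands you; treat this as a sketch of the kind of profiling worth doing, not a prescribed tool.

```python
import pandas as pd

# Load a hypothetical client extract; substitute your actual source.
df = pd.read_csv("client_extract.csv")

# Structure: are the columns and inferred types what the client described?
print(df.dtypes)

# Values: quick ranges, counts, and null tallies to spot oddities early.
print(df.describe(include="all"))
print(df.isna().sum())

# Formatting: flag rows whose dates don't parse under the agreed format.
parsed = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
print(f"{parsed.isna().sum()} rows with a malformed order_date")
```

Anything that surfaces here becomes a concrete question to raise with the client during exploration, rather than a surprise mid-pipeline.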
Step Two: Understand the Caveats in the Services You'll Be Using
This is more the responsibility of the architect; however, you also need to know the services you will be using intimately. Many times, things get lost in translation. For example, if you are used to working with certain databases and then have to work on a project that uses a database you've never touched before, you need to know the differences between what you're used to and the new database you'll be using.
Things like Primary Key constraints, sharding, materialized views, and more are common concepts that you will find across the board, no matter the database. However, there are little nuances that are often overlooked. For example, did you know that AWS Redshift does NOT enforce Primary Key constraints? That means if you are working on Redshift, you need to include SQL statements in your stored procedures that enforce those constraints yourself.
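As a rough illustration, here is the kind of check that has to live in your own SQL on Redshift. The table and column names (analytics.orders, order_id) are hypothetical, and the statements are a sketch to run through whatever driver your project uses, not a drop-in stored procedure.

```python
# Detect duplicate "primary keys": Redshift will have accepted them silently.
FIND_DUPLICATE_KEYS = """
SELECT order_id, COUNT(*) AS copies
FROM analytics.orders
GROUP BY order_id
HAVING COUNT(*) > 1;
"""

# One classic Redshift dedup pattern: rebuild the table without duplicates
# inside a transaction, since there is no row id to delete by. Note that
# DISTINCT only collapses fully identical rows; if duplicate keys carry
# differing values, pick winners with ROW_NUMBER() instead.
DEDUPLICATE = """
BEGIN;
CREATE TEMP TABLE orders_dedup AS SELECT DISTINCT * FROM analytics.orders;
DELETE FROM analytics.orders;
INSERT INTO analytics.orders SELECT * FROM orders_dedup;
COMMIT;
"""
```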
Another example: if you were to use AWS Glue for ETL processing, you would need to understand Glue's inherent behavior. By default, Glue jobs are not allowed to run concurrently; that is, only one instance of a job may run at a time. Therefore, if you are expecting streamed data, you need to process it in such a way that there are no simultaneous executions, lest your pipeline completely stops on that one failed Glue job run.
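A defensive guard might look like the sketch below, which uses boto3 to check for an in-flight run before starting another. The job name is hypothetical, and this assumes the job keeps Glue's default maximum concurrency of 1; the check is inherently racy, which is why it also catches ConcurrentRunsExceededException on start.

```python
import boto3

glue = boto3.client("glue")

def start_if_idle(job_name: str) -> bool:
    """Start a Glue job only if no instance of it is already in flight."""
    runs = glue.get_job_runs(JobName=job_name, MaxResults=10)["JobRuns"]
    in_flight = [r for r in runs
                 if r["JobRunState"] in ("STARTING", "RUNNING", "STOPPING")]
    if in_flight:
        # Starting now would exceed the concurrency limit; defer instead
        # (e.g., requeue the triggering event for a later attempt).
        return False
    try:
        glue.start_job_run(JobName=job_name)
    except glue.exceptions.ConcurrentRunsExceededException:
        return False  # lost the race to another trigger
    return True
```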
It's these seemingly little things that we overlook that cause big issues downstream. Sometimes, it even warrants a slight re-architecture, which is a huge time sink because you would undergo the following process:
- Spot the issue.
- Understand the issue.
- Think of a solution.
- Convince your team that there is indeed an issue and you have a solution.
- Spend time with your team going back-and-forth on other solutions.
- Some members will demand a POC of your solution as proof that it works.
- After that, go on and finally implement your solution.
- Enable your teammates to implement the solution on their pipelines or other parts of the system.
- The team spends time reworking other pipelines to follow the solution.
So it's best to skip all that by understanding, deeply and intimately, what you are working with. This is easier said than done, considering that most times you are pressured with immediately pressing tasks instead of given time to truly enable yourself for the project ahead.
Step Three: Squeeze in Basic Checks Even if it Feels Like You Have No Time
Ah yes, tunnel vision. Your managers and clients are pressuring you to check off every task in the list correctly and ASAP. This is a very classic phenomenon where the team gets tunnel vision and focuses on a set of goals, leaving little room for basic checks.
But trust me, put in that overtime, even if it may be unpaid, to do those basic checks, because doing them early on will prevent huge metaphorical forest fires downstream. In the end, you are accountable for every little and big mistake. So if, for example, you find out late that there are duplicates in your data and you have to trace and remove them, it's a bad look for you. Basic checks were forgotten. What kind of a data engineer are you? How much are we paying you again?
Sarcasm aside, I myself fall prey to this often. Because of the pressure to solve pressing issues, I always forget to do basic checks. I think blaming collective tunnel vision is valid because that is exactly what stops you from doing basic checks in the first place. When you're working upwards of 12 hours a day with people breathing down your neck, it is very hard to remember to do basic things.
However, stakeholders and managers do not care. In their eyes, you made a basic mistake with big consequences. So, let's protect ourselves from that situation.
Do the following basic checks at every little addition, modification, and deletion:
- Check for duplicates. Please, check for duplicates. Duplicates are nearly always bad business UNLESS they are the expected behavior. Whenever you load data, always do a quick query to check if there are any duplicate rows (see the sketch after this list). If you do find them, understand why they appeared. Then, think of a solution. It could be that whatever mechanism your team has put in place to handle duplicates is not working correctly, or that you don't have a duplicate-prevention mechanism at all.
- Check for null values. Nulls are inevitable, especially with inherently dirty data. So what you need to do is find every null value, check the raw data to see if those values really are supposed to be null, and if not, understand why they occurred. Is it because you did a join and those rows did not satisfy the join condition? Or is there missing data?
- Check that the data is accurate and complete. Whenever possible, put your processed data and raw data side by side. First, do a row count: the counts should match. If they don't, the difference might be a legitimate consequence of processing, so understand why and find a good justification; if you can't find one, something is amiss. Next, check that the values are consistent. Do you think a column called Barcode should have values that look like 123.123? Nope. Do you think the average units sold could possibly be negative? Nope. That means something is wrong. Always triple-check these things before moving forward.
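Here is what that quick pass might look like in Python with pandas. The file and column names (raw_orders.csv, order_id, barcode, units_sold) are hypothetical; in a warehouse, these would be the equivalent SQL queries.

```python
import pandas as pd

# Hypothetical raw and processed extracts; substitute your own sources.
raw = pd.read_csv("raw_orders.csv")
out = pd.read_csv("processed_orders.csv")

# 1) Duplicates: any repeated keys that shouldn't be there?
dupes = out[out.duplicated(subset=["order_id"], keep=False)]
assert dupes.empty, f"{len(dupes)} duplicate rows on order_id"

# 2) Nulls: which columns picked up nulls, and are they in the raw data too?
print(out.isna().sum())

# 3) Completeness: row counts should match unless processing justifies it.
assert len(out) == len(raw), f"raw={len(raw)} vs processed={len(out)}"

# 4) Accuracy: cheap value sanity checks on a couple of columns.
assert (out["units_sold"] >= 0).all(), "negative units_sold found"
assert out["barcode"].astype(str).str.fullmatch(r"\d+").all(), "non-numeric barcodes"
```

A handful of asserts like these, run on every addition, modification, and deletion, costs minutes and catches the exact mistakes that otherwise surface as fires weeks later.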
Conclusion
These are three concepts that will ultimately protect you. As developers, we will get blamed for every little thing. So let's limit the number of things we get blamed for, and consequently have to fix. If we take our own personal steps, outside of the holy task list that stakeholders are hyper-focused on, we might don the metaphorical firefighting uniform less than usual. Ultimately, it will allow us to do better work, be more productive, and spend less overtime firefighting.