Hi everyone, after some time off during this pandemic, I could finally finish this story. It was supposed to be my last story of 2021, but it turns out that writing takes more time and energy than I expected, so I decided to release it this year as my first story instead. Here are some notes on what I’ve learned so far:
- Build a pipeline as soon as possible (with careful decisions, of course, and this applies only to an early-stage data system). In my opinion, a good pipeline that exists is better than a great pipeline that does not. Use a cloud service or another tool that helps you build the data ecosystem. What is a data ecosystem? In my experience, the crucial part is separating transactional data from analytical data. Always remember the single version of the truth (SVOT).
- Create a template that users (data analysts and data scientists, DA and DS) can use to transform the data they need, especially for turning data warehouse tables into another form. I would say that in this phase, the DAs should own that data themselves. So the next step is to make sure our template is good to go.
- Our users are not only the data team. Product managers, the ops team, and even the legal team are our users too. So be prepared! Sometimes there are ad-hoc requests to help them handle business data flows outside the data team.
- Pipelining is not just about ingestion from DB “A” to DB “B”. This is where data quality comes in. How do we measure the quality of the data? Well, we need metrics for each database.
- We do not just create a pipeline and leave it running. Create a metric to check that the pipeline is still in good shape, at the very least that each run still finishes within its SLA time.
- Machine learning knowledge is a must. It is not about building an ML product from zero to one, but about supporting data analysts and data scientists. We need to understand our stack and help them deliver their services in our data environment.
- Be prepared for service costs, and create an alert for them if cost is a concern.
- Collaboration inside the data team should be intense. We need to deliver good data. Separate data roles according to their duties and business units.
- Last but not least, data privacy and data security. If we cannot govern our data in the first phase, at least make sure that the data is not accessible from outside the company. After a few improvements, once the environment is more mature, we can add some SOPs for how colleagues access the data. Make this a top priority once the data becomes huge and diverse.
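
To make the data quality and SLA notes above a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration only: the function names (`null_rate`, `duplicate_rate`, `within_sla`), the sample rows, and the two-hour SLA are all made up and not tied to any particular stack. In a real setup these checks would probably run against the warehouse itself and feed an alert.

```python
from datetime import datetime, timedelta


def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def duplicate_rate(rows, key):
    """Fraction of rows whose `key` value already appeared in an earlier row."""
    if not rows:
        return 0.0
    seen = set()
    duplicates = 0
    for row in rows:
        value = row.get(key)
        if value in seen:
            duplicates += 1
        else:
            seen.add(value)
    return duplicates / len(rows)


def within_sla(started_at, finished_at, sla):
    """True if the pipeline run finished within the agreed SLA duration."""
    return (finished_at - started_at) <= sla


# Hypothetical sample of rows landed by one pipeline run.
rows = [
    {"order_id": 1, "amount": 100},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 250},
    {"order_id": 3, "amount": 75},
]

print(null_rate(rows, "amount"))         # 0.25
print(duplicate_rate(rows, "order_id"))  # 0.25

# Hypothetical run timestamps and a hypothetical two-hour SLA.
started = datetime(2022, 1, 10, 1, 0)
finished = datetime(2022, 1, 10, 2, 30)
print(within_sla(started, finished, timedelta(hours=2)))  # True
```

The idea is simply that each metric returns a number or a boolean that can be compared against a threshold, so the same pattern could be reused for a cost alert if cost becomes a concern.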
I guess that’s all I can share in this story. See ya in another story!