Shawn Ng

Data Science Journey



Here is my journey as a data analyst, software engineer, data scientist, and data engineer/architect.

To simplify things, I will use the term AI to cover machine learning (ML), natural language processing (NLP), computer vision (CV), and deep learning (DL).

Data Analyst

I think the job of a data analyst is to analyze and derive business insights from the data.

When I was working as a data analyst, the main tools I used were spreadsheets, business intelligence tools (Qlik, Tableau, PowerBI), and some programming.

Here are a few things you need to do to excel in this role:

  1. Be detail-oriented. You must be able to spot mistakes that others will miss.
  2. Understand the business objective. Produce content that will contribute to marketing or sales.
  3. Communicate clearly. Deliver the insights to the target audience in a clear, easy-to-understand format.

After two years of working as a data analyst, I had become decent at building monthly reports and dashboards. I had worked with data in various industries (mobile dating, a startup accelerator, logistics). I saw that the data did not differ much from one industry to the next, and I felt ready to take on a new challenge.

Software Engineer

Back then, I was focused on building my data visualization skills. I felt limited by the BI tools, so I decided to learn how to create visualizations using JavaScript. It seemed like the first step for me to transition into the world of software engineering (SE).

Without much background in SE, I had a steep learning curve. Since the job was to build data visualizations for the internal web dashboard, I decided to focus on front-end engineering.

SE is a means to an end, not an end in itself. I told myself not to go down the SE rabbit hole. Eventually, I realized that it's hard to focus on one area of SE and ignore the rest, as they are all connected. I have been learning about the different areas of SE ever since.

SE tips:

  1. Learn a language by building something
  2. Don't drift too far; focus on solving the task
  3. Learn by reading experts' code

After working as a Software Engineer who focused on building data visualizations (D3.js), I thought that my skillset was too niche. BI tools such as Tableau are sufficient for most companies to answer their data questions. It is not cost-effective for them to hire an engineer just to build dashboards and reports. I had to pivot into a more general role that is in higher demand. I knew that data science was my passion, so I got a job as a Data Scientist.

Data Scientist

In this job, I was both a Data Scientist and a consultant. The company I worked for provided ML software to help financial institutions with fraud detection. As the banks kept their data centers on-premises, I had to work inside the bank every day as a vendor.

As the second Data Scientist in a company with fewer than 15 employees, I got to work on different aspects of solution delivery. First, I would set up the equipment and install all the essential software required by the ML software. Then I would check whether the bank had provided the necessary data. Once that was verified, I would carry out data exploration, cleaning, and transformation, and train the ML models. Finally, when the results looked promising, we would present our proof of concept (POC) to the client. When a POC succeeded and became a production project, I would spend months in a single bank, working closely with the Software Engineers to iterate on and improve the ML software with new insights and outliers.

At the start, I enjoyed the job a lot, as I got to work with financial (big) data and the different data schemas used by the banks. But after four POCs, once I knew which data points to look for, it became routine. While I still discovered small insights sometimes, I felt that my rate of growth had slowed down. I felt like a DS generalist. I could do ML and NLP. DL was not required in this company, as the financial regulators did not accept black-box ML solutions. Specialization was not required either, as a simple solution worked well enough in most cases.

Understanding the limitations of AI, I only trust it for data exploration purposes. I will not bet my money or my life on it to make the final decision. I think only companies like Google have the resources (data, money, brains) to build a reliable general-purpose AI. For a startup, the only way to survive is to focus on building a specialized AI that does only one thing, but does it better than anything else on the market.

I see myself more as an executor than a researcher. Spending the majority of my time reading research papers to improve a model's accuracy by 0.1% is not what I want. I know that in the long term I want to start my own company, and for that SE will be more useful than DS. Hence, I looked for a job as a data engineer/architect, a role that requires both SE and DS skills.

Data Engineer / Architect

As always, in a small startup, many things are barely functioning. I have the opportunity to design and build the data architecture without obstacles since I am the only person doing it.

With my experience working on both the software and the data team, I can see the gap between the two sides.

To the software team, database design means making sure that the data is normalized, the relationships are defined, and the database is clean and fast.

However, to people who don't maintain the database, data normalization doesn't make sense. Why would you split the data into different tables instead of consolidating everything in one table?

The data team building an ML model needs all the available features as columns. What they need is denormalized data, the complete opposite of what the software team is producing.

A big company will usually build a data warehouse to avoid this problem. The software team can keep their normalized data in the database, while the data team gets their consolidated data from the data warehouse. Problem solved.

In a startup, with limited resources, a data warehouse is not the solution as it will incur extra hardware costs.

My solution is to build both normalized and denormalized versions of the same data inside a database, one version for each team. 
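As a rough illustration of the idea, here is a minimal sketch using Python's built-in sqlite3 module. The post does not say how the two versions are actually built, so everything here is an assumption: the `customers` and `orders` tables are hypothetical, and the denormalized version is exposed as a SQL view (`orders_flat`) rather than a separately stored copy.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding both versions of the same data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Normalized tables: what the software team works with (hypothetical schema).
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            country     TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
            amount      REAL NOT NULL
        );

        -- Denormalized version: one wide row per order, for the data team.
        CREATE VIEW orders_flat AS
        SELECT o.order_id,
               o.amount,
               c.customer_id,
               c.name    AS customer_name,
               c.country AS customer_country
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id;
        """
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Alice", "SG"), (2, "Bob", "MY")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5)],
    )
    conn.commit()
    return conn


conn = build_demo_db()
# The data team queries a single flat relation, with every feature as a column.
rows = conn.execute("SELECT * FROM orders_flat ORDER BY order_id").fetchall()
for row in rows:
    print(row)
```

A view keeps a single source of truth and costs no extra storage, but the join runs on every query. If reads become slow, the same SELECT could populate a real table that is refreshed on a schedule, at the price of keeping two copies in sync.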

It would not be a problem for me to rebuild it if necessary, since I am the creator. I am in charge of the data team. I have data scraping, pipelines, modeling, infrastructure, databases, APIs, and R&D projects to handle. With limited time, I know I can live with this compromise. I don't know if this startup will still be around next year, so I need to balance speed of execution against the scalability of my design.

Final Thoughts

At the moment, I am capable of building a production data-driven web application by myself. I am not an expert in a specific area such as machine learning, and I don't aspire to be one. I am interested in gaining more experience in leading a team, selling the product to customers, and learning how to scale a startup. I think "Head of Data Science" is the closest thing to what I have in mind.

If I worked at a multinational corporation (MNC), it would take me years to reach that kind of leadership role. Therefore, I lean towards an early-stage startup.

It is not all sunshine and roses. Startups come with their own set of challenges, including uncertainty and a high chance of failure.

However, nothing ventured, nothing gained.

