APAC Startups Analysis

9th June 2019
Google maps: APAC

Project background

The objective of this data science project is to discover early-stage (<= Series A) Asia Pacific (APAC) startups with the potential to become unicorns (> $1 billion USD valuation).

Tech objectives:

  • To scrape data from an infinite-scrolling website
  • To build interactive data visualizations

I am a globalist. Before I look beyond APAC, I want to have an overview of my surroundings.

I am interested in startups that:

  1. Build a unique product or service
  2. Consist of first-class executors
  3. Solve near mission-impossible challenges on the path to becoming a unicorn
  4. Are not an "uber" of something

Tech in Asia (TIA) is a Singapore startup and Y Combinator alumnus. It is the Asian version of TechCrunch and Crunchbase.

TIA plays a part in helping the Asian tech scene, so I will share neither the data nor the method I used to get it 🙂

Extract, Transform, Load (ETL)

From the data, I managed to reverse engineer parts of the TIA database schema.

TIA has clearly invested a decent amount of time in compiling its company data, as this is the cleanest data I have scraped.

TIA updates the data regularly. The number of companies increased from 57306 to 57620 between 2019-05-26 and 2019-06-01.

The data is not perfect, though: I do see some problems in it.

This process also let me gauge the strength of TIA's data and software engineers.
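
To put those two snapshots in perspective, the growth can be worked out in a few lines. This is only a back-of-the-envelope sketch using the two counts and dates quoted above; the variable names are my own.

```python
from datetime import date

# The two snapshot counts and dates quoted in the text.
first_count, second_count = 57306, 57620
first_day, second_day = date(2019, 5, 26), date(2019, 6, 1)

added = second_count - first_count          # companies added between snapshots
days = (second_day - first_day).days        # days between snapshots
per_day = added / days                      # average additions per day
growth_pct = 100 * added / first_count      # relative growth in percent

print(f"{added} companies added in {days} days "
      f"(~{per_day:.1f} per day, {growth_pct:.2f}% growth)")
```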

Choosing data visualization tools

#1. D3.js

D3 is a JS library for manipulating documents based on data. It combines powerful visualization components with a data-driven approach to DOM manipulation.

D3 is the de facto standard for building complex data visualizations on the web.

Usually, engineers use a backend language (Python, Ruby, Java) for the ETL process and pass the finalized data to the frontend. With D3, you can do ETL directly at the frontend.

D3 is a low-level JS library with a steep learning curve. To fully utilize D3, you need:

  1. Mastery of data structures: You will be dealing with deeply nested data (array => object => array => object)
  2. Mastery of JS: You will need to write custom JS code for ETL purposes

The data I would need to feed to D3 would bloat my web app. My D3 knowledge is also outdated: I know v3, while the current version is v5. I have neither the time nor the interest at this moment to update it.
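
To illustrate what "array => object => array => object" means in practice, here is a minimal sketch of flattening such a structure into chart-ready rows. It is written in Python rather than JS to match the tool I ended up choosing, and the field names (`name`, `country`, `rounds`, `series`, `usd`) and values are hypothetical, not TIA's schema.

```python
from typing import Any, Dict, Iterator, List

# Hypothetical nested payload: array => object => array => object.
companies: List[Dict[str, Any]] = [
    {
        "name": "Acme",
        "country": "SG",
        "rounds": [
            {"series": "Seed", "usd": 500_000},
            {"series": "Series A", "usd": 4_000_000},
        ],
    },
    {"name": "Bolt", "country": "ID", "rounds": []},
]


def flatten_rounds(records: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per (company, funding round) pair."""
    for company in records:
        for funding_round in company.get("rounds", []):
            yield {
                "name": company["name"],
                "country": company["country"],
                "series": funding_round["series"],
                "usd": funding_round["usd"],
            }


rows = list(flatten_rounds(companies))
for row in rows:
    print(row)
```

Companies without any funding round simply produce no rows here; whether to keep them with empty values depends on the chart.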

#2. R

R is a language and environment for statistical computing and graphics.

I used R extensively when I was pursuing my bachelor's degree in Statistics, but I have completely forgotten it 😉. It would take me an hour or two to pick it back up if I wished.

R is good for academia and researchers, but it is too limited for someone who is both a Data Scientist and a Software Engineer. Using Python, I can do everything that R does and more.

#3. Qlik

Based on my experience using Qlik (in 2015), the performance was horrible even if you only synced specific tables from the database.

The charts available are limited and not modern.

#4. Tableau

Similar problems to Qlik.

I'm not going to pay for a BI tool unless:

  1. It has proprietary, out-of-this-world data visualization
  2. It builds a complex chart faster than I can code => which I doubt 😉

My chosen tool: Bokeh

Bokeh is a Python interactive visualization library that targets modern web browsers for presentation.

It was a tough decision choosing between Bokeh and Plotly. I chose Bokeh because it has a stronger community.

Startups countries - Choropleth map

Please watch this video to understand how to interact with this visualization:

This choropleth map is zoomed in on Singapore by default. I do this because:

Grey regions on the map are countries for which I do not have data. Please click on the wheel-zoom button for the mouse scroll to work. Where you scroll the mouse (over the x-axis or the y-axis) affects the zoom, so be sure to place the mouse over the correct axis.
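
The coloring logic behind a choropleth like this can be sketched as follows. This is an illustration only: the palette, the bin edges, and the country counts are all made up, and only the "grey means no data" rule comes from the map described above.

```python
from typing import Dict, List, Optional

NO_DATA_COLOR = "#d9d9d9"                      # grey: no data for this country
PALETTE = ["#deebf7", "#9ecae1", "#3182bd"]    # light to dark blue (assumed)
THRESHOLDS = [10, 100]                         # hypothetical bin edges


def color_for(count: Optional[int]) -> str:
    """Map a startup count to a fill color; None means no data."""
    if count is None:
        return NO_DATA_COLOR
    for index, edge in enumerate(THRESHOLDS):
        if count < edge:
            return PALETTE[index]
    return PALETTE[-1]


# Made-up counts keyed by country code; "FR" is deliberately missing.
startup_counts: Dict[str, int] = {"SG": 250, "ID": 40, "NZ": 3}
map_countries: List[str] = ["SG", "ID", "NZ", "FR"]

colors = {code: color_for(startup_counts.get(code)) for code in map_countries}
print(colors)
```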


Funding series distribution per year - Grouped bar chart

From here onwards, I am only using data on startups from APAC countries.

Video:

The year used is the year in which each of these APAC startups was founded.
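
A grouped bar chart of this kind needs one count per (founding year, funding series) pair. The sketch below shows one way to build that with the standard library; the records and field names are invented for illustration and are not the scraped data.

```python
from collections import Counter

# Invented sample records; "founded" and "series" are hypothetical field names.
startups = [
    {"name": "A", "founded": 2014, "series": "Seed"},
    {"name": "B", "founded": 2015, "series": "Seed"},
    {"name": "C", "founded": 2015, "series": "Series A"},
    {"name": "D", "founded": 2015, "series": "Seed"},
]

# Count startups per (year, series). Years are stored as strings because
# Bokeh's nested categorical axes expect tuples of strings as factors.
counts = Counter((str(s["founded"]), s["series"]) for s in startups)

factors = sorted(counts)                  # x positions: (group, subgroup)
tops = [counts[f] for f in factors]       # bar heights, aligned with factors

for factor, top in zip(factors, tops):
    print(factor, top)
```

In Bokeh, a list of tuples like `factors` can be passed to `FactorRange(*factors)` as the `x_range` of a figure, with `tops` as the bar heights.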


Observations:

  • 2015 has the highest number of startups. Interestingly, I am not sure what makes 2015 unique
  • The number of startups decreases after 2015 across all funding stages

Funding series distribution per category - Grouped bar chart

Video:

This visualization helps to identify outlier startups that raised an unusual amount of funding for their series.


In the seed round, 48 startups raised between $10 million and $100 million USD, and 1 startup raised between $100 million and $500 million USD. There are a few possible explanations:

  1. TIA gathered the wrong data
  2. The number is not in USD
  3. These startups are truly outstanding

If these startups are that outstanding, getting their shares will instantly make you a millionaire on paper.
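
Spotting such outliers comes down to bucketing each round by amount and flagging seed rounds that land in an unusually high bucket. The sketch below assumes bucket edges of my own choosing (only the $10-$100M and $100-$500M ranges appear above) and uses made-up rounds.

```python
from bisect import bisect_right

# Bucket edges in USD. Only the $10-$100M and $100-$500M ranges come from
# the text; the other edges and labels are assumptions for illustration.
EDGES = [1_000_000, 10_000_000, 100_000_000, 500_000_000]
LABELS = ["<$1M", "$1-$10M", "$10-$100M", "$100-$500M", ">$500M"]

SEED_OUTLIER_USD = 10_000_000  # assumed cut-off for an unusual seed round


def bucket(usd: float) -> str:
    """Return the label of the funding bucket that contains `usd`."""
    return LABELS[bisect_right(EDGES, usd)]


# Made-up funding rounds.
rounds = [
    {"name": "A", "series": "Seed", "usd": 750_000},
    {"name": "B", "series": "Seed", "usd": 25_000_000},
    {"name": "C", "series": "Series A", "usd": 25_000_000},
    {"name": "D", "series": "Seed", "usd": 150_000_000},
]

suspects = [
    (r["name"], bucket(r["usd"]))
    for r in rounds
    if r["series"] == "Seed" and r["usd"] >= SEED_OUTLIER_USD
]
print(suspects)
```

Each flagged name still needs a manual check against the three explanations above, since the code cannot tell a currency mix-up from a genuinely exceptional round.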

Industries distribution per year - Stacked bar chart

For this chart, I only use founding years from 2000 onwards. The earliest year I have seen in this data is 1804.

Video:

The idea is to see if we can observe any trends across startup industries. Perhaps in late 2017 and early 2018, we see more AI startups due to the hype.
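
The preparation for a stacked chart like this is a year filter followed by a per-year count of industries. Here is a minimal sketch with invented records; the year-2000 cut-off is the only detail taken from the text.

```python
from collections import Counter, defaultdict

MIN_YEAR = 2000  # cut-off used for this chart

# Invented sample records; field names are hypothetical.
startups = [
    {"founded": 1804, "industry": "Trading"},
    {"founded": 2017, "industry": "AI"},
    {"founded": 2017, "industry": "Fintech"},
    {"founded": 2018, "industry": "AI"},
    {"founded": 2018, "industry": "AI"},
]

# Count industries per founding year, skipping years before the cut-off.
per_year = defaultdict(Counter)
for s in startups:
    if s["founded"] >= MIN_YEAR:
        per_year[s["founded"]][s["industry"]] += 1

years = sorted(per_year)
industries = sorted({name for c in per_year.values() for name in c})

# One list of counts per industry, aligned with `years`: the column layout
# that stacked bar helpers such as Bokeh's vbar_stack work with.
stacks = {name: [per_year[y][name] for y in years] for name in industries}
print(years, stacks)
```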


I think industry information is not that critical for entrepreneurs. After all, you wouldn't change your industry just because it's less popular now.

Final thoughts

I could do more with the data, such as social media scoring and APAC VC analysis, but for this article, I decided to focus solely on the startups.

Now, I have a list of startups to keep an eye on.

In time, when I become a world-class executor, I would like to get a piece of you 😉