
Data Science without the Perfect Data: How to make progress in the face of uncertainty

How to creatively address Data Science challenges, emphasizing the use of proxy and synthetic data when ideal datasets are unavailable.

Shoeb Hosain

As the Program Director of the Master of Management in Analytics at McGill Desautels, I have the privilege of leading a course titled "Analytics and Solution Consulting."

This course aims to assemble a diverse team of students with expertise in Quantitative Methods, Technological Automation, and Business Strategy.

Our objective is to leverage these skills to address Data Science-related challenges for clients from various industries.

The students take the lead, while I, along with industry coaches, offer guidance and support from behind the scenes.

Before the consulting engagement begins, I like to share some important insights with my students. One key point I emphasize is the need to be prepared for uncertainty and variability in the internal client data they will be using. I remind them that they might not always receive an ideal dataset, and thus, they must be ready to apply their creativity to develop effective solutions.

I then introduce them to the five-stage priority order for building a dataset that I have refined over the years:

  1. Use internal company data with high veracity.
  2. Use competitor or industry-relevant data as a proxy with scaling.
  3. Use internal company data with low veracity.
  4. Use publicly available paid data that is contextually relevant as a proxy.
  5. Use publicly available free data that is contextually relevant as a proxy with scaling.
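To make "proxy with scaling" in stages 2 and 5 more concrete, here is a minimal Python sketch. It assumes the simplest possible approach, a single ratio that rescales the proxy series to the company's known size. The function name, the figures, and the choice of ratio scaling are all hypothetical illustrations, not a method prescribed in the course; real engagements may need segment-level or time-varying adjustments.

```python
def scale_proxy(proxy_values, proxy_total, company_total):
    """Rescale a proxy series so its overall level matches the company's.

    Assumption: the company's pattern follows the proxy's shape and differs
    only in size, so one ratio (company_total / proxy_total) is enough.
    """
    if proxy_total <= 0:
        raise ValueError("proxy_total must be positive")
    factor = company_total / proxy_total
    return [value * factor for value in proxy_values]


# Hypothetical figures: quarterly industry sales used as a proxy, plus the
# one number the client can state with confidence (its own annual total).
industry_quarterly_sales = [120.0, 150.0, 130.0, 200.0]
scaled = scale_proxy(
    industry_quarterly_sales,
    proxy_total=sum(industry_quarterly_sales),  # 600.0
    company_total=300.0,
)
print(scaled)
```

The scaled series keeps the seasonal shape of the proxy while matching the client's overall level, which is often enough to start prototyping before better internal data arrives.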

If these options are exhausted, I encourage my students to consider a more unconventional approach. I tell them:
"When all else fails, consider building a dataset through the Synthetic Data Generation process."

This idea often raises some eyebrows, and students frequently ask:

"Is this approach acceptable?"
"Isn't the point of Data Science to work with real numbers?"
"Will the client be receptive if we suggest this?"

I let these questions linger, allowing them to ponder the concept.

Interestingly, industry professionals often respond with similar skepticism.

Many believe, "This won't provide any value..."

However, the point being overlooked is that Synthetic Data is not a replacement for real data but can be an essential bridge to achieving robust Data Science solutions.

Reaching the highest stages of data quality and quantity typically takes years. The common reaction is to wait, which can result in missing crucial opportunities and delaying value creation. Instead, I advocate for making incremental gains using proxy or synthetic data to enhance the likelihood of future success.

Data generation can be a valuable tool in a data scientist's arsenal. It is important to remember that it should be used as a temporary measure while working towards collecting comprehensive, real-world data.
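As one illustration of what a Synthetic Data Generation step might look like, the sketch below draws records from assumed distributions using only Python's standard library. Everything specific here is hypothetical: the field names, the region weights, and the log-normal parameters are placeholders that would, in practice, come from domain experts or from whatever small sample of real data is available.

```python
import random


def generate_synthetic_orders(n, seed=42):
    """Generate n synthetic order records from assumed distributions."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    regions = ["North", "South", "East", "West"]
    region_weights = [0.4, 0.3, 0.2, 0.1]  # assumed share of orders per region
    rows = []
    for order_id in range(1, n + 1):
        rows.append({
            "order_id": order_id,
            "region": rng.choices(regions, weights=region_weights, k=1)[0],
            # Assumed right-skewed order values: log-normal, mu=4.0, sigma=0.5
            "order_value": round(rng.lognormvariate(4.0, 0.5), 2),
            # Flag every row so synthetic records are never mistaken for real ones
            "is_synthetic": True,
        })
    return rows


orders = generate_synthetic_orders(1000)
print(orders[0])
```

Flagging each row as synthetic and fixing the random seed keeps the dataset traceable and reproducible, which makes it easier to swap in real data once it becomes available.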


Master of Management in Analytics (MMA) Program

The ΒιΆΉΤΌΕΔ Master of Management in Analytics (MMA) degree is a specialized program in the evolving field of analytics and data science with a strong emphasis on applied and experiential learning through our 3 Pillars: Quantitative Methods, Technology Automation and Business Application. The MMA program touches on many foundational topics that comprise Artificial Intelligence (AI), covering key areas such as Language Modeling, Image Recognition, Analytic Visualization and Data Architecture Automation.Β Through our comprehensive curriculum, students gain practical skills and knowledge essential for tackling real-world challenges in the rapidly advancing domain of AI and analytics.
