MLOps Series: What Data-Centric AI development is and why the paradigm shift away from a model-centric approach is fruitful

Sameer Mahajan
11 min read · Apr 18, 2022

Data is the new Fuel!

Contents

  1. Introducing Data-Centric AI development
  2. Major Differences between Model Centric AI development and Data-Centric AI development
  3. Advantages of Data-Centric AI
  4. The need for a data-centric Machine Learning platform
  5. Platforms for Data-Centric AI Development
  6. Moving towards Data-Centric AI

In the previous article in the MLOps series, we explored ideas such as MLOps, data drift, and concept drift, and how to deal with them when putting machine learning systems into production. In this post, we focus on a more recent approach to AI development: data-centric AI. Several startups have embraced this approach, and large corporations have also benefited from applying it to their AI development.

Introducing Data-Centric AI development

Researchers and academics have dramatically shaped the landscape of AI development, and cutting-edge research has contributed a great deal to the field. However, that research is primarily about outperforming the state of the art and achieving maximal model performance on precisely specified datasets. The usual pattern is that a small group of individuals works on something innovative and then progressively open-sources it. More people gain access to these open-source tools, which eventually leads to better development and tools that are easier to use. As a result of this extensive research, we now have a plethora of model designs and cutting-edge algorithms.

Although many tools and models are available for any GitHub user to download and use, a commonly cited figure is that 80 to 85 percent of AI projects fail, even as AI and our reliance on AI-enabled technologies keep growing. There is a significant gap between a proof of concept and the topic of this series, productionizing ML models. In one exercise, Andrew Ng examined the top AI-related publications. After scanning through their abstracts, he found that nearly 99 percent of the papers focused on changing the model architecture to improve performance while keeping the data constant.

Current AI development is majorly Model Centric

We know that an ML system comprises data and ML algorithms, with data accounting for around 80% of the system and ML algorithms for roughly 20%. Considering this, it is evident that much potential can be realized by working more on the data rather than only upgrading the model.

These observations help us distinguish between two approaches to AI development.
1) Model-Centric AI Development: This consists of running empirical experiments on the model to improve its performance. It includes selecting the best model architecture and training procedure for a given dataset.
The data is held constant in the model-centric approach, while the code around the ML algorithms is updated to increase performance.
2) Data-Centric AI Development: This means systematically changing and improving datasets to improve your AI system’s accuracy.
In the data-centric approach, the data is modified while the code for the ML algorithm remains constant.
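
To make the distinction concrete, here is a minimal sketch of a data-centric iteration. Everything in it is an assumption made for illustration: the toy one-feature dataset, the `train_threshold` "model", and the two mislabeled points. The training code is never touched; only the labels are corrected between the two runs.

```python
from statistics import mean

def train_threshold(examples):
    """Fixed 'model': a threshold halfway between the two class means."""
    mean_0 = mean(x for x, y in examples if y == 0)
    mean_1 = mean(x for x, y in examples if y == 1)
    return (mean_0 + mean_1) / 2

def accuracy(threshold, examples):
    """Fraction of examples where 'x > threshold' matches the label."""
    hits = sum((x > threshold) == bool(y) for x, y in examples)
    return hits / len(examples)

# Hypothetical training set: the points at x=8 and x=9 are mislabeled as 0.
noisy_train = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 0), (9, 0)]
# Data-centric step: fix the two labels, leave the code untouched.
clean_train = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]

test_set = [(3.0, 0), (4.5, 0), (5.2, 1), (5.4, 1), (6.0, 1)]

acc_noisy = accuracy(train_threshold(noisy_train), test_set)
acc_clean = accuracy(train_threshold(clean_train), test_set)
print(f"same code, noisy labels: {acc_noisy:.2f}")
print(f"same code, clean labels: {acc_clean:.2f}")
```

On this toy data the accuracy rises from 0.60 to 1.00 after the label fix. The numbers only illustrate the mechanism; they say nothing about how large the effect is on a real dataset.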

In some circumstances, upgrading the model architecture or tuning the hyperparameters does not improve the algorithm’s performance, and the accuracy plateaus. When the data quality increases, however, a considerable improvement in the model’s performance can be expected. Data quality is shaped by how the data is labeled, managed, sliced, augmented, and curated. This is illustrated by the steel defect detection experiment, in which a data-centric strategy improved performance by about 17%, whereas a model-centric one did not.

Data Centric improves accuracy: Source

For these reasons, prominent figures in AI such as Andrew Ng and Martin Zinkevich are advocating data-centric AI development.

Drake approves LOL: Source

Major Differences between Model Centric AI development and Data-Centric AI development

The following are some of the critical distinctions between model-centric AI development and data-centric AI development:

1. In a data-centric approach, a lot of effort and time is put into increasing the quality of the data on which the model is trained. Instead of focusing on the model itself, more effort is spent on operations such as labeling, manipulating, slicing, and supplementing data. On the other hand, model-centric companies spend much time enhancing and identifying the proper architecture for models.

2. A data-centric strategy thrives in use cases where data is scarce. The quantity of data matters less than its quality, so much of the work goes into maintaining good data quality. This allows models to be trained quicker, better, and more frequently with fewer data points.

3. The data-centric approach to AI is more comprehensive and versatile, and it is also more adaptable and scalable than the model-centric approach. This, again, makes it appropriate for traditional settings with limited data availability, such as healthcare and financial organizations.

4. A data-centric strategy makes room for subject-matter expert (SME) knowledge in data preparation and in maintaining high-quality data throughout the ML lifecycle.

5. Using a data-centric approach, we can track the performance of our model and identify areas where it falls short. We may then concentrate our efforts on enhancing the quality or quantity of that specific subset of our data.
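
The last point, tracking where the model falls short, can be sketched as a per-slice accuracy report. The slice names (`brand_a`, `brand_b`, `brand_c`) and the prediction records are hypothetical; in practice they would come from your own validation set and its metadata.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """Return {slice_name: accuracy} for (slice_name, y_true, y_pred) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for slice_name, y_true, y_pred in records:
        total[slice_name] += 1
        correct[slice_name] += int(y_true == y_pred)
    return {name: correct[name] / total[name] for name in total}

# Hypothetical validation results: (slice, true label, predicted label).
records = [
    ("brand_a", 1, 1), ("brand_a", 0, 0), ("brand_a", 1, 1), ("brand_a", 0, 0),
    ("brand_b", 1, 1), ("brand_b", 0, 0), ("brand_b", 1, 0), ("brand_b", 0, 0),
    ("brand_c", 1, 0), ("brand_c", 0, 1), ("brand_c", 1, 1), ("brand_c", 0, 1),
]

overall = sum(t == p for _, t, p in records) / len(records)
per_slice = accuracy_by_slice(records)
worst_slice = min(per_slice, key=per_slice.get)

print(f"overall accuracy: {overall:.2f}")
for name, acc in sorted(per_slice.items(), key=lambda item: item[1]):
    print(f"  {name}: {acc:.2f}")
print(f"collect or clean more data for: {worst_slice}")
```

The overall accuracy of about 0.67 hides the fact that one slice sits at 0.25. That slice is where additional or better-labeled data is likely to pay off first.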

Advantages of Data-Centric AI

Overall, a data-centric approach leads to faster model building, reduced deployment time, and improved model accuracy. Snorkel AI cites a few such advantages:

Faster development: A Fortune 50 bank developed a news analytics application 45 times faster and with 25% more accuracy than the previous system.
Increased accuracy: A multinational telco improved the quality of over 200,000 network categorization labels, resulting in a 25% increase in accuracy over the ground truth baseline.
Cost Savings: By using Snorkel Flow, a large biotech company saved an estimated $10 million on unstructured data extraction while achieving 99 percent accuracy.

Advantages of Data Centric AI: Source

The need for a data-centric Machine Learning platform

We are surrounded by tools and software that assist us in developing and building machine learning models, such as TensorFlow, PyTorch, and Scikit-learn. Organizations have successfully adopted these tools for model building over the years. Despite this, the same organizations must maintain a separate collection of software to gather, process, and store data. Because data is the fuel that drives model performance, there is a need for a platform that aids in both data upkeep and model development. Databricks describes this as Unified Analytics, which brings the widely disparate worlds of data science and data engineering together on a common platform: data engineers can build data pipelines across siloed systems and prepare labeled datasets for model building, while data scientists can explore and visualize data and collaborate on model building. Unified Analytics provides a single engine for preparing high-quality data at scale and iteratively training machine learning models on that same data. It also allows data scientists and data engineers to cooperate more effectively across the AI lifecycle.

Platforms for Data-Centric AI Development

1. Continual

Continual describes itself as the missing AI layer for the modern data stack. It aims to let teams build continually improving predictive models, from customer churn to inventory estimates, without technical or operational overhead.

The data-centric process at Continual includes:

a) A shared, SQL-based feature repository that allows teams to quickly share, iterate, and expand feature use across various models,

b) An automated AI engine that builds cutting-edge models from all of your data without writing code,

c) MLOps and XAI capabilities built in for tracking data, models, and predictions over time.

2. Snorkel AI

Snorkel provides a single platform for programmatic labeling (discussed later in this post), along with the ability to train models efficiently, improve performance iteratively, and deploy applications rapidly. According to Snorkel, you can customize cutting-edge models by training them with your data and adapt them to new data or goals with a few lines of code, and use ML to go beyond simple rules while keeping the ability to audit and adapt. Thousands of data points can be labeled programmatically in hours while your data remains in-house and confidential.

3. YData

The YData Platform is a data experimentation environment that provides data science teams with a collection of tools that not only make their job less monotonous and error-prone, but also quicker and ready to scale.

Synthetic data is artificially created data that mimics the statistical properties of real data without holding any identifying information, thereby protecting individuals’ privacy.
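
As a rough illustration of the synthetic data idea, the sketch below fits a mean and standard deviation to each column of a tiny made-up table and samples new rows from those estimates. This is not how YData or any other platform implements it: the column meanings (age, income), the values, and the helper names are all assumptions, the columns are treated as independent (so correlations are lost), and the approach carries no formal privacy guarantee.

```python
import random
import statistics

def fit_columns(rows):
    """Estimate (mean, standard deviation) for each numeric column."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]

def sample_synthetic(params, n, seed=0):
    """Draw n rows, sampling every column independently from a Gaussian."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params) for _ in range(n)]

# Hypothetical 'real' table: (age, yearly income).
real_rows = [(34, 52000.0), (45, 61000.0), (29, 48000.0), (52, 75000.0), (41, 58000.0)]

params = fit_columns(real_rows)
synthetic_rows = sample_synthetic(params, n=1000)

real_mean_age = statistics.mean(row[0] for row in real_rows)
synthetic_mean_age = statistics.mean(row[0] for row in synthetic_rows)
print(f"real mean age:      {real_mean_age:.1f}")
print(f"synthetic mean age: {synthetic_mean_age:.1f}")
```

The synthetic rows share the per-column statistics of the original table, yet none of them is a copy of an original record. Production tools model the joint distribution of the columns rather than each column on its own.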

Moving towards Data-Centric AI

The paradigm shift from model-centric AI to data-centric AI may be broken down into numerous components that can be integrated with the ML lifecycle, eventually leading to a data-centric ML system.

  1. Suitable Data Checks: The data should be capable of answering the precise questions we are investigating. It should therefore be relevant to the problem at hand and meet the quality requirements of the use case. To avoid needless bias, the data should also be credible and representative of the population.
  2. Data Cleaning: Data cleaning removes discrepancies in the data. We have all heard the expression “garbage in, garbage out”: if the data we use to train our model is distorted, we will end up with confusing and uninterpretable results. Data cleaning now extends beyond the basics of checking for missing values and other inconsistencies; cleaning procedures are applied specifically to increase the model’s accuracy. It has emerged as a critical component of data-centric AI development.
  3. Data Consistency: In a data-centric strategy, data consistency is critical, because inconsistent data can significantly hurt the model’s performance and interpretability. Data consistency means applying the same conventions across the whole dataset. In a voice recognition system, for example, speech that starts with a filler sound like “umm” can be transcribed as “Umm…I would like a coffee” or “Umm I would like a coffee,” or the filler can be dropped so that the transcript becomes “I would like a coffee.” No single choice is the correct one, but data consistency dictates that one convention be chosen and used for all data. The same goes for data labeling: a consistent labeling approach should be applied to all data.
  4. Quality Over Quantity: Rather than increasing the quantity of data and running it through data-hungry models, the data-centric strategy focuses on boosting the quality of the data. Increasing the quality of data often yields better results than increasing the amount of data, and it has further advantages, such as faster computation, lower computing costs, and lower latency. This is especially useful in businesses where data is scarce. Data collection is also an expensive operation; if quality yields superior outcomes, why invest in enormous amounts of data? We can improve data quality in the following ways:
    1. Collect targeted data for the subset of data where we know our model does not perform well.
    2. Discard bad data points. They produce ambiguous results, so they must be removed to ensure high data quality.
    3. Check data consistency, which makes it easier to maintain good-quality data.
  5. Handling edge cases using data augmentation: In data-centric development, edge cases must be addressed for a specific use case. For example, in an image classification use case, a model may fail to detect cars against dark backgrounds if it was trained only on photos of cars with light backgrounds. The most straightforward answer is to acquire additional photos of cars with dark backgrounds. However, as noted above, data collection can be expensive, which is where approaches like data augmentation come in handy. Data augmentation can, for example, change an image’s orientation or add noise or a different background to it. Because the label is preserved, this increases the model’s resilience against edge cases.
  6. Labeling using programming: Finding a labeled dataset is difficult, and building one is time-consuming and expensive. Manual labeling alone does not scale in the age of big data. Because supervised approaches account for much of ML’s success and depend on labeled data, frameworks such as Snorkel help you label your data programmatically and understand various aspects of the data. Programmatic labels let us create a labeled dataset in a short amount of time while both protecting data privacy and improving data quality.
  7. Granular Model Evaluation: While evaluating a model’s performance is a given, the data-centric method calls for a more detailed evaluation. It recommends going beyond aggregate accuracy, precision, recall, or F1 scores and assessing the model on the subsets of data where it performs poorly. For example, in the bag classification problem explained in the previous article, we input a product image and receive the price and brand of the product as a result. We can analyze the model’s performance and determine where it falls short. If the model doesn’t do well on Louis Vuitton bags, we can go back to the data phase of our ML lifecycle and think about collecting more Louis Vuitton data or improving the quality of the Louis Vuitton subset.
  8. SMEs play a significant role: I’d like to conclude this post by emphasizing the importance of subject-matter experts (SMEs) in a data-centric strategy. The steps above can be accomplished effectively with their assistance. With SMEs involved, we can ensure strong data quality and a sound understanding of the data, which is the foundation of data-centric AI development.
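
The programmatic labeling step from the list above can be sketched in plain Python. This does not use the Snorkel API: the labeling functions, the spam/ham task, and the example texts are hypothetical, and the votes are combined by simple majority, whereas Snorkel itself combines them with a learned label model. Rules like these are also a natural place to encode SME knowledge.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    """Messages with a URL are voted SPAM."""
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_money_words(text):
    """Messages with typical bait words are voted SPAM."""
    bait = ("free", "winner", "prize")
    return SPAM if any(word in text.lower() for word in bait) else ABSTAIN

def lf_short_reply(text):
    """Very short messages are voted HAM."""
    return HAM if len(text.split()) <= 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_money_words, lf_short_reply]

def weak_label(text):
    """Majority vote over non-abstaining functions; ties and no votes abstain."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

texts = [
    "You are a WINNER, claim your free prize at http://example.com",
    "ok see you then",
    "Free entry",
    "Could you send me the quarterly report before the meeting tomorrow?",
]
labels = [weak_label(text) for text in texts]
print(labels)
```

The first message is labeled SPAM and the second HAM. The third gets conflicting votes and the fourth gets none, so both are left unlabeled; those are the cases worth passing to an SME or covering with an additional labeling function.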

Thank you!

In conclusion, we discussed the different approaches to AI development and stated a few advantages of data-centric AI development over model-centric AI development. Although the data-centric approach holds great promise, we cannot ignore the sheer progress researchers have made using the model-centric approach. We are at a stage where we can apply state-of-the-art models at the push of a button, and that’s amazing!

Here is my LinkedIn handle!

References

https://towardsai.net/p/data-centric-ai/the-principles-of-data-centric-ai-development
