Analytics Using CDP (Customer Data Platform)

Wipro Tech Blogs
Oct 10, 2023

Dipta Chakraborty, Bhargav Mitra

This paper addresses the challenge of predictive analytics using Customer Data Platforms (CDPs) and Machine Learning (ML). It suggests overcoming CDP limitations by using Databricks, a scalable platform built around Spark. The paper emphasizes the need for both quick and deep insights from CDP data, and highlights issues such as limited computation power, a narrow range of algorithms, and weak data visualization. It recommends data wrangling, visualization, and an MLOps framework for quality insights. The paper also discusses slow data retrieval and advocates segregating the data to work around it. External dashboarding tools like Tableau and Power BI are preferred for improved visualization. In summary, it advocates integrating the CDP with Databricks for quicker and deeper analysis, which benefits decision-making based on customer data.

Introduction:

“It was the weekend, and John was shopping on an e-commerce platform. He added items to his cart but, before checking out, left the website for a different site, abandoning the cart.” Can the e-commerce company predict this kind of shopping cart abandonment? Yes, it is possible to anticipate the abandonment of a shopping cart based on user behavior. A Customer Data Platform (CDP) [1] can assist in this process by gathering all relevant user data.

However, a CDP alone cannot predict or identify the moment when the user is about to abandon the cart. This is due to its limitations in exploring large amounts of data, conducting feature engineering, and creating machine learning models. Additionally, integrating a CDP with real-time e-commerce platforms can be challenging, since a CDP primarily relies on batch processing.

So, what’s the solution? We need an ecosystem that enables us to analyze vast amounts of data and build ML models on top of it. To analyze large data, we need a system that scales its compute with data volume and can use accelerators such as GPUs where the workload calls for them. Spark provides such an ecosystem: it scales with data volume, it is fault-tolerant, and Spark ML can be used to develop ML models on it. Databricks can therefore be considered a highly suitable option, as it offers the necessary tools to analyze and explore extensive datasets. It provides an ecosystem that supports the creation and deployment of machine learning models for accurate inference.

To address shopping cart abandonment, or any other use case that leverages a CDP, the data can be brought into the Spark ecosystem, and Databricks can be used to perform essential tasks such as data exploration, feature engineering, model building, and model deployment for AI and machine learning applications. Databricks also offers an MLOps [2] framework that enables efficient operation and maintenance of ML models.
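As an illustration of the feature engineering step for the cart abandonment use case, the sketch below turns raw clickstream events into one feature row per session, with an "abandoned" label. It is plain Python so that it runs anywhere; on Databricks the same aggregation would more likely be a Spark group-by. The event names, the session IDs, and the rule used for the label are assumptions made for this example, not fields of any particular CDP.

```python
from collections import defaultdict

# Hypothetical clickstream events exported from a CDP: (session_id, event_type).
EVENTS = [
    ("s1", "view_item"), ("s1", "add_to_cart"), ("s1", "add_to_cart"),
    ("s2", "view_item"), ("s2", "add_to_cart"),
    ("s2", "begin_checkout"), ("s2", "purchase"),
    ("s3", "view_item"),
]


def session_features(events):
    """Aggregate raw events into one feature row per session."""
    counts = defaultdict(lambda: defaultdict(int))
    for session_id, event_type in events:
        counts[session_id][event_type] += 1

    rows = {}
    for session_id, c in counts.items():
        cart_adds = c["add_to_cart"]
        purchased = c["purchase"] > 0
        rows[session_id] = {
            "n_events": sum(c.values()),
            "n_cart_adds": cart_adds,
            "reached_checkout": c["begin_checkout"] > 0,
            # Assumed label: items were added to the cart, but nothing was bought.
            "abandoned": cart_adds > 0 and not purchased,
        }
    return rows


features = session_features(EVENTS)
for session_id, row in features.items():
    print(session_id, row)
```

Rows like these would be the training input for a classifier; the label definition in particular should be agreed with the business before any model is built.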

Need:

CDPs unify customer records, but to get full value from a CDP we need to derive insights from the unified data. Broadly, two types of insight can be extracted from CDP data: quick insights, obtained by exploring the data, and deep insights, derived using machine learning algorithms.

We delivered insights from a CDP for a beverage client and faced several challenges along the way:

A few CDP products offer lightweight ML capabilities, along with a platform to run and automate analytical queries. In our experience, however, their computation power is not adequate for the large volumes of data held in a CDP, which potentially limits the impact of the insights on decision-making.

In our experience, the limited range of AI and ML algorithms available makes it hard to address varied business use cases. For example, the lack of unsupervised ML algorithms is a roadblock for segmentation-based use cases.

Both quick and deep insights require data wrangling, data visualization, and feature generation. We faced challenges in plotting the data in particular formats, e.g., histograms and boxplots. Further, experimental data science requires feature selection, model selection, and model comparison, which were difficult to carry out within the CDP. This can affect the quality and accuracy of the resulting models and insights.

We often need to deploy an ML model in production for business use cases such as real-time prediction of shopping cart abandonment. The CDP does not provide a place to deploy an ML model for real-time inference, and a deployed model needs an MLOps framework to monitor and maintain it efficiently and effectively. We faced major challenges in maintaining ML models in the CDP.

Sometimes we need to generate quick insights by extracting data from the CDP, but because the data was huge, fetching it with a single query took hours. We therefore had to segregate the data, fetch insights from each segment, and aggregate the results after retrieval.

The business wants to showcase insights in dashboarding tools such as Tableau and Power BI. Some lightweight visualization and dashboarding tools are available within the CDP, but they are not powerful, and filtering the data on specific criteria is difficult. As there is no PULL mechanism available in the CDP, tools like Power BI or Tableau cannot fetch data from it directly. Instead, we must send the data (PUSH mechanism) to a cloud platform such as Azure, from where the insights can be showcased.
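The segregate-then-aggregate workaround for slow queries, mentioned among the challenges above, can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: the in-memory ORDERS list and the fetch_segment function stand in for the CDP and for one small query per segment, and segmenting by month is only one possible choice.

```python
# Hypothetical order rows: (month, customer_id, order_value).
ORDERS = [
    ("2023-01", "c1", 20.0), ("2023-01", "c2", 35.0),
    ("2023-02", "c1", 15.0), ("2023-02", "c3", 50.0),
    ("2023-03", "c2", 30.0),
]


def fetch_segment(month):
    """Stand-in for one small CDP query restricted to a single month."""
    return [row for row in ORDERS if row[0] == month]


def partial_aggregate(rows):
    """Reduce one segment to mergeable partials (sum and count, not the mean)."""
    return {"total": sum(value for _, _, value in rows), "count": len(rows)}


def combine(partials):
    """Merge the per-segment partials into the overall figures."""
    total = sum(p["total"] for p in partials)
    count = sum(p["count"] for p in partials)
    return {"total": total, "count": count, "mean": total / count if count else 0.0}


months = ["2023-01", "2023-02", "2023-03"]
summary = combine([partial_aggregate(fetch_segment(m)) for m in months])
print(summary)
```

Each segment carries a sum and a count rather than its own average, because averaging per-segment averages gives the wrong answer when segments differ in size.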

To extract insights from CDP data, the most effective approach is to bring the data into a platform that can both operate on big data and address ML use cases. This is where we used Databricks as the solution. We used Spark queries to obtain quick insights and leveraged Spark ML to build deeper, more detailed use cases on the data.

Approach:

We chose Databricks as the tool to derive insights from CDP data for the following reasons:

We used Databricks to store the data in different layers, namely bronze (raw data), silver (curated data), and gold (aggregated data), so that we can manage the data efficiently.
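The flow between the three layers can be pictured with a small sketch. It is plain Python and only shows the idea; on Databricks each layer would normally be a table processed with Spark. The field names and the cleaning rules (trimming and lower-casing emails, dropping unparseable spend values and exact duplicates) are assumptions chosen for the example.

```python
# Hypothetical raw CDP export (bronze layer); field names are illustrative only.
bronze = [
    {"customer_id": "c1", "email": " A@Example.com ", "spend": "20.5"},
    {"customer_id": "c1", "email": " A@Example.com ", "spend": "20.5"},  # duplicate
    {"customer_id": "c2", "email": "b@example.com", "spend": "n/a"},     # bad value
    {"customer_id": "c3", "email": "c@example.com", "spend": "10"},
    {"customer_id": "c1", "email": "a@example.com", "spend": "4.5"},
]


def to_silver(rows):
    """Curate: normalise fields, drop unparseable rows and exact duplicates."""
    seen, curated = set(), []
    for row in rows:
        try:
            spend = float(row["spend"])
        except ValueError:
            continue  # skip rows whose spend cannot be parsed
        key = (row["customer_id"], row["email"].strip().lower(), spend)
        if key not in seen:
            seen.add(key)
            curated.append({"customer_id": key[0], "email": key[1], "spend": spend})
    return curated


def to_gold(rows):
    """Aggregate: total spend per customer, ready for dashboards."""
    totals = {}
    for row in rows:
        totals[row["customer_id"]] = totals.get(row["customer_id"], 0.0) + row["spend"]
    return totals


silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), "raw rows ->", len(silver), "curated rows ->", gold)
```

Keeping the raw layer untouched means the curation rules can be changed later and the silver and gold layers rebuilt from it.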

The Spark framework can be utilized to handle big data. Spark provides powerful capabilities for processing and analyzing large volumes of data efficiently.

Databricks offers a range of AI and ML capabilities and provides users with notebooks for experimental data science work. Users can perform tasks such as Exploratory Data Analysis (EDA), feature engineering, feature selection, model building, and model comparison. These capabilities empower data scientists and analysts to conduct in-depth analyses and develop robust machine-learning models.
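Segmentation is one kind of experiment such a notebook makes possible, since it needs an unsupervised algorithm. The sketch below is a bare-bones k-means in plain Python, included only to show the technique; in practice we would expect a library implementation (for example the one in Spark ML) to be used instead. The two customer features, the toy data, and the naive choice of the first k points as starting centroids are all assumptions for the example.

```python
import math

# Hypothetical customer features: (orders per month, average basket value).
CUSTOMERS = [(1, 10), (2, 12), (1, 11), (9, 80), (10, 85), (8, 78)]


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as the initial centroids."""
    centroids = [tuple(float(x) for x in p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing, so the result is stable
        labels = new_labels
    return labels, centroids


labels, centroids = kmeans(CUSTOMERS, k=2)
print(labels)
print(centroids)
```

On real data the features would usually be scaled first, and a more careful initialisation would be used, because the outcome of k-means depends on both.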

ML models are easy to maintain in Databricks because it offers an MLOps framework, which improves efficiency, reproducibility, and scalability in managing machine learning projects. We also used MLflow to maintain model versions and to transition models from one environment to another, e.g., dev to stage and stage to prod. We deployed the model for real-time inference in Databricks, which is far more straightforward than in a CDP.
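The idea of versioned models moving from dev to stage to prod can be shown with a toy example. The class below is a hypothetical in-memory stand-in written for this illustration; it is not MLflow's API and does not reflect how we configured MLflow. The accuracy gate and the rule that a newly promoted version archives the previous holder of a stage are assumptions as well.

```python
from dataclasses import dataclass, field

STAGES = ["dev", "stage", "prod"]


@dataclass
class ModelVersion:
    version: int
    accuracy: float
    stage: str = "dev"


@dataclass
class ToyRegistry:
    """In-memory stand-in for a model registry; NOT MLflow's API."""
    versions: list = field(default_factory=list)

    def register(self, accuracy):
        """Record a new model version; every version starts in dev."""
        mv = ModelVersion(version=len(self.versions) + 1, accuracy=accuracy)
        self.versions.append(mv)
        return mv

    def promote(self, version, min_accuracy):
        """Move a version one stage forward if it clears the quality gate."""
        mv = self.versions[version - 1]
        if mv.accuracy < min_accuracy:
            raise ValueError(f"version {version} is below the {min_accuracy} gate")
        if mv.stage == STAGES[-1]:
            return mv  # already in prod, nothing to do
        next_stage = STAGES[STAGES.index(mv.stage) + 1]
        # Assumed rule: one version per stage, so the previous holder is archived.
        for other in self.versions:
            if other.stage == next_stage:
                other.stage = "archived"
        mv.stage = next_stage
        return mv


registry = ToyRegistry()
registry.register(accuracy=0.81)
registry.register(accuracy=0.86)
registry.promote(1, min_accuracy=0.8)  # v1: dev -> stage
registry.promote(1, min_accuracy=0.8)  # v1: stage -> prod
registry.promote(2, min_accuracy=0.8)  # v2: dev -> stage
registry.promote(2, min_accuracy=0.8)  # v2: stage -> prod, v1 is archived
print([(v.version, v.stage) for v in registry.versions])
```

A real registry also stores the model artifacts and their lineage; the sketch keeps only the version and stage bookkeeping.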

Databricks has readily available connections to popular business intelligence tools, such as Power BI and Tableau. These connections facilitate seamless data visualization and reporting, allowing users to create interactive dashboards and reports from the processed data.

Additionally, Databricks can speed up decision science for a business, because many teams can work on different use cases simply by creating new clusters as needed. For example, one team can work on quick insights, another on a customer segmentation use case, and a third on leveraging Generative AI for the business, each on its own cluster.

We have discussed the need for Databricks to derive insights from CDP data. The architecture proposed below derives insights from CDP data by leveraging Databricks.

The high-level delivery steps of ML-driven insights using Databricks are:

The illustrated architecture diagram below summarizes the process of delivering insights using Databricks:

Benefits:

The main benefits of using Databricks to derive insights from a CDP are the following:

· Scalability: Databricks offers horizontal scalability, allowing businesses to handle large volumes of data within the CDP. This scalability ensures that the platform can accommodate growing data needs and perform complex computations efficiently.

· Advanced Analytics: Databricks provides powerful analytics capabilities, including machine learning (ML) and deep learning frameworks, enabling businesses to derive meaningful insights from their CDP data. ML algorithms can be applied to identify patterns, make predictions, and uncover valuable customer insights.

· Integration with Ecosystem: Databricks integrates seamlessly with other components of the data ecosystem, such as data lakes, data warehouses, and streaming platforms. This allows businesses to leverage their existing infrastructure and tools while incorporating the CDP data into their broader data strategy.

· Collaboration and Productivity: Databricks offers collaborative features that facilitate teamwork and enhance productivity. Multiple data professionals can work together, share code, and collaborate on data projects within the platform, promoting efficiency and knowledge sharing.

Overall, to unlock the full potential of a CDP, we recommend Databricks as the most suitable tool for deriving insights and improving decision-making.

References

[1] S. Earley, The Role of a Customer Data Platform, IEEE, 2018.

[2] M. M. John, H. H. Olsson, and J. Bosch, Towards MLOps: A Framework and Maturity Model, Palermo, Italy: IEEE, 2021.

[3] S. Tang, B. He, C. Yu, Y. Li, and K. Li, A Survey on Spark Ecosystem: Big Data Processing Infrastructure, Machine Learning, and Applications, IEEE, 2022.
