Causal Exploration

14 min read · Apr 12, 2023

A world beyond correlation

By Siladitya Sen, Dipojjwal Ghosh, Soumya Talukder

AI analysts frequently receive pointed questions from the business, such as “Why are my customers leaving?”. To answer, they often reframe the question as “What factors are causing customers to churn and stop shopping with the business?”. Standard AI and ML practice relies heavily on identifying a correlation between the event and its contributing factors, then projecting the future from that relationship. Analysts would present stakeholders with a probable relationship between customers’ demographic and social patterns and their existing purchase patterns, and use it to gauge each customer’s chance of discontinuing. But is it that simple? Did the presence of certain behavioural and demographic patterns actually cause the customer to leave?

If we look for the prescribed patterns across the entire customer database, we can find several contradictions: customers who match the pattern yet continue to shop with the business, contrary to the AI model’s predictions. The correlation between the factors and the event could have been mere coincidence. Here the concept of causality comes in, which allows the causal factors of an event to be identified. These factors can in turn be incorporated into decision-making systems to address problems more accurately. Moreover, exploring causal factors can give better visibility of hidden relationships in the data and help mitigate the multi-collinearity issues prevalent in traditional AI and ML. According to Gartner’s coverage of its 2022 Hype Cycle for AI [1], Causal AI is slowly becoming a favourite of next-generation AI technologists.

Causality

Causality [2], in layman’s terms, can be defined as the “influence by which one event, process or state, a cause, contributes to the production of another event, process or state, an effect, where the cause is partly responsible for the effect, and the effect is partly dependent on the cause.” Statisticians were long skeptical of moving from association to causality, but a significant breakthrough came with successive research from Rubin [3] [4], Holland [5], Robins [6], Spirtes [7] and Pearl [8] [9].

Causation has been treated in the statistical literature through four broad approaches:

(i) Probabilistic Causality: Expresses causality as probabilistic statements about defined events.

(ii) Counterfactual Causality: Addresses the question of “what if something had been different in the past”.

(iii) Structural Models: Assumes that the system of interest is driven by stochastic mechanisms.

(iv) Decision Theory: Addresses the question of making optimal decisions under uncertainty.

With the exception of Probabilistic Causality, all of these approaches treat causation as the effect of an intervention in one or more variables on the target variable. This matters to statisticians because the underlying aim of the field is to devise interventions on causal factors that alter the course and outcome of an event.

Figure 1: Causal Intervention to Event [10]

For example, the Montreal Protocol of 1987 was devised on the basis of the causal relationship [11] unearthed between CFCs and ozone depletion, which led to a global phase-out of CFCs in cooling technologies and a reduction in the depletion of the ozone layer. Nonetheless, it has always been difficult to determine whether a relationship is merely an association or involves causation, for reasons such as confounding effects, time dependency, and feedback relationships.

Figure 2: Different relationships between variables — a) Effect of Confounding, b) Time dependency, c) Feedback

In Figure 2(b), the price of bread in the UK and the sea level in Venice show a strong positive relationship [12] because both rise over time, not because either has any direct association with the other. Figure 2(c) shows a feedback relationship, where the association arises because the two variables drive each other. In Figure 2(a), however, Lifestyle confounds the relationship between Risk factors and the chance of disease, because it affects both. Thus, an intervention on lifestyle can alter both the risk factors and the chance of disease, something that standard association-based approaches would overlook.
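The confounding case described above can be made concrete with a small calculation. The sketch below is illustrative only: the probabilities are invented, and disease is assumed to depend on lifestyle alone, so any association between the risk factor and disease comes entirely from their shared cause.

```python
from itertools import product

# Hypothetical probabilities, chosen only for illustration.
P_LIFESTYLE = {"poor": 0.5, "healthy": 0.5}
P_RISK_GIVEN_L = {"poor": 0.8, "healthy": 0.2}      # P(risk factor present | lifestyle)
P_DISEASE_GIVEN_L = {"poor": 0.7, "healthy": 0.1}   # P(disease | lifestyle), independent of risk

def joint():
    """Enumerate P(lifestyle, risk, disease) for the confounded structure."""
    table = {}
    for l, r, d in product(P_LIFESTYLE, (0, 1), (0, 1)):
        p_r = P_RISK_GIVEN_L[l] if r else 1 - P_RISK_GIVEN_L[l]
        p_d = P_DISEASE_GIVEN_L[l] if d else 1 - P_DISEASE_GIVEN_L[l]
        table[(l, r, d)] = P_LIFESTYLE[l] * p_r * p_d
    return table

def p_disease(table, risk, lifestyle=None):
    """P(disease = 1 | risk [, lifestyle]) by summing the joint table."""
    rows = {k: p for k, p in table.items()
            if k[1] == risk and (lifestyle is None or k[0] == lifestyle)}
    return sum(p for k, p in rows.items() if k[2] == 1) / sum(rows.values())

table = joint()
# Gap in disease probability between risk present and risk absent.
pooled_gap = p_disease(table, 1) - p_disease(table, 0)
stratum_gap = p_disease(table, 1, "poor") - p_disease(table, 0, "poor")
print(round(pooled_gap, 2), abs(stratum_gap) < 1e-9)
```

Pooled over lifestyles, disease looks far more likely when the risk factor is present; within a single lifestyle group the gap vanishes, which is the signature of a purely confounded association.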

Our Approach to Causality

At present, there have been very few breakthroughs in unearthing causality in the data-driven analytical world, and most of the available solutions are proprietary and reside in licensed AI platforms. Two of these proprietary platforms have gained a good reputation: Geminos and CausaLens. CausaLens identifies hidden relationships in the data, transforms causal graphs into predictive models, and ends with a decision optimization framework, whereas Geminos focuses on automating causal modeling.

Wipro is one of the torchbearers in the field of Causal AI with its Causality Exploration Grid (CEG). CEG is an AI module that uncovers hidden relationships in the data and identifies the true nature of those relationships in terms of cause and effect, providing an in-depth view of the data. We used the NOTEARS algorithm for causal discovery and developed a module on top of CausalNex for causal inference. NOTEARS [13] is an optimization-based approach that estimates a directed acyclic graph (DAG) by minimizing an objective function subject to an acyclicity constraint. Using NOTEARS, we learn the network structure and check for causal relationships among the variables. CausalNex [14] uses Bayesian Networks to combine machine learning and domain expertise for causal reasoning. The Bayesian Network is used to find the conditional probabilities of the variables that have a causal relation with the target variable.

Illustrative Solution Architecture

Let us try to answer the aforementioned business question using our Causal Exploration Grid tool. We had data from a retail client looking to identify the customers who are likely to churn. The data provided socio-economic, demographic, and lifestyle-related information for 850+ customers, along with their shopping behavior (Frequency of purchase, Average Dollar Basket size, Category of Product purchase, No. of Promotions redeemed) and some information about customer preference in the mode of payment and purchase. The data also recorded the retailer’s actions: whether each customer was included in promotional activities and whether retention strategies were applied to them. We will follow the process flow shown below as we try to uncover the factors causing customers to churn.

Figure 3: Flow diagram of Wipro’s Causality Exploration Grid

1. Obtain Structure Model

Structural Models, or Structural Causal Models (SCMs) [3], are conceptual models that describe the causal mechanism of a system. They are used to measure directed dependencies among a set of variables in a data set. These variables can be of three types: explanatory variables, outcome variables, and unobserved variables. Both outcome and explanatory variables are observed variables, i.e., variables describing processes measured in our data set, while unobserved variables are “background processes” for which we have no observational data. We can represent an explanatory variable as i, an outcome variable as d, and an unobserved variable as uₜ. Causal relationships, which describe the causal effect variables have on each other, generally extend from observed and unobserved variables to observed variables and are represented as i ← fᵢ(uₜ). The structural model is built by evaluating the candidate relationships under the probability distribution defined over the unobserved variables, which describes the likelihood that each variable takes a particular value, subject to the restriction that each relationship runs in one direction only.

SCMs [4] serve as a comprehensive framework unifying graphical models, structural equations, and counterfactual and interventional logic. Graphical models serve as a language for structuring and visualizing knowledge about the world and can incorporate both data-driven and human inputs. Counterfactuals let us articulate what we want to know, and structural equations tie the two together.
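As an illustration of these ideas, the sketch below encodes a toy SCM as three structural assignments, each driven by its parents and its own unobserved noise term, and compares observational data with data generated under an intervention. The variable names, the thresholds, and the `do` argument are assumptions made for this example; they are not taken from the CEG tool or from the client data.

```python
import random

def sample_scm(n, do=None, seed=0):
    """Draw n units from a toy SCM; `do` replaces structural equations with constants."""
    do = do or {}
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Unobserved background variables (exogenous noise), one per equation.
        u_cltv, u_promo, u_churn = rng.random(), rng.random(), rng.random()
        # Each observed variable is assigned from its parents and its own noise.
        cltv = do.get("cltv", int(u_cltv < 0.4))
        promo = do.get("promo", int(u_promo < (0.7 if cltv else 0.3)))
        churn = do.get("churn", int(u_churn < 0.6 - 0.25 * cltv - 0.25 * promo))
        rows.append({"cltv": cltv, "promo": promo, "churn": churn})
    return rows

def churn_rate(rows):
    """Share of units that churned."""
    return sum(r["churn"] for r in rows) / len(rows)

observed = sample_scm(5000)
# Same noise draws, but the promo equation is overridden: do(promo = 1).
intervened = sample_scm(5000, do={"promo": 1})
print(churn_rate(observed), churn_rate(intervened))
```

Because both runs reuse the same seed, every unit keeps its background noise and only the promo equation changes, which is the counterfactual reading of "what would have happened had this customer been included in promotions".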

We used CausalNex to learn the SCM (based on Bayesian Networks in our case) from our data, as illustrated in the code snippet below.

CausalNex utilizes the NOTEARS [13] algorithm, which performs Bayesian Network Structure Learning (BNSL) as a purely continuous optimization over real matrices instead of a heuristic combinatorial search. The resulting structural model can be interpreted as a causal graph and visualized as a Directed Acyclic Graph (DAG), covered in our next step.
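The continuous formulation rests on a smooth acyclicity measure from the NOTEARS paper [13]: for a weighted adjacency matrix W over d variables, h(W) = tr(exp(W ∘ W)) - d, where ∘ is the elementwise product, equals zero exactly when the graph has no directed cycles. A minimal pure-Python sketch of that function follows. The helper names and the number of Taylor-series terms are choices made for this example, and it does not reproduce CausalNex's own implementation.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def notears_h(w, terms=30):
    """h(W) = tr(exp(W * W elementwise)) - d; zero iff W encodes an acyclic graph."""
    d = len(w)
    squared = [[x * x for x in row] for row in w]
    # Truncated Taylor series of the matrix exponential: I + A + A^2/2! + ...
    power = [[float(i == j) for j in range(d)] for i in range(d)]
    trace = float(d)  # trace of the identity term
    factorial = 1.0
    for k in range(1, terms):
        power = matmul(power, squared)
        factorial *= k
        trace += sum(power[i][i] for i in range(d)) / factorial
    return trace - d

# Hypothetical weights: entry [i][j] is the strength of the edge i -> j.
acyclic = [[0.0, 0.8, 0.0],
           [0.0, 0.0, 0.5],
           [0.0, 0.0, 0.0]]
cyclic = [[0.0, 1.0],
          [1.0, 0.0]]
print(notears_h(acyclic), notears_h(cyclic))
```

An optimizer can then minimize a data-fit loss over W while driving h(W) to zero, which is what replaces the combinatorial search over graph structures.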

2. Visualize Structure Model using DAG

A Directed Acyclic Graph (DAG) [15] [16] is a graph consisting of nodes and edges, where the nodes are objects and the edges between them represent connections between these objects. Each edge is oriented from a parent node to a child node. A path in a directed graph is a sequence of edges such that the ending node of each edge is the starting node of the next edge in the sequence. A cycle is a path in which the starting node of its first edge equals the ending node of its last edge. A directed acyclic graph is a directed graph that has no cycles. A DAG can lay out the relationships between different variables and show how each node affects the outcome. The SCM from the previous stage can be refined, before or after visualization, by removing irrelevant edges using edge importance and knowledge shared by domain experts.
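The definitions above translate directly into a check. The sketch below tests acyclicity with Kahn's topological-sort algorithm and prunes weak edges by weight; the edge list, the weights, and the threshold are hypothetical and stand in for whatever the learned structure and the domain experts provide.

```python
from collections import deque

def is_acyclic(edges):
    """Kahn's algorithm: a directed graph is acyclic iff all nodes can be topologically ordered."""
    nodes = {n for edge in edges for n in edge[:2]}
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child, *_ in edges:
        children[parent].append(child)
        indegree[child] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Nodes on a cycle never reach indegree zero, so they are never visited.
    return visited == len(nodes)

def prune(edges, threshold):
    """Keep only edges whose absolute weight reaches the threshold."""
    return [(p, c, w) for p, c, w in edges if abs(w) >= threshold]

# Hypothetical (parent, child, weight) triples, not the article's learned graph.
edges = [
    ("CLTV", "Churn", 0.9),
    ("Promo_Inclusion", "Churn", 0.8),
    ("Lifestyle", "CLTV", 0.6),
    ("Churn", "Lifestyle", 0.05),  # weak edge that closes a cycle
]
print(is_acyclic(edges), is_acyclic(prune(edges, 0.1)))
```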

We used NetworkX to visualize our DAG, using the code snippet below.

The code generates the graph below, which shows the strength of the relationships among the variables. One can identify the sets of influential variables that have a strong impact on the chance of a customer churning. Some of these influential variables have the potential to be causal factors for the event of interest (churn).

From the graph, we observed that Lifestyle, customer preference (mode of payment and shopping), Category of Product Purchased, Frequency of Purchase, $Basket Size, Relationship Length, CLTV, Earlier Retention, and #Promo Inclusion have a strong presence in determining the chance of customers churning. One can arrive at a similar list of influential variables with any standard AI/ML process. The main issue with such approaches, however, is that they can make each influential factor look like a lever for altering the churn rate, without establishing which factors are causal and which are merely intermediate variables.

Note that CLTV, #Promo Inclusion, and Earlier Retention were observed to have a direct impact on Churn and can be tested as potential causal factors. Thus, the DAG helps us form hypotheses about probable causal factors from the set of influential ones and eliminate variables with little impact.

Our hypotheses for this use case are as follows.

· CLTV is one of the causal factors.

· #Promo Inclusion is one of the causal factors.

· Earlier Retention is one of the causal factors.

These hypotheses need to be validated using Bayesian Network models.

3. Process data with Discretization

The data needs to be discretized (or encoded) with a suitable label encoding before our tool can use it. For this purpose, we created a discretization process, driven by data type, that produces the label encoding for each variable in the data set. The code snippet for this is illustrated below.
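A type-driven discretization of this kind might look like the sketch below, which bins numeric columns into equal-frequency levels and maps categorical columns to integer codes. The function names, the three-level labels, and the sample values are assumptions made for illustration rather than the tool's actual encoding.

```python
def quantile_bins(values, labels=("Low", "Medium", "High")):
    """Assign each numeric value to an equal-frequency bin according to its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    binned = [None] * len(values)
    for rank, idx in enumerate(order):
        binned[idx] = labels[rank * len(labels) // len(values)]
    return binned

def label_encode(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

# Hypothetical columns for six customers.
cltv = [10, 250, 40, 900, 75, 500]
payment = ["Card", "Cash", "Card", "Wallet", "Cash", "Card"]

cltv_levels = quantile_bins(cltv)
payment_codes, payment_mapping = label_encode(payment)
print(cltv_levels)
print(payment_codes, payment_mapping)
```

Keeping the returned mapping makes it possible to translate the encoded levels back into readable labels when the conditional probabilities are reported.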

4. Validate Causality with Conditional Probability

The variables hypothesized to have a causal relation with the target variable (based on the DAG) need to be validated using the quantified strength of the relationship, obtained by fitting the discretized data to the Bayesian network. This produces conditional probabilities for the event and the non-event, from which we can calculate the odds ratio (a measure of association between an exposure and an outcome) and the log-odds value. This quantified information helps us identify the true extent of the relationship and finalize the variables that can be termed causal factors for the event.
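To show how these quantities relate, the sketch below estimates conditional churn probabilities by counting and derives the odds ratio and log odds from them. The twenty rows are invented for the example, and plain counting stands in for fitting a Bayesian Network, so it illustrates the arithmetic rather than the fitted model.

```python
import math
from collections import Counter

def conditional_churn_probability(rows, factor):
    """Estimate P(churn = 1 | factor = level) for each level by counting."""
    totals, churned = Counter(), Counter()
    for row in rows:
        totals[row[factor]] += 1
        churned[row[factor]] += row["churn"]
    return {level: churned[level] / totals[level] for level in totals}

def odds_ratio(p_exposed, p_reference):
    """Odds of the event at one level relative to the odds at a reference level."""
    return (p_exposed / (1 - p_exposed)) / (p_reference / (1 - p_reference))

# Hypothetical discretized data: 10 customers per promo level.
rows = (
    [{"promo": "Low", "churn": 1}] * 6 + [{"promo": "Low", "churn": 0}] * 4
    + [{"promo": "High", "churn": 1}] * 2 + [{"promo": "High", "churn": 0}] * 8
)

probs = conditional_churn_probability(rows, "promo")
ratio = odds_ratio(probs["Low"], probs["High"])
log_odds = math.log(ratio)
print(probs, round(ratio, 2), round(log_odds, 2))
```

An odds ratio well above 1 (a positive log odds) means churn is more likely at the first level than at the reference level; a value near 1 means the factor makes little difference.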

Fitting the discretized data to the Bayesian Network and extracting the conditional probabilities was done using the code illustrated below.

The code provides the levels of the hypothesized variables along with the conditional probabilities, odds ratios, and log-odds values, which can be tabulated as below:

From the table, we can observe that customers who had high CLTV values (CLTV: High), were often included in promotional activities (#Promo_Inclusion: High), and were part of an earlier retention program (Earlier_Retention: Yes) had the least chance of churning. Moreover, customers who were included in promotional activities but had low CLTV values or were not part of an earlier retention program had a significantly higher propensity to churn. Furthermore, customers who were not sufficiently included in promotions also exhibited a higher propensity to discontinue shopping unless their CLTV was extremely high. Thus, we can safely infer that the hypothesized variables (CLTV, #Promo Inclusion, Earlier Retention) have a causal relationship with the target variable (Churn).

5. Outputs — Marginal Distributions:

The conditional probabilities obtained from the Bayesian Network serve as the basis for all outputs of the causal model and help us understand the relationship better. However, the conditional probabilities can also be aggregated to understand the relationship at individual levels, or at combinations of levels, of the causal factors. This extended capability helps in understanding the single and pairwise/group-wise joint impact of the causal factors on the event, and lays the foundation for process optimization through simulation.
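As a rough illustration of this aggregation, the sketch below sums one causal factor out of a conditional probability table to obtain the churn probability for a pair of levels of the other two. Every number is hypothetical, and Earlier_Retention is assumed to be independent of CLTV and #Promo_Inclusion purely to keep the example short.

```python
# Hypothetical P(Earlier_Retention); assumed independent of the other factors.
P_RETENTION = {"Yes": 0.3, "No": 0.7}

# Hypothetical P(churn | CLTV, #Promo_Inclusion, Earlier_Retention).
CHURN_CPT = {
    ("High", "High", "Yes"): 0.05, ("High", "High", "No"): 0.15,
    ("High", "Low", "Yes"): 0.20, ("High", "Low", "No"): 0.35,
    ("Low", "High", "Yes"): 0.30, ("Low", "High", "No"): 0.55,
    ("Low", "Low", "Yes"): 0.50, ("Low", "Low", "No"): 0.75,
}

def churn_given(cltv, promo):
    """P(churn | CLTV, promo), with Earlier_Retention summed out of the table."""
    return sum(P_RETENTION[r] * CHURN_CPT[(cltv, promo, r)] for r in P_RETENTION)

best_case = churn_given("High", "High")
worst_case = churn_given("Low", "Low")
print(round(best_case, 3), round(worst_case, 3))
```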

Marginal distributions can be obtained using the code snippet below.

The code returns the probability of a customer churning when that customer has a high CLTV value and was frequently included in promotional activities.

From the marginal distribution, it can be inferred that CLTV and Promo Inclusion have a strong impact on churn, since customers with high CLTV and frequent promotional inclusion have a lower chance of churning.

6. Outputs — What-If Analysis:

Based on the conditional and marginal probabilities estimated earlier, one can identify the sweet spot that optimizes the event of interest without a drastic change in the input levers of the causal factors, i.e., one can optimize the process by simulating scenarios. The simulator helps business stakeholders make more informed decisions and make their processes smarter. For our customer churn use case, the business gains more detailed information about how its decision to include a customer in promotions affects the chance of that customer continuing to shop with it, or how including a customer in earlier retention programs helps it avoid losing that customer.
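A scenario simulation of this kind can be sketched as a lookup over the lever settings. The table below is hypothetical, and reading its entries as the effect of changing a lever assumes that the listed factors capture the relevant causes of churn, which is the premise of the preceding validation step.

```python
from itertools import product

# Hypothetical P(churn | CLTV, #Promo_Inclusion, Earlier_Retention).
CHURN_CPT = {
    ("High", "High", "Yes"): 0.05, ("High", "High", "No"): 0.15,
    ("High", "Low", "Yes"): 0.20, ("High", "Low", "No"): 0.35,
    ("Low", "High", "Yes"): 0.30, ("Low", "High", "No"): 0.55,
    ("Low", "Low", "Yes"): 0.50, ("Low", "Low", "No"): 0.75,
}

def simulate(cltv):
    """Rank every (promo, retention) scenario for one CLTV level by churn probability."""
    scenarios = [
        {"promo": promo, "retention": ret, "p_churn": CHURN_CPT[(cltv, promo, ret)]}
        for promo, ret in product(("Low", "High"), ("No", "Yes"))
    ]
    return sorted(scenarios, key=lambda s: s["p_churn"])

def uplift(cltv, current, proposed):
    """Change in churn probability when moving from the current to the proposed levers."""
    return CHURN_CPT[(cltv, *proposed)] - CHURN_CPT[(cltv, *current)]

ranked = simulate("Low")
promo_only = uplift("Low", current=("Low", "No"), proposed=("High", "No"))
print(ranked[0], round(promo_only, 2))
```

Comparing the uplift of each single-lever change against the full ranking shows how much of the achievable reduction comes from the smallest adjustment.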

The code snippet for the What-If Analysis is illustrated below.

The causal variables finalized in the earlier stages can be used to build a model that assigns propensity-to-churn scores to customers. This enables the business to plan retention by adjusting the right business levers (#Promo_Inclusion, in our case). The performance of this model can be shown through AUC scores and TPR vs. FPR graphs.

The code snippet for this is illustrated below.

From the above charts and metrics, it is evident that the model built on the causal factors performs outstandingly, with an AUC value over 0.9 [17]. Note that AUC-ROC, often written AUROC (Area Under the Receiver Operating Characteristic curve), is one of the most important metrics for evaluating a classification model. The receiver operating characteristic (ROC) curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. AUC (Area Under the Curve) represents the degree of separability the model has achieved in distinguishing churners from non-churners.
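For readers who want to see how the curve and the area are computed, the sketch below builds the ROC points from scores and labels and integrates them with the trapezoidal rule. The six labels and scores are invented and are unrelated to the model results reported above.

```python
def roc_points(labels, scores):
    """(FPR, TPR) at every score threshold, from the strictest to the loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical churn labels (1 = churned) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

curve = roc_points(labels, scores)
area = auc(curve)
print(curve[-1], round(area, 3))
```

The area equals the share of churner and non-churner pairs in which the churner receives the higher score, which is why it is read as a measure of separability.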

Way Forward

Wipro’s Causal Exploration Grid (CEG) adds root cause identification, in the form of an explanation of the causal flow leading to the event of interest, to the standard AI/ML modeling approach, helping us achieve Next-Level Intelligence. It aims to describe the underlying relationships among variables, which gives intrinsic explainability of the process or event and thereby improves model performance. CEG can answer the earlier question about customer churn by pointing out the causal factors: Customer Lifetime Value, No. of Promotion Inclusions, and Earlier Retention. Of these, frequent promotional inclusion and an early retention program are levers the retailer can act on to improve the chance of retaining customers. Thus, the solutions prescribed by the Causal Exploration Grid are unbiased, fair, and actionable, and can enhance decision-making by relying on causal factors rather than merely correlated ones. CEG has a wide application base, since it can be used for any business use case that calls for smarter adjustments in lever-based decision-making.

References

[1] J. Wiles, “What’s New in Artificial Intelligence from the 2022 Gartner Hype Cycle,” 15 September 2022. [Online]. Available: https://www.gartner.com/en/articles/what-s-new-in-artificial-intelligence-from-the-2022-gartner-hype-cycle.

[2] J. Pearl, An Introduction to Causal Inference, CreateSpace Independent Publishing Platform, 2015.

[3] D. Rubin, “Estimating causal effects of treatments in randomized and nonrandomized studies,” Journal of Educational Psychology, no. 66, pp. 688–701, 1974.

[4] D. Rubin, “Bayesian inference for causal effects: the role of randomization,” Annals of Statistics, no. 6, pp. 34–58, 1978.

[5] P. Holland, “Statistics and causal inference,” Journal of the American Statistical Association, no. 81, pp. 945–960, 1986.

[6] J. Robins, “A new approach to causal inference in mortality studies with sustained exposure periods — application to control of the healthy worker survivor effect,” Mathematical Modelling, no. 7, p. 1393–1512, 1986.

[7] P. Spirtes, C. Glymour and R. Scheines, Causation, Prediction and Search, New York: Springer-Verlag, 1993.

[8] J. Pearl, “Causal diagrams for empirical research,” Biometrika, no. 82, p. 669–710, 1995.

[9] J. Pearl, Causality, Cambridge University Press, 2000.

[10] F. Dominici, F. J. Bargagli-Stoffi and F. Mealli, “From Controlled to Undisciplined Data: Estimating Causal Effects in the Era of Data Science Using a Potential Outcome Framework,” Harvard Data Science Review, vol. 3, no. 3, 2021.

[11] M. Molina and F. Rowland, “Stratospheric Sink for Chlorofluoromethanes: Chlorine Atom-Catalysed Destruction of Ozone,” Nature, no. 249, pp. 810–812, 1974.

[12] E. Sober, “The principle of the common cause,” in Probability and Causation: Essays in Honor of Wesley Salmon, J. Fetzer, Ed., pp. 211–229, 1987.

[13] X. Zheng, B. Aragam, P. Ravikumar and E. P. Xing, “DAGs with NO TEARS: Continuous Optimization for Structure Learning,” arXiv, 2018.

[14] P. Beaumont and B. Horsburgh, “What is CausalNex?,” [Online]. Available: https://causalnex.readthedocs.io/en/0.4.2/05_resources/05_faq.html.

[15] K. Acquah, “Structural Causal Models,” 27 May 2020. [Online]. Available: https://www.causalflows.com/structural-causal-models/. [Accessed 03 Feb 2023].

[16] P. Beaumont and B. Horsburgh, “CausalNex User Guide,” [Online]. Available: https://causalnex.readthedocs.io/en/0.4.2/04_user_guide/04_user_guide.html.

[17] D. Hosmer and S. Lemeshow, “Chapter 5,” in Applied Logistic Regression, New York, NY, John Wiley and Sons, 2000, pp. 160–164.