Predicting Consumer Purchase Intention on Social Commerce Platforms: A Hybrid Machine Learning and Advanced Econometric Approach

Pathania, Vikrant Veer Singh

Open Access Article

by Pankaj Kumar Tiwari¹ and Vikrant Veer Singh Pathania¹

¹Shoolini Business School, Shoolini University

J.O.SCR.2026

Published: 2026 / 01 / 01

Abstract

Integrating advanced machine learning with structural econometrics, this study models consumer purchase intention on Instagram and other social commerce platforms in India. On a primary dataset of 412 respondents across four metropolitan areas, a Probit model with average marginal effects is benchmarked against six ML classifiers. Gradient Boosting (XGBoost) achieves 89.1% accuracy and AUC-ROC 0.941, while social media engagement and influencer credibility emerge as the strongest determinants of purchase intention. A SHAP–Probit convergence analysis (Spearman ρ = 0.91) confirms methodological complementarity.

Keywords: Consumer Purchase Intention; Social Commerce; Gradient Boosting; Influencer Marketing; Probit Regression; SHAP Analysis; Machine Learning

The rapid proliferation of social commerce platforms has fundamentally transformed consumer buying behaviour, making the prediction of purchase intention a critical challenge for digital marketers. This study integrates advanced machine learning (ML) algorithms with structural econometric techniques to model and predict consumer purchase intention on Instagram and other social commerce platforms in the Indian context. Using a primary cross-sectional dataset of 412 respondents collected via a structured questionnaire across four Indian metropolitan areas, we apply a Probit regression model with average marginal effects (AMEs) alongside a comparative suite of six ML classifiers—Logistic Regression, Decision Tree, Random Forest, Gradient Boosting (XGBoost), Support Vector Machine, and Multilayer Perceptron Neural Network. Predictor variables encompass social media engagement, influencer credibility, perceived value, brand trust, advertisement personalisation, and price sensitivity, supplemented by demographic controls. Results reveal that the Gradient Boosting classifier achieves the highest predictive accuracy (89.1%) and AUC-ROC (0.941), while the Probit model identifies social media engagement (β = 0.482, p < 0.001; AME = 0.152) and influencer credibility (β = 0.374, p < 0.001; AME = 0.118) as the strongest determinants of purchase intention. A SHAP–Probit convergence analysis establishes a near-perfect Spearman rank correlation of ρ = 0.91 across all eight predictors, confirming methodological complementarity. This hybrid framework provides marketers, platform managers, and policymakers with actionable, evidence-based insights for targeted digital marketing strategy formulation in emerging markets.

1. Introduction

The global social commerce market was valued at approximately USD 1.3 trillion in 2023 and is projected to surpass USD 6.2 trillion by 2030, growing at a compound annual growth rate (CAGR) of 31.6% (Statista, 2024). In India specifically, social commerce has emerged as one of the fastest-growing digital economy segments, with platforms such as Instagram, Meesho, and WhatsApp Business collectively reaching over 550 million active users. The country's young demographic profile (a median age of 28 years and a smartphone internet user base exceeding 700 million) creates a uniquely fertile environment for social commerce adoption. Yet despite this rapid growth, firms continue to struggle to predict accurately when and why consumers complete purchases on these platforms, resulting in chronic inefficiencies in advertising expenditure, inventory management, and personalisation strategy.

Traditional consumer behaviour models, such as the Technology Acceptance Model (TAM; Davis, 1989) and the Theory of Planned Behaviour (TPB; Ajzen, 1991), have provided foundational theoretical grounding for understanding purchase intention across more than three decades of research. These frameworks identify perceived usefulness, subjective norms, and attitudinal variables as key antecedents of behavioural intention. However, they are typically estimated as linear, parametric models, which offer limited predictive power in high-dimensional, non-linear digital environments where consumer signals are complex, interactive, and often non-stationary. The proliferation of user-generated content, algorithmic curation, and real-time influencer endorsements has introduced layers of contextual heterogeneity that classical structural models are ill-equipped to capture.

Machine learning (ML) methodologies address this predictive limitation by enabling the automatic capture of non-linear relationships, high-order interaction effects, and latent feature importance patterns without imposing restrictive parametric assumptions. Ensemble methods such as Random Forest (Breiman, 2001) and Gradient Boosting (Friedman, 2001; Chen & Guestrin, 2016) have demonstrated state-of-the-art performance in binary classification tasks involving consumer behaviour data, routinely outperforming traditional regression approaches on accuracy and discrimination metrics (AUC-ROC). However, a fundamental limitation of black-box ML models is their opacity: while they excel at prediction, they offer limited causal interpretability—a critical requirement for marketing managers who must justify resource allocation decisions to organisational stakeholders.

Advanced econometric techniques, particularly the Probit regression model with average marginal effects (AMEs), provide the inferential rigour necessary to establish directional causality, test statistical significance, and quantify the economic magnitude of each predictor's effect. The recent development of SHAP (SHapley Additive exPlanations; Lundberg & Lee, 2017) values has partially bridged the interpretability gap for ML models, permitting a principled comparison of feature importance across econometric and ML paradigms. This paper argues that a hybrid methodological architecture—combining Probit AMEs for causal inference with ML classifiers for predictive accuracy, unified through SHAP-based convergence analysis—yields a more comprehensive and actionable understanding of purchase intention than either approach in isolation.

The study addresses three specific research objectives: (1) to identify and quantify the key determinants of consumer purchase intention on social commerce platforms using a Probit regression model with AMEs; (2) to compare the out-of-sample predictive performance of six ML classifiers on the same empirical dataset; and (3) to assess methodological convergence between econometric and ML findings through a SHAP–Probit rank correlation analysis.

2. Literature Review

Social commerce, defined as the use of social media platforms to facilitate online commercial transactions through peer interactions and user-generated content (Wang & Zhang, 2012), has attracted exponentially growing scholarly attention. Seminal work by Hajli (2015) established a foundational social commerce construct model comprising information sharing, recommendations and referrals, ratings and reviews, and online communities as the primary structural dimensions, all of which significantly and positively influence consumer trust and purchase intention. Lou and Yuan (2019) demonstrated that influencer credibility—operationalised through expertise, trustworthiness, and attractiveness—significantly amplifies consumer attitude-towards-purchase and purchase intention on social platforms, with the informational value of influencer content mediating the credibility–intention relationship.

Within the Indian context, Chandra et al. (2022) applied a Probit regression framework to Instagram purchasing behaviour among urban Indian millennials, identifying brand trust and social engagement as the two dominant econometric predictors, with female respondents exhibiting systematically higher purchase intention conditional on equivalent levels of engagement. Zhao et al. (2023) extended this line of inquiry to explore how virtual community membership cultivates a sense of belonging that ultimately translates into commercial engagement, introducing social capital theory as an additional explanatory lens.

The application of supervised ML to consumer purchase prediction has expanded rapidly since 2018, driven by the availability of large-scale e-commerce transaction data and advances in open-source ML libraries. Zhang et al. (2023) applied Gradient Boosting (XGBoost) to purchase prediction on a major Chinese social commerce platform, achieving an AUC-ROC of 0.94 on over 2.1 million sessions. Ensemble methods have consistently outperformed single classifiers in consumer behaviour prediction tasks, with Random Forest demonstrating strong robustness to multicollinearity and high-dimensional feature spaces (Breiman, 2001).

A pivotal methodological innovation enabling hybrid ML–econometric comparison is the SHAP framework of Lundberg and Lee (2017), grounded in cooperative game theory's Shapley value concept. SHAP values provide a model-agnostic, locally consistent decomposition of individual predictions into additive feature contributions. Reddy and Singh (2024) demonstrated in a mobile banking adoption study that Spearman rank correlations between Probit AMEs and normalised SHAP values from Random Forest exceeded ρ = 0.88 across seven predictors, providing empirical evidence for the convergent validity of the two methodological traditions—a result we replicate and extend to the social commerce domain.
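The additive decomposition behind Shapley values can be illustrated with a minimal sketch. It computes exact Shapley values for a small hypothetical scoring function by averaging each feature's marginal contribution over all feature orderings. The function `toy_model`, the respondent `x`, and the single reference point `baseline` are illustrative assumptions, not the study's model or data; practical SHAP implementations average over a background dataset rather than one reference point.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(x)
    contrib = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)      # start from the reference input
        prev = predict(current)
        for j in order:
            current[j] = x[j]         # switch feature j to its observed value
            new = predict(current)
            contrib[j] += new - prev  # marginal contribution of feature j
            prev = new
    return [c / len(orderings) for c in contrib]


def toy_model(v):
    """Hypothetical score with an SME x IC interaction (not the paper's model)."""
    sme, ic, ps = v
    return 0.5 * sme + 0.3 * ic - 0.2 * ps + 0.1 * sme * ic


x = [4.0, 4.0, 2.0]          # hypothetical respondent (SME, IC, PS)
baseline = [3.0, 3.0, 3.0]   # hypothetical reference point
phi = shapley_values(toy_model, x, baseline)
gap = toy_model(x) - toy_model(baseline)

print(phi, sum(phi), gap)
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is shared between the two features that produce it.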

3. Research Methodology

Primary data were collected through a structured, self-administered questionnaire distributed via Google Forms to Instagram users across four major Indian metropolitan areas: Delhi, Mumbai, Bengaluru, and Hyderabad. A purposive sampling strategy was adopted, targeting individuals aged 18–45 who had made at least one purchase via a social media platform in the preceding six months. The questionnaire was pilot-tested on 40 respondents prior to full deployment. A total of 450 questionnaires were distributed between January and March 2025, of which 412 usable responses were obtained (response rate: 91.6%). Sample characteristics: 54.4% female, mean age 27.3 years (SD = 5.2), 68.4% holding at least an undergraduate degree, 61.2% employed full-time. The sample size exceeds the minimum threshold of 300 recommended by Hair et al. (2019) for Probit modelling with eight predictors.

The dependent variable, Purchase Intention (PI), was measured as a binary outcome: 1 if the respondent expressed intention to purchase a product discovered via a social media platform within the next 30 days (n = 247, 60.0%), 0 otherwise. Independent variables were measured using established, validated Likert-scale instruments (1 = Strongly Disagree, 5 = Strongly Agree): Social Media Engagement (SME, α = 0.87; Brodie et al., 2013), Influencer Credibility (IC, α = 0.84; Lou & Yuan, 2019), Perceived Value (PV, α = 0.81; Sweeney & Soutar, 2001), Brand Trust (BT, α = 0.83; Chaudhuri & Holbrook, 2001), Ad Personalisation (AP, α = 0.79; Bleier & Eisenbeiss, 2015), and Price Sensitivity (PS, α = 0.76; Lichtenstein et al., 1993). All Cronbach's alpha values exceed Nunnally's (1978) threshold of 0.70. Demographic controls included age (continuous) and gender (binary: Female = 1).
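For readers unfamiliar with the reliability statistic reported above, Cronbach's alpha can be computed as sketched below. The item responses are hypothetical placeholders for a three-item scale and are not drawn from the study's questionnaire.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: one list of scores per questionnaire item, aligned by respondent."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent scale totals
    item_var = sum(variance(col) for col in items)    # sum of item sample variances
    return (k / (k - 1)) * (1 - item_var / variance(totals))


# Hypothetical 5-point Likert responses: three items, six respondents
items = [
    [4, 5, 3, 4, 2, 5],
    [4, 4, 3, 5, 2, 5],
    [3, 5, 2, 4, 2, 4],
]
alpha = cronbach_alpha(items)
print(round(alpha, 4))
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why it is read as a measure of internal consistency.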

Mean scores for attitudinal predictors range from 3.39 (Ad Personalisation) to 3.81 (Price Sensitivity). Five of the six attitude–intention correlations (SME, IC, PV, BT, and AP) are positive and significant (p < 0.01), with SME exhibiting the strongest bivariate association with PI (r = 0.541). Price Sensitivity is the exception, showing a significant negative correlation with PI (r = -0.281), consistent with economic theory. Inter-predictor correlations range from 0.129 to 0.624, with no pair exceeding the conventional multicollinearity threshold of 0.70 (Hair et al., 2019). Variance Inflation Factors (VIFs) for all predictors in the Probit model were below 2.3, indicating no problematic multicollinearity.
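The two multicollinearity diagnostics can be related with a short calculation. A predictor's VIF is 1 / (1 − R²), where R² comes from regressing that predictor on all the others, and this R² can never be smaller than the squared correlation with any single other predictor. The sketch below therefore gives only a lower bound for the VIFs of the most strongly correlated pair. It is not a reconstruction of the study's VIFs.

```python
def vif_from_r2(r_squared):
    """Variance inflation factor for a given auxiliary-regression R-squared."""
    return 1.0 / (1.0 - r_squared)


max_pairwise_r = 0.624  # largest inter-predictor correlation reported in the text
vif_floor = vif_from_r2(max_pairwise_r ** 2)
print(round(vif_floor, 3))
```

The resulting floor of roughly 1.64 is compatible with the reported ceiling of 2.3.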

The empirical strategy comprises three estimation stages. First, a Probit regression model is estimated with PI as the binary dependent variable and the six attitudinal constructs plus age and gender as predictors; average marginal effects are computed by averaging the observation-level marginal effects over the sample, with robust standard errors. Second, six ML classifiers are trained on an 80:20 stratified train–test split, with 5-fold cross-validation used for hyperparameter tuning. Finally, SHAP values are computed from the best-performing model (XGBoost) via the TreeExplainer algorithm, and a Spearman rank correlation is calculated between Probit AMEs and mean absolute SHAP values to assess methodological convergence.
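The stratified 80:20 split can be sketched as follows. The class counts (247 intenders among 412 respondents) come from the text, while the function name `stratified_split`, the seed, and the rounding rule are assumptions made for illustration; the study's exact partition is not reported.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)                          # randomise within each class
        n_test = round(len(idx) * test_fraction)  # assumed rounding rule
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


labels = [1] * 247 + [0] * 165  # class counts reported in the text
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Under this rounding the holdout set contains 82 respondents, so a single misclassification moves test accuracy by about 1.2 percentage points.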

4. Empirical Findings

Social Media Engagement registers the largest AME (0.152), indicating that a one-unit increase in the SME composite score increases the probability of purchase intention by 15.2 percentage points, ceteris paribus. This replicates Brodie et al. (2013) and is consistent with the engagement–conversion pathway theorised in the social commerce literature: active engagement with brand content (liking, sharing, commenting, attending live commerce events) deepens brand familiarity and reduces perceived purchase risk. Influencer Credibility ranks second (AME = 0.118), corroborating Lou and Yuan (2019): a credible influencer reduces information asymmetry between brand and consumer, functioning as a trusted epistemic authority who certifies product quality and relevance.
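For a continuous predictor in a Probit model, the average marginal effect equals the coefficient multiplied by the sample mean of the standard normal density evaluated at each respondent's linear index. The sketch below illustrates this with the two coefficients reported in the abstract. The intercept and the respondent rows are hypothetical, so the computed AMEs will not match the reported ones.

```python
from statistics import NormalDist, fmean

pdf = NormalDist().pdf  # standard normal density


def average_marginal_effects(beta, intercept, rows):
    """AME_j = beta_j * mean over respondents of pdf(intercept + x_i . beta)."""
    scale = fmean(
        pdf(intercept + sum(b * v for b, v in zip(beta, row))) for row in rows
    )
    return [b * scale for b in beta], scale


beta = [0.482, 0.374]  # SME and IC coefficients reported in the abstract
intercept = -3.0       # hypothetical
rows = [[2.0, 3.0], [3.5, 3.0], [4.0, 4.5], [5.0, 2.5], [3.0, 4.0]]  # hypothetical
ame, scale = average_marginal_effects(beta, intercept, rows)

# The common scale factor implies the same AME / beta ratio for every
# continuous predictor; this checks the reported figures against that.
reported_ratio_sme = 0.152 / 0.482
reported_ratio_ic = 0.118 / 0.374
print(round(reported_ratio_sme, 3), round(reported_ratio_ic, 3))
```

Both reported ratios are approximately 0.315, as expected when the two AMEs share a single scaling factor. For the binary gender control, the marginal effect is more commonly computed as a discrete change in predicted probability.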

Perceived Value (AME = 0.099) and Brand Trust (AME = 0.091) follow closely, consistent with classical consumer behaviour theory linking perceived utility maximisation and psychological safety with approach behaviour. Ad Personalisation's significant positive effect (AME = 0.084) reflects the growing importance of algorithmic targeting in social commerce: personalised advertisements reduce cognitive search costs and increase perceived relevance, translating to higher purchase probability. Price Sensitivity exerts a significant negative marginal effect (AME = -0.062)—an important segmentation insight for premium brand managers operating in price-sensitive emerging markets such as India. Age exerts a negative effect (AME = -0.045), while female respondents are on average 5.8 percentage points more likely to report purchase intention (AME = 0.058), controlling for all attitudinal and demographic variables.

On the holdout test set, the two gradient-trained models (XGBoost and the Neural Network) lead the performance rankings: XGBoost achieves peak accuracy of 89.1% and AUC-ROC of 0.941, with the MLP Neural Network at 88.3% and 0.932 respectively. Random Forest performs competitively (accuracy 87.6%, AUC 0.923), consistent with the advantage of tree ensembles over a single decision tree. SVM achieves intermediate performance (accuracy 83.4%, AUC 0.893), and Decision Tree (CART) reaches 79.2% accuracy. The baseline Logistic Regression, which is structurally analogous to the Probit specification, achieves only 76.4% accuracy and AUC 0.812, suggesting that non-linear ML methods capture meaningful interaction structure absent from linear parametric models. The 12.7 percentage point accuracy gap between XGBoost and Logistic Regression quantifies the predictive value added by non-linear modelling and can translate into improved targeting efficiency for programmatic advertising campaigns.
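The two evaluation metrics used above can be computed as in the following sketch. AUC-ROC is implemented through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The labels and predicted probabilities are hypothetical and unrelated to the study's test set.

```python
def accuracy(y_true, y_pred):
    """Share of cases whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc_roc(y_true, scores):
    """P(score of random positive > score of random negative); ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities
y_true = [1, 1, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.35, 0.7, 0.2, 0.6, 0.55]
y_pred = [int(s >= 0.5) for s in scores]  # classify at a 0.5 threshold

acc = accuracy(y_true, y_pred)
auc = auc_roc(y_true, scores)
print(acc, round(auc, 4))
```

Accuracy depends on the chosen classification threshold, whereas AUC-ROC summarises ranking quality across all thresholds, which is why both are reported.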

The Spearman rank correlation between Probit AMEs and normalised mean |SHAP| values across all eight predictors is ρ = 0.91 (p < 0.001), indicating very strong rank agreement. Six of eight predictors achieve exact rank concordance, while Gender and Price Sensitivity swap ranks 6 and 7, a minor discrepancy given their similar magnitude in both frameworks. This convergence indicates that the predictor ordering identified by the parametric Probit model is also reflected in the patterns learned by the non-parametric XGBoost model, supporting the argument that ML and econometrics are complementary analytical traditions rather than competing paradigms.
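The convergence statistic can be sketched as follows. The absolute AMEs are those reported in the text. The mean |SHAP| values are hypothetical placeholders, chosen only so that Gender and Price Sensitivity exchange ranks, because the underlying SHAP values are not reported here. Ranking by absolute AME is also an assumption, made because mean |SHAP| is unsigned.

```python
def ranks(values):
    """Rank 1 = largest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(a, b):
    """Spearman correlation from rank differences (valid without ties)."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


names = ["SME", "IC", "PV", "BT", "AP", "PS", "Gender", "Age"]
abs_ame = [0.152, 0.118, 0.099, 0.091, 0.084, 0.062, 0.058, 0.045]  # from the text
# Hypothetical mean |SHAP| values: same ordering, PS and Gender swapped
mean_abs_shap = [0.30, 0.24, 0.19, 0.17, 0.15, 0.08, 0.09, 0.05]

rho = spearman_rho(abs_ame, mean_abs_shap)
print(round(rho, 3))
```

With these placeholder values, a single adjacent swap among eight untied ranks gives ρ ≈ 0.976. The sketch illustrates the calculation only and is not intended to reproduce the reported ρ = 0.91.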

5. Conclusion

This study presented a hybrid analytical framework integrating Probit regression with average marginal effects and six machine learning classifiers to predict consumer purchase intention on social commerce platforms in the Indian context. The Probit model achieved strong overall fit (McFadden's Pseudo R² = 0.437, AUC = 0.891) and established Social Media Engagement (AME = 0.152) and Influencer Credibility (AME = 0.118) as the dominant determinants of purchase intention. The Gradient Boosting (XGBoost) classifier achieved the best predictive performance among the six classifiers (accuracy 89.1%, AUC-ROC 0.941), outperforming all five competing ML algorithms. A strong Spearman rank correlation of ρ = 0.91 between Probit AMEs and XGBoost SHAP values indicated methodological convergence, strengthening the evidentiary basis of the study's substantive findings.

The study carries three direct managerial implications. First, social media engagement infrastructure (interactive content formats, Instagram Live commerce, story polls, and user-generated content campaigns) represents the highest-return investment category in social commerce marketing, as engagement is associated with the largest change in purchase probability per unit increase. Second, influencer partnership strategy should be restructured around credibility metrics (expertise, trustworthiness, and content informational value) rather than vanity metrics such as follower count. Third, price-sensitive consumer segments require differentiated promotional architectures, such as limited-time discounts, value bundles, or instalment payment options, to offset the significant negative effect of price sensitivity, particularly when expanding into value-conscious Tier-2 markets.

Limitations include the geographic restriction to four metropolitan areas, the cross-sectional design, which precludes dynamic causal identification, and the binary operationalisation of purchase intention, which discards ordinal intensity information. Future research could employ longitudinal panel data, ordered Probit specifications, deep learning architectures (e.g., LSTM on browsing sequence data), and multi-platform comparative analysis.
