Applications of Big Data Analytics in Tax Compliance Monitoring: A Case Study of Rwanda’s Value-Added Tax

Table 1.

Comparison of models based on different score metrics

	Model	Precision score	Recall score	F1 score	Accuracy	Log loss
1	Logistic Regression	0.892	0.902	0.893	0.894	0.246
2	Random Forest	0.840	0.830	0.834	0.841	0.886
3	Decision Tree	0.843	0.827	0.832	0.841	3.769

Logistic regression achieves a high precision score of 0.892, indicating that most of the cases it flags as under-reporting are correct. Its recall score of 0.902 shows that it captures a substantial portion of the actual under-reporting cases. The F1 score of 0.893 balances precision and recall, suggesting well-rounded performance. The model’s overall accuracy of 0.894 shows that it makes correct predictions across both classes, and its relatively low log loss of 0.246 indicates good calibration of the predicted probabilities.
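The four classification scores in Table 1 all derive from confusion-matrix counts. As a minimal sketch, assuming "positive" means under-reporting and using illustrative counts that are not taken from the study's data (the helper name `classification_scores` is hypothetical), they can be computed as follows:

```python
def classification_scores(tp, fp, fn, tn):
    """Return precision, recall, F1 and accuracy from confusion-matrix counts.

    tp, fp, fn, tn are the counts of true positives, false positives,
    false negatives and true negatives ("positive" = under-reporting).
    """
    precision = tp / (tp + fp)   # share of flagged taxpayers that truly under-report
    recall = tp / (tp + fn)      # share of true under-reporters that get flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    accuracy = (tp + tn) / (tp + fp + fn + tn)           # share of all cases classified correctly
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Illustrative counts only (hypothetical, not from the paper).
example_scores = classification_scores(tp=90, fp=10, fn=10, tn=90)
print(example_scores)
```

If the scores in Table 1 are averaged over both classes, which is a common reporting choice, the reported F1 score need not equal the harmonic mean of the reported precision and recall.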

The random forest model demonstrates competitive performance, with a precision score of 0.840, a recall score of 0.830, an F1 score of 0.834, and an accuracy of 0.841. Although these scores are somewhat lower than those of logistic regression, the random forest still maintains robust predictive capabilities. Its log loss of 0.886 is higher than that of logistic regression but remains within an acceptable range, implying reasonable confidence in the predicted probabilities.

The decision tree model performs at a similar level to the random forest: its precision (0.843) is slightly higher than that of the random forest, but its recall (0.827) and F1 score (0.832) are the lowest of the three models. Its accuracy of 0.841 matches that of the random forest. However, the much higher log loss of 3.769 suggests potential issues with the model’s calibration or probability estimates.
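Log loss can diverge sharply between models with similar accuracy because it penalizes confident wrong probabilities rather than wrong labels. The sketch below is an illustration under stated assumptions, not the authors' code: the labels and probabilities are hypothetical, and clipping probabilities at `eps` is a common convention. It compares an uninformative predictor, a model with graded probabilities, and a model that only outputs 0 or 1, as an unpruned decision tree often does.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]                 # hypothetical labels: 1 = under-reporting
uninformative = [0.5, 0.5, 0.5, 0.5]  # always predicts 50/50
graded = [0.9, 0.2, 0.4, 0.1]         # graded probabilities, one mild miss
hard = [1.0, 0.0, 0.0, 0.0]           # 0/1 outputs, one fully confident miss

loss_uninformative = log_loss(labels, uninformative)  # equals ln 2
loss_graded = log_loss(labels, graded)
loss_hard = log_loss(labels, hard)
print(loss_uninformative, loss_graded, loss_hard)
```

At a 0.5 threshold the graded and hard predictors make the same single mistake and so have identical accuracy, yet the hard predictor's log loss is far larger. For reference, the always-0.5 predictor scores ln 2 ≈ 0.693 on any binary problem.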

To assess how the models perform at different classification thresholds, we use the area under the receiver operating characteristic curve (AUC-ROC), as shown in Figure 1. The AUC is a single value that summarizes the performance of a model across all possible classification thresholds. All three models score highly on this measure, with random forest performing slightly better than the other two.
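One way to see why the AUC is threshold-free is its ranking interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The following sketch computes it directly from that definition; the labels and scores are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = under-reporting) and model scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of under-reporting cases above compliant ones.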

Figure 1.

Area under receiver operating characteristic curve (ROC AUC).

In summary, the logistic regression model performs best overall: it has the highest precision, recall, F1 score, and accuracy, a high AUC-ROC, and the lowest log loss, making it particularly suitable for detecting and predicting VAT under-reporting.

4.1 Feature importance

To understand which variables are most influential on the outcome of the model, we measure the weight or contribution of each input variable (feature) to the model’s predictive performance. Understanding feature importance is crucial for interpreting model decisions and for identifying the most influential factors to watch when monitoring the model.

Based on the top 10 features listed in Table 2, the most influential factor in VAT under-reporting is ‘exports/imports’, indicating that businesses involved in import or export are more likely to under-report their VAT. Following closely is ‘domestic sales’, showing that local businesses are also prone to under-reporting. The ‘age of business’ is also a significant factor, signifying that businesses with fewer years in operation are more likely to under-report their VAT.
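The text does not state which importance measure was used. As an illustration only, the sketch below shows permutation importance, a model-agnostic approach that measures how much accuracy drops when one feature's values are shuffled. The toy data, the stand-in `toy_model`, and the function names are all hypothetical.

```python
import random


def accuracy(model, X, y):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when one column at a time is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (hypothetical): column 0 drives the label, column 1 is noise.
X_toy = [[i % 2, (i // 2) % 2] for i in range(40)]
y_toy = [row[0] for row in X_toy]


def toy_model(row):
    """Stand-in for a fitted classifier; it only looks at column 0."""
    return row[0]


toy_importances = permutation_importance(toy_model, X_toy, y_toy)
print(toy_importances)
```

In this toy setting the informative column receives a positive importance and the noise column receives exactly zero, because shuffling a feature the model ignores cannot change its predictions.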

Table 2.

Top 10 features related to VAT under-reporting

	Feature name	Description	Importance
1	Exports/imports	Businesses that import or export	0.3501
2	Domestic sales	Local businesses	0.3042
3	Age of business	Time from registration to audit	0.161
4	Taxpayer type	Individual businesses	0.0195
5	Sector G	Wholesale and retail trade; repair of motor vehicles and motorcycles sector	0.0147
6	Size of business	Small businesses	0.0131
7	Taxpayer type	Non-individual businesses	0.0111
8	Sector F	Construction sector	0.0109
9	Sector H	Transportation and storage sector	0.0109
10	Province of operations	Businesses operating from Kigali city	0.0083

Additionally, the model identifies several other factors associated with VAT under-reporting, including businesses registered as individuals, those in the wholesale and retail trade, repair of motor vehicles and motorcycles sector, small businesses, non-individual businesses, the construction sector, the transportation and storage sector, and businesses operating from Kigali City. These factors play an essential role in influencing business behavior; hence, they should be considered when selecting risk cases for auditing in order to make the audit more time- and cost-efficient.
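The importance scores in Table 2 are heavily concentrated in the first three features. A quick tally, using the values exactly as listed in the table and assuming they are on a common scale, shows how much weight the top three carry relative to the remaining seven (the shortened feature labels are for readability only):

```python
# Importance values copied from Table 2 (top 10 features).
table2_importance = [
    ("Exports/imports", 0.3501),
    ("Domestic sales", 0.3042),
    ("Age of business", 0.161),
    ("Taxpayer type: individual", 0.0195),
    ("Sector G", 0.0147),
    ("Size of business: small", 0.0131),
    ("Taxpayer type: non-individual", 0.0111),
    ("Sector F", 0.0109),
    ("Sector H", 0.0109),
    ("Province: Kigali city", 0.0083),
]

top3_share = sum(value for _, value in table2_importance[:3])
remaining_share = sum(value for _, value in table2_importance[3:])
print(f"top 3: {top3_share:.4f}, remaining 7: {remaining_share:.4f}")
```

The first three features sum to about 0.82, while each of the other seven contributes less than 0.02, so the additional factors act as secondary signals when ranking cases for audit.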

4.2 Impact of this study and policy implication

The sample used in this study covers over 2,260 taxpayers audited on VAT for the period 2014–2019, with a total under-reported amount of around 66.8 billion Rwandan francs. If the RRA employs logistic regression, the best-performing model overall, it can be expected to correctly identify roughly 89% of under-reporting cases (precision 0.892, recall 0.902). Applied to the sample total, the 89.2% rate implies that the RRA could recover around 59.5 billion Rwandan francs by accurately pinpointing the under-reported cases. On an annual basis, the model could potentially recover more than 12 billion Rwandan francs per year that would otherwise be lost to VAT under-reporting. Moreover, the approach is not limited to VAT: it can be used to detect other forms of non-compliance and tax fraud, even within large and complex datasets.

The tool should be incorporated into the audit selection process gradually, through a comparative performance evaluation. In this setup, the model runs in parallel with the existing selection process so that its results can be compared with those of the existing process, and the model is improved based on the audit findings. Gradual integration ensures a data-driven approach to identifying high-risk taxpayers. This will likely lead to a more targeted and effective use of audit resources, reducing the burden on compliant taxpayers while focusing efforts on areas with the highest potential for non-compliance.
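The recoverable amount quoted above follows from scaling the sample total by the model's score. The sketch below reproduces that arithmetic under an assumption the text leaves implicit, namely that correctly identified cases carry a proportional share of the under-reported amount. The variable names are illustrative.

```python
total_under_reported = 66.8   # billion Rwandan francs, 2014-2019 sample (from the text)
identification_rate = 0.892   # logistic regression score used in the text

# Assumption: identified cases carry a proportional share of the total amount.
identified_amount = total_under_reported * identification_rate
print(f"{identified_amount:.2f} billion Rwandan francs")
```

The result is a little under 60 billion Rwandan francs, in line with the figure of around 59.5 billion quoted in the text.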
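A parallel run of this kind can be summarized by comparing the hit rate of each selection channel, that is, the share of completed audits that uncovered under-reporting. The following is a minimal sketch with hypothetical channel names and made-up audit outcomes. It is not a description of any existing RRA process.

```python
from collections import defaultdict


def hit_rate_by_channel(audits):
    """Share of completed audits that found under-reporting, per selection channel.

    `audits` is an iterable of (channel, found_under_reporting) pairs.
    """
    counts = defaultdict(lambda: [0, 0])   # channel -> [hits, total audits]
    for channel, found in audits:
        counts[channel][0] += int(found)
        counts[channel][1] += 1
    return {channel: hits / total for channel, (hits, total) in counts.items()}


# Hypothetical audit outcomes from a parallel run (not real data).
parallel_audits = [
    ("model", True), ("model", True), ("model", False), ("model", True),
    ("existing", True), ("existing", False), ("existing", False), ("existing", True),
]
channel_rates = hit_rate_by_channel(parallel_audits)
print(channel_rates)
```

Tracking these rates over successive audit cycles gives a simple basis for deciding when, and how far, to shift case selection toward the model.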

As the model is implemented, it is crucial to establish a policy framework that allows for continuous improvement. This could involve regular reviews of the model’s performance, retraining the model based on audit findings, and the incorporation of new data sources to refine its predictive accuracy. It is also beneficial to consider the legal and ethical implications to ensure that the use of such models complies with existing laws and respects taxpayer rights, while also being transparent about how these tools are used in the audit selection process.

4.3 Limitations

While this study provides valuable insights into the detection of under-reporting factors in VAT data using machine learning models, it is essential to acknowledge certain limitations that may affect the generalizability of the findings. One notable constraint is the reliance on VAT audit data spanning 2014 to 2019. The absence of more recent data is a limitation, as business behaviors and under-reporting methods are likely to change over time in response to factors such as policy changes and amendments to the law. Up-to-date information would offer a more accurate representation of current patterns and behaviors related to under-reporting.

Despite these limitations, the research lays a foundation for understanding and addressing VAT under-reporting using machine learning in tax administration in general, starting with the RRA.

5. Conclusion and Future Work

This paper has examined tax compliance monitoring using advanced machine learning models, motivated by challenges such as persistent under-reporting by a large number of taxpayers and the resulting loss of revenue. The focus on VAT is justified because VAT contributes a large share of the country’s overall tax revenue.

The study’s contribution lies in the application of advanced analytical techniques, specifically machine learning models, to detect and predict factors associated with under-reporting in VAT data. The emphasis on time and audit cost efficiency underscores the potential of these models to make audits more efficient by providing auditors with advanced knowledge and patterns for targeted audits of likely under-reported taxpayers.

The findings show that the models can identify factors associated with under-reporting from past audit data, and can therefore be used to tell tax authorities whether a given taxpayer is likely to be under-reporting on their VAT declaration. Among the models used, logistic regression emerged as particularly noteworthy for its balanced precision, recall, and accuracy, although the random forest model also demonstrated competitive performance, affirming the potential of ensemble learning for future work.

In the future, the authors plan to incorporate this machine learning model into the everyday tools used in VAT reporting monitoring. This will include more testing and, eventually, the deployment of the model in the form of a dashboard that will be accessible to everyone in charge of VAT reporting compliance, including auditors.

Footnotes

1

Logistic regression is a model used mostly for binary classification tasks, where the goal is to predict the probability of an instance belonging to a particular class (Hilbe 2011). The model is used in many domains because it is simple, interpretable, and effective in handling binary outcomes.

2

A decision tree is a flexible machine learning model that can be used for both classification and regression tasks. Decision trees are known for their interpretability and ease of visualization, allowing users to understand the decision-making process intuitively. Because these models are prone to overfitting, pruning is a common strategy for improving generalization performance (Charbuty and Abdulazeez 2021).

3

Random forest is a machine learning model based on decision trees. Random forest is widely appreciated for its ability to handle complex relationships in data, maintain interpretability to some extent, and provide high predictive accuracy. It is a versatile algorithm applicable to various tasks, making it a popular choice in machine learning for both beginners and experts (Hastie et al. 2009).

Conflict of Interest

The authors confirm they have no conflicts related to this research or its publication.

Data availability

The data that has been used is confidential.

References

Acosta-Ormaechea S., Morozumi A. (2021), “The Value-Added Tax and Growth: Design Matters”, International Tax and Public Finance 28, 1211–41.

Advani A., Elming W., Shaw J. (2023), “The Dynamic Effects of Tax Audits”, Review of Economics and Statistics 105, 545–61.

Alexopoulos A., Dellaportas P., Gyoshev S., Kotsogiannis C., Olhede S. C., Pavkov T. (2021), “Detecting Anomalies in Heterogeneous Population–Scale VAT Networks.” arXiv preprint arXiv:2106.14005.

Atawodi O. W., Ojeka S. A. (2012), “Factors That Affect Tax Compliance among Small and Medium Enterprises (SMEs) in North Central Nigeria”, International Journal of Business and Management 7, 87–96.

Battaglini M., Guiso L., Lacava C., Miller D. L., Patacchini E. (2024), “Refining Public Policies with Machine Learning: The Case of Tax Auditing”, Journal of Econometrics 105847.

Brautigam D., Fjeldstad O.-H., Moore M., eds (2008), Taxation and State-Building in Developing Countries: Capacity and Consent, Cambridge University Press, Cambridge and New York.

Charbuty B., Abdulazeez A. (2021), “Classification Based on Decision Tree Algorithm for Machine Learning”, Journal of Applied Science and Technology Trends 2, 20–8.

Cobham A. (2005), “Tax Evasion, Tax Avoidance, and Development Finance”, Queen Elizabeth House Working Paper No. 129, University of Oxford, Oxford.

DeBacker J., Heim B. T., Tran A., Yuskavage A. (2015), “Legal Enforcement and Corporate Behavior: An Analysis of Tax Aggressiveness after an Audit”, The Journal of Law and Economics 58, 291–324.

Engida T. G., Baisa G. A. (2014), “Factors Influencing Taxpayers’ Compliance with the Tax System: An Empirical Study in Mekelle City, Ethiopia”, eJournal of Tax Research 12, 433–52.

Ghura M. D. (1998), Tax Revenue in Sub-Saharan Africa: Effects of Economic Policies and Corruption, International Monetary Fund, Washington, DC.

González P. C., Velásquez J. D. (2013), “Characterization and Detection of Taxpayers with False Invoices Using Data Mining Techniques”, Expert Systems with Applications 40, 1427–36.

Goutte C., Gaussier E. (2005), “A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation”, in Losada D. E., Fernández-Luna J. M., eds, Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, Vol. 3408, Springer, Berlin/Heidelberg, pp. 345–59.

Hastie T., Tibshirani R., Friedman J. (2009), “Overview of Supervised Learning”, in Hastie T., Tibshirani R., Friedman J., eds, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, Berlin, pp. 9–41.

Hilbe J. M. (2011), “Logistic Regression”, International Encyclopedia of Statistical Science 1, 15–32.

Kenya Revenue Authority (2023), Reporting Tax Fraud, https://www.kra.go.ke/tax-fraud/types-of-frauds

Kirchler E. (2007), The Economic Psychology of Tax Behaviour, Cambridge University Press, Cambridge.

Kotsogiannis C., Salvadori L., Karangwa J., Mukamana T. (2024a), “Do Tax Audits Have a Dynamic Impact? Evidence from Corporate Income Tax Administrative Data”, Journal of Development Economics 170, 103292.

Kotsogiannis C., Salvadori L., Karangwa J., Murasi I. (2024b), “E-Invoicing, Tax Audits and VAT Compliance”, Journal of Development Economics 172, 103403.

Mardhiah M., Miranti R., Tanton R. (2019), “The Slippery Slope Framework: Extending the Analysis by Investigating Factors Affecting Trust and Power”, CESifo Working Paper No. 7494, Center for Economic Studies and Ifo Institute (CESifo), Munich.

Mascagni G., Nell C., Monkam N. (2017), “One Size Does Not Fit All: A Field Experiment on the Drivers of Tax Compliance and Delivery Methods in Rwanda”, ICTD Working Paper No. 58, Available at SSRN: https://ssrn.com/abstract=3120363 or 10.2139/ssrn.3120363.

Munyentwali G. (2015), “Factors Affecting Tax Compliance in Rwanda: An Empirical Analysis”, Research Journal of Economics 1, 1–23.

Murorunkwere B. F., Ihirwe J. F., Kayijuka I., Nzabanita J., Haughton D. (2023), “Comparison of Tree-Based Machine Learning Algorithms to Predict Reporting Behavior of Electronic Billing Machines”, Information 14, 140.

Murorunkwere B. F., Tuyishimire O., Haughton D., Nzabanita J. (2022), “Fraud Detection Using Neural Networks: A Case Study of Income Tax”, Future Internet 14, 168.

Nyesiga D. (2016), 25 Companies Named in Rwf 6.8 Billion Tax Fraud, KT PRESS, https://www.ktpress.rw/2016/09/25-companies-named-in-rwf-6-8-billion-tax-fraud/

Rwanda Revenue Authority (2019), RRA Tax Handbook, https://www.rra.gov.rw/fileadmin/user_upload/rra_tax_handbook_november_2019.pdf

Rwanda Revenue Authority (2020), Tax Statistics in Rwanda, https://www.rra.gov.rw/fileadmin/user_upload/rra_tax_statistics_4_version_2019-20_official.pdf

Slemrod J., Yitzhaki S. (2002), “Tax Avoidance, Evasion, and Administration”, in Auerbach A. J., Feldstein M., eds, Handbook of Public Economics, Vol. 3, Elsevier, Amsterdam, pp. 1423–70.

Tilahun M. (2019), “Determinants of Tax Compliance: A Systematic Review”, Economics 8, 1–7.