Business Intelligence and Analytics

This guide is designed to help students find business cases to practice data mining techniques.

Audit Fraud Risk

Problem statement (C01): 

As financial fraud cases increase, audit agencies find it harder to detect fraudulent cases effectively. Audit fieldwork requires a significant amount of planning, resources, and time, so auditors benefit from concentrating that effort on the cases most likely to involve fraud. Machine learning models can help auditors improve the quality of their fieldwork by predicting which firms are likely to resort to high-risk practices.

In this project you are helping an audit company identify firms that are more likely to commit financial fraud. The company is an external auditor of government firms in India. You have access to audit data for 776 government firms across different industry sectors. The data includes metrics of financial discreteness and risk scores. Your task is to develop a model to predict cases with high fraud risk. In the data set, each company's audit fraud risk class is given in the RISK column.
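Because the aim is to direct limited audit effort toward the firms most likely to be fraudulent, one possible way to judge a model is precision at k: of the k firms the model ranks as riskiest, what fraction really carry RISK = 1? The sketch below is only an illustration under stated assumptions. The function name precision_at_k and the six scores and labels are hypothetical and are not taken from the data set.

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scored firms whose RISK label is 1."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of firms")
    # Sort firm indices by predicted risk, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = ranked[:k]
    return sum(labels[i] for i in top_k) / k


# Hypothetical model outputs for six firms (not real audit data).
predicted_risk = [0.91, 0.15, 0.78, 0.40, 0.05, 0.66]
actual_risk = [1, 0, 1, 0, 0, 1]  # RISK column: 1 = audit fraud risk

top3 = precision_at_k(predicted_risk, actual_risk, k=3)
top4 = precision_at_k(predicted_risk, actual_risk, k=4)
print(top3, top4)
```

Here k stands for the number of firms the auditor has capacity to visit, so the metric ties model quality directly to the resource constraint described above.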

Number of columns: 10 

Number of records: 776

Target variable: RISK -- A binary variable: 1 = Audit fraud risk, 0 = No audit fraud risk

Suggested data mining techniques and processes: Exploratory data analysis, classification

References: 1, 2

Download data set and data dictionary:

Credit Card Customer Attrition Reduction

Problem statement (C02):

Customer attrition (also known as customer churn, defection, or turnover) is the loss of customers by a business over time. While customer attrition is a normal part of the customer life cycle, it is a key indicator of business health. Conventionally, acquiring new customers is believed to be significantly more expensive than retaining existing ones. Reducing the customer attrition rate is therefore an important objective for many businesses.

In this project you are helping a credit card business manager reduce customer attrition by predicting which customers are likely to leave. You use customer data to build a classification model that predicts which customers are likely to churn. The business manager can then proactively offer those customers better services to change their decision. Customer data includes demographic information (gender, age, education level, and marital status), activity information, and credit information. The attrition status of a customer is indicated by the binary variable “Attrition_Flag”.

Number of columns: 21 

Number of records: 10128 

Target variable: Attrition_Flag -- A binary variable: 1 = Attrited Customer, 0 = Not Attrited Customer (Existing Customer)

Suggested data mining techniques and processes: Exploratory data analysis, classification, clustering 

References: 1, 2

Download data set and data dictionary:

Delivery Customer Services Improvement

Problem statement (C03):

Delivering purchased items to customers on time is a key factor in customer satisfaction for e-commerce businesses. In this project you are helping an e-commerce business improve its shipment process by reducing the number of shipments not delivered at the promised time. You create a model to predict which shipments are likely to miss the promised delivery time. The dataset contains 10 columns describing customer and shipment attributes and a target column (Reached_on_Time), a flag indicating whether the shipment was delivered on time. Predicting this flag would help the business manage customer relationships more efficiently and reduce late deliveries in the future.
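Note that in this data set the flag is coded so that 1 marks a shipment that did NOT arrive on time, which makes the late shipment the positive class. A minimal sketch of how precision and recall for late shipments could be computed under that coding follows. The function name and the eight actual and predicted values are hypothetical, chosen only for illustration.

```python
def late_shipment_metrics(actual, predicted):
    """Precision and recall for the positive class (1 = not delivered on time)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # late, flagged late
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # on time, flagged late
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # late, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical Reached_on_Time values and model predictions for eight shipments.
actual = [1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall = late_shipment_metrics(actual, predicted)
print(precision, recall)
```

Recall shows how many of the truly late shipments the model catches, while precision shows how many of the flagged shipments were actually late. Which one matters more depends on the cost of intervening on a shipment that would have arrived on time.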

Number of columns: 12 

Number of records: 10999 

Target variable: Reached_on_Time -- A binary variable: 1 = the product has NOT reached the customer on time, 0 = the product has reached the customer on time

Suggested data mining techniques and processes: Classification, Exploratory data analysis 

References: 1

Download data set and data dictionary:

Direct Bank Marketing

Problem statement (C04):

Direct marketing of goods and services to a rigorously selected set of customers/contacts can be an effective and affordable marketing method. Contact selection can be enhanced by applying business intelligence and data mining techniques. The data in this project relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on direct phone calls to convince customers to open a deposit account. Often, more than one contact with the same client was required to find out whether the client would subscribe to the product (a bank term deposit).

Your role is to help the bank in their next direct marketing campaign by building a predictive model based on the provided data. You build a classification model to predict the outcome of the marketing effort, Deposit_Subscription, for each customer based on their demographics and banking data. 

Number of columns: 34 

Number of records: 45211 

Target variable: Deposit_Subscription -- 1= customer has subscribed to a term deposit, 0= customer has not subscribed to a term deposit 

Suggested data mining techniques and processes: Classification, Exploratory data analysis 

References: 1, 2

Download data set and data dictionary:

Hotel Cancellation Management

Problem statement (C05):

Hotel business management requires reliable insights into customers' cancellation patterns. In this project you are working with booking and cancellation data sets from two Portuguese hotels. One of the hotels (H1) is a resort hotel and the other (H2) is a city hotel. Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking. Both datasets comprise bookings due to arrive between 1 July 2015 and 31 August 2017, including bookings that actually arrived and bookings that were canceled. Your task is to build a model for each hotel to predict cancellation events. The booking cancellation model would help business managers plan and allocate resources more effectively. Read the reference article for more details.

Number of columns: 32 

Number of records: Resort hotel data (H1): 40,060 and City hotel data (H2): 79,330

Target variable: is_canceled 

Suggested data mining techniques and processes: Exploratory data analysis, classification 

References: 1, 2

Download data set and data dictionary:

Online Users Purchasing Behavior

Problem statement (C06): 

For e-commerce and internet enterprises, generating income from website users is critical. It is the foundation for increasing revenue, allocating resources more efficiently, and improving the user experience. By identifying the factors that determine whether a visitor makes a purchase, businesses can adjust their plans, distribute funds effectively, and personalize user interactions. Furthermore, revenue forecasting facilitates anomaly detection, helps identify problems or fraud, and supports long-term growth by informing strategic decisions.

The goal of this project is to build a classification model that can accurately predict whether an e-commerce web user will make a purchase or not, based on the provided features. This model can help the website's administrators and marketing teams better understand user behavior and tailor their strategies to increase revenue generation.
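The target column for this case, Revenue, is recorded as TRUE/FALSE. When a CSV file is read as plain text, those values arrive as strings, and in Python bool("FALSE") evaluates to True, so the labels need an explicit mapping before modeling. The following is a small sketch of that step. The three-row extract and the PageViews column are made up for illustration; only the Revenue column name comes from this guide.

```python
import csv
import io

# Hypothetical three-row extract standing in for the downloaded CSV file.
sample_csv = io.StringIO(
    "PageViews,Revenue\n"
    "12,TRUE\n"
    "3,FALSE\n"
    "27,TRUE\n"
)


def encode_revenue(value):
    """Map the text 'TRUE'/'FALSE' to 1/0 and reject anything else."""
    normalized = value.strip().upper()
    if normalized not in {"TRUE", "FALSE"}:
        raise ValueError(f"unexpected Revenue value: {value!r}")
    return 1 if normalized == "TRUE" else 0


rows = list(csv.DictReader(sample_csv))
target = [encode_revenue(row["Revenue"]) for row in rows]
purchase_rate = sum(target) / len(target)  # share of sessions with a purchase
print(target, purchase_rate)
```

Raising an error on unexpected values, rather than silently mapping them to 0, makes data-quality problems visible during exploratory analysis.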

Number of columns: 18 

Number of records: 12330

Target variable: Revenue -- A binary variable: TRUE = Completed transaction, FALSE = non-transaction

Suggested data mining techniques and processes: Exploratory data analysis, classification

References: 1, 2

Download data set and data dictionary:

Predicting Brain Stroke

Problem statement (C07): 

Brain stroke is a major health risk event and being able to predict it based on patient data may allow for early intervention and preventive actions to reduce the devastating effects of strokes. Healthcare providers can identify patients at higher risk and modify their care plans accordingly by using patient demographics and health indicators such as hypertension, heart disease, and lifestyle variables. The prediction approach improves both individual and public health while increasing the efficiency of healthcare institutions.

The goal of this project is to develop a classification model that can accurately predict whether a patient is likely to experience a stroke using the provided features. This predictive model can aid healthcare professionals in identifying individuals at higher risk of stroke, enabling early intervention and preventive measures to reduce the incidence of strokes and improve overall patient health outcomes.

Number of columns: 11 

Number of records: 4981

Target variable: Stroke -- A binary variable: 1 = Stroke Risk and 0 = No stroke risk

Suggested data mining techniques and processes: Exploratory data analysis, classification

References: 1, 2

Download data set and data dictionary:

Telco Customer Churn

Problem statement (C08): 

Customer churn forecasting has significance for the telecom industry because it has a direct impact on revenue, profit, and financial sustainability. By identifying customers who are likely to leave based on various customer attributes and service-related features, such as contract type, payment method, monthly charges, and service usage, telecom businesses can take proactive actions to lower churn rates, such as targeted offers, improved customer service, or customized contract choices. Reduced turnover means more income, happier customers, and long-term success. Furthermore, retaining existing consumers is less expensive than acquiring new clients.

The purpose of this project is to create an accurate churn prediction model that can help a telecom service business identify consumers who are likely to churn. Churn prediction categorizes consumers into two groups: those who are likely to abandon the service (Churn = Yes) and those who are likely to remain (Churn = No).
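As a first exploratory step, it can help to compare churn rates across the attributes mentioned above, such as contract type. The sketch below groups a handful of records and computes the share of churners per group. The seven records, the "Contract" key, and its values are assumptions for illustration and may not match the column names in the actual file; only the Churn = Yes/No coding comes from this guide.

```python
from collections import defaultdict

# Hypothetical customer records (not taken from the data set).
customers = [
    {"Contract": "Month-to-month", "Churn": "Yes"},
    {"Contract": "Month-to-month", "Churn": "No"},
    {"Contract": "Month-to-month", "Churn": "Yes"},
    {"Contract": "Two year", "Churn": "No"},
    {"Contract": "Two year", "Churn": "No"},
    {"Contract": "One year", "Churn": "Yes"},
    {"Contract": "One year", "Churn": "No"},
]


def churn_rate_by(records, key):
    """Return {group value: fraction of records in that group with Churn == 'Yes'}."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if record["Churn"] == "Yes":
            churned[group] += 1
    return {group: churned[group] / totals[group] for group in totals}


rates = churn_rate_by(customers, "Contract")
for contract, rate in sorted(rates.items()):
    print(f"{contract}: {rate:.2f}")
```

Large differences between groups in a table like this suggest that the attribute may be a useful predictor, which a classification model can then confirm or refute.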

Number of columns: 21 

Number of records: 7043

Target variable: Churn -- A binary variable: Yes = customer doesn’t renew the service, No = customer stays with the service.

Suggested data mining techniques and processes: Exploratory data analysis, classification

References: 1, 2, 3

Download data set and data dictionary:
