Professor Mahdi Ahmadi
Department of ITDS
Phone: 940.565.2946
Office: BLB 331C
Email: Mahdi.Ahmadi@unt.edu
Problem statement (R01):
Diamonds are used and sold in high-end jewelry and other luxury commodities. Putting the right price on a diamond requires years of business and gemology experience. In this project you develop a predictive tool that helps a diamond trading company price diamonds correctly. The data set on which you build the model contains sales records of almost 54 thousand diamonds. You create a predictive model that uses a diamond's physical characteristics to predict “price” in US dollars.
Number of columns: 10
Number of records: 53940
Target variable: price
Suggested data mining techniques and processes: Exploratory data analysis, Regression modeling
References: 1
Download data set and data dictionary:
Problem statement (R02):
In this project you are helping a real estate company estimate house prices in Miami, Florida. The data set contains information on almost 14 thousand single-family homes sold in Miami. Create a model to predict the sale price (SALE_PRC) of a house with a given set of characteristics.
Number of columns: 17
Number of records: 13932
Target variable: SALE_PRC
Suggested data mining techniques and processes: Regression modeling, Exploratory data analysis
References: 1
Download data set and data dictionary:
Problem statement (R03):
School managers and teachers can plan and perform better for the education of students if they can identify students who are at higher risk of failure. In addition to predicting student performance, understanding and quantifying the factors that influence student outcomes is instrumental in helping low-performing students before it is too late. In this project you are helping a school system in Portugal predict students' grades in the mathematics and Portuguese language courses. For each course there is a separate data set, and each data set includes information on students' demographics, educational and social activities, alcohol consumption, their grades for the first and second period of the term, and their final grade.
You are asked to build two models: one for predicting students' final grade in math, and another for predicting their final grade in the Portuguese language course. As in several other countries, a 20-point grading scale is used, where 0 is the lowest grade and 20 is the perfect score. Although the scores could be discretized into a categorical variable and used to develop a classification model, in this project you are asked to keep the grades on the original numeric scale. You do not need to build an explanatory model for this project; the goal is to create an accurate predictive model. Read the reference article for more details.
Number of columns: 33
Number of records: 395 in the math course data set and 649 in the Portuguese language course data set
Target variable: G3 (in both the math and the Portuguese language course data sets)
Note: G1 or G2 can also be selected as the target variable. However, if G1 is the target variable, both G2 and G3 should be dropped from the analysis, because the second midterm and final exam grades would not be known before the first midterm exam. Similarly, if G2 is selected as the target, G3 should not be included in the model.
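The rule in the note above can be sketched as a small helper that, for a chosen target, removes the target itself and every grade recorded after it. This is only an illustrative sketch: the function name `feature_columns` and the non-grade column names in the example are assumptions, not taken from the data dictionary.

```python
# Grade columns in the order they become known during the term.
GRADE_COLUMNS = ["G1", "G2", "G3"]


def feature_columns(all_columns, target):
    """Return the columns that may be used as predictors for `target`.

    The target itself and every grade recorded after it are excluded,
    so the model never sees information from the future.
    """
    if target not in GRADE_COLUMNS:
        raise ValueError(f"target must be one of {GRADE_COLUMNS}")
    cutoff = GRADE_COLUMNS.index(target)
    excluded = set(GRADE_COLUMNS[cutoff:])
    return [c for c in all_columns if c not in excluded]


# Hypothetical subset of the 33 columns, for illustration only.
columns = ["age", "studytime", "absences", "G1", "G2", "G3"]

print(feature_columns(columns, "G3"))  # ['age', 'studytime', 'absences', 'G1', 'G2']
print(feature_columns(columns, "G1"))  # ['age', 'studytime', 'absences']
```

With G3 as the target, both earlier grades remain available as predictors; with G1 as the target, none of the grade columns are used.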
Suggested data mining techniques and processes: Exploratory data analysis, Regression modeling
Download data set and data dictionary:
Problem statement (R04):
Predicting sales has always been a top business intelligence priority for retail companies. In this project you are recruited to perform sales prediction for Walmart. The data set contains weekly sales information for 45 Walmart stores between 2010 and 2012. The data set also includes a holiday flag, average weekly temperature, average fuel price, consumer price index (CPI), and unemployment rate.
Number of columns: 8
Number of records: 6435
Target variable: Weekly_Sales
Suggested data mining techniques and processes: Exploratory data analysis, Regression modeling
References: 1
Download data set and data dictionary:
Problem statement (R05):
Having a reliable insight into sales volume is very important for retailers. Data mining predictive models can be utilized to predict sales from the product and store characteristics data. In this problem the data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store. Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales. Your role is to build a model to predict sales for a given set of input data.
Please note that the data may have missing values, as some stores might not report all the data due to technical glitches. Missing values must therefore be treated appropriately before modeling.
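One simple way to treat missing values is to fill numeric columns with the column median and categorical columns with the most frequent value. The sketch below uses only the Python standard library. It is one possible approach rather than a required one, and the function name `impute`, the column names `Item_Weight` and `Outlet_Size`, and all values in the sample rows are made up for illustration.

```python
from statistics import median, mode


def impute(records, numeric_cols, categorical_cols):
    """Return copies of `records` with None replaced by a fill value.

    Numeric columns are filled with the column median, categorical
    columns with the most frequent observed value.
    """
    fill = {}
    for col in numeric_cols:
        fill[col] = median(r[col] for r in records if r[col] is not None)
    for col in categorical_cols:
        fill[col] = mode(r[col] for r in records if r[col] is not None)
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in records
    ]


# Made-up rows; None marks a value that a store did not report.
rows = [
    {"Item_Weight": 10.0, "Outlet_Size": "Medium", "Item_Outlet_Sales": 100.0},
    {"Item_Weight": None, "Outlet_Size": "Medium", "Item_Outlet_Sales": 200.0},
    {"Item_Weight": 30.0, "Outlet_Size": None, "Item_Outlet_Sales": 300.0},
    {"Item_Weight": 20.0, "Outlet_Size": "Small", "Item_Outlet_Sales": 400.0},
]

cleaned = impute(rows, numeric_cols=["Item_Weight"], categorical_cols=["Outlet_Size"])
print(cleaned[1]["Item_Weight"])  # 20.0
print(cleaned[2]["Outlet_Size"])  # Medium
```

When the data is split into training and test sets, the fill values are usually computed on the training portion only and then reused on the test portion, so that no information from the test set influences the model.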
Number of columns: 12
Number of records: 14204
Target variable: Item_Outlet_Sales
Suggested data mining techniques and processes: Exploratory data analysis, Regression modeling
Download data set and data dictionary:
Problem statement (R06):
Predicting the sale price of used cars is of great significance to both buyers and sellers in the automotive business. Sellers, whether dealerships or individuals, can establish competitive prices for their listings, resulting in faster sales. Price projections can help buyers make informed judgments and negotiate fair prices. This approach can also help with market analysis, allowing dealerships to change their pricing strategies based on market demand and supply.
The goal of this project is to build a regression model to predict the sale price of used cars based on various features, including the car's manufacturer, model, production year, category, interior details, fuel type, engine volume, mileage, cylinders, transmission type, drive system, wheel position, exterior color, and the number of airbags. The goal is to provide accurate price estimates for used vehicles listed for sale.
Number of columns: 17
Number of records: 13418
Target variable: Price
Suggested data mining techniques and processes: Exploratory data analysis, regression.
References: 1
Download data set and data dictionary:
Problem statement (R07):
The insurance industry may face several challenges such as inconsistent and potentially unfair pricing, risk assessment complexities, a lack of transparency in premium calculations, financial risk for insurers, and a competitive disadvantage. A predictive model is essential as it enables fair and consistent premium pricing tailored to individual risk factors, enhances transparency, ensures financial stability, and provides a competitive edge in the market. The data is collected for an Indian medical insurance firm and it includes almost a thousand records.
The goal of this project is to create a regression model using the available data to predict the annual insurance premium price for individuals. By building an accurate model, insurance firms can optimize their premium pricing process and ensure that policy premiums reflect individual health and medical profiles. The aim is to improve pricing transparency, provide reasonable insurance premiums, and empower individuals to make educated decisions about their insurance coverage while building trust and efficiency in the insurance system.
Number of columns: 11
Number of records: 986
Target variable: PremiumPrice
Suggested data mining techniques and processes: Exploratory data analysis, regression.
References: 1
Download data set and data dictionary: