This project analyzes a credit risk dataset using SQL. The goal is to practice foundational SQL while answering business-style questions about loan defaults, borrower profiles, loan grades, interest rates, and loan purpose.
The dataset used in this project is credit_risk_dataset.csv.
It contains:
- 32,581 records
- 12 columns
- borrower information such as age, income, home ownership, employment length, loan amount, loan interest rate, loan grade, loan purpose, default status, and credit history length
For the SQL queries, the CSV should be imported into a table named:
credit_riskIf your database uses a different table name, replace credit_risk in the queries with your actual table name.
| Column | Description |
|---|---|
person_age |
Borrower age |
person_income |
Borrower annual income |
person_home_ownership |
Borrower home ownership status |
person_emp_length |
Employment length in years |
loan_intent |
Purpose of the loan |
loan_grade |
Loan grade assigned to the loan |
loan_amnt |
Loan amount |
loan_int_rate |
Loan interest rate |
loan_status |
Loan status, where 1 indicates default and 0 indicates non-default |
loan_percent_income |
Loan amount as a percentage of income |
cb_person_default_on_file |
Whether the borrower has a previous default on file |
cb_person_cred_hist_length |
Credit history length in years |
This analysis answers questions such as:
- How many loans are in the portfolio?
- What percentage of loans defaulted?
- Which loan grades have the highest default rates?
- Which loan purposes are more associated with default?
- How does income differ between defaulted and non-defaulted loans?
- How do home ownership and previous default history relate to loan status?
- How can borrowers be grouped using simple
CASE WHENstatements? - Are there missing values that should be reviewed before deeper analysis?
From the dataset review:
- Total records: 32,581
- Overall default rate: 21.82%
- Average income: $66,074.85
- Average loan amount: $9,589.37
- Average interest rate: 11.01%
- Missing values exist in
person_emp_lengthandloan_int_rate, so these fields should be reviewed before deeper analysis.
SELECTWHEREGROUP BYORDER BYCOUNT,SUM,AVG,MIN,MAX- simple percentage calculations
- simple
CASE WHENstatements - basic data quality checks
Credit-Risk-SQL-Analysis/
├── README.md
├── data/
│ └── credit_risk_dataset.csv
├── sql/
│ └── Credit_Risk_Analysis.sql
- Create a table named
credit_riskin your SQL database. - Import the file
credit_risk_dataset.csvinto that table. - Open
sql/Credit_Risk_Analysis.sql. - Run each query one by one.
- Review the comments above each query to understand the business question.
This project is not intended to be a predictive model. It is a SQL analysis project focused on understanding borrower and loan patterns related to default risk.
Future improvements could include:
- creating a dashboard from the SQL outputs
- adding charts for default rate by loan grade and loan intent
- using Python to clean missing values and validate data quality
- expanding the project into a basic machine learning model after the SQL analysis is fully understood