# House Prices – Hybrid Model for Advanced Regression Techniques
### Integrating Decision Trees and Neural Networks for Price Prediction.
<br>
by **Chueh-an Kuo (郭爵安)**
**Full Report on HackMD:**
https://hackmd.io/@EhY6joeNTVu4Z9GCmlZRxQ/HJpYhy-kgx
**Full Report on GitHub Gist:**
https://gist.github.com/ChuehanKuo/5d25a0770e50c3584ed552af973c406d
**Kaggle Notebook (full code):**
https://www.kaggle.com/code/chuehankuo/decision-tree-from-scratch-with-neural-network
<br>
## Table of Contents
### Project Overview:
#### 1. Abstract
#### 2. Motivation
#### 3. Objective
#### 4. Training Flow
### Machine Learning Concepts:
#### 1. Introduction to Decision Trees
#### 2. Introduction to Neural Networks
#### 3. What is DFS vs BFS?
#### 4. What is MSE?
### Model Architecture:
#### 1. Label Prediction Combined with Neural Network
#### 2. Neural Network in This Project (Leaf Node Model)
### Data Processing & Model Building:
#### 1. Step 1: Data Preprocessing
#### 2. Step 2: Build a Decision Tree and a Neural Network from Scratch
#### 3. Step 3: Result Analysis
### Conclusion: Lessons & Goals:
#### 1. Reflection and Learning
#### 2. Future Plans
### Reference
#### 1. References and Image Credits
<br>
## Abstract
**This project documents my journey in building a hybrid regression model that combines decision trees and neural networks, developed entirely from scratch**. Using the Kaggle “House Prices - Advanced Regression Techniques” dataset, I aimed to predict house sale prices based on structured housing data.
**Rather than relying solely on existing machine learning libraries, I challenged myself to manually implement both the decision tree algorithm and neural networks, integrating them to utilize the strengths of each method. The decision tree handles the broader data structure, while neural networks at the leaf nodes capture more localized patterns.**
**The conceptual introductions included in this report reflect my own learning process.** As I explored key ideas like regression, decision trees, neural networks, and evaluation metrics, **I took the time to summarize what I learned and write it into the introduction sections, ensuring I truly understood each component before moving forward.**
**Throughout the report, I share the technical challenges I encountered, how I approached solving them, and the insights I gained at each stage.** This project represents more than just building a model—it captures my growth in understanding machine learning principles from the ground up.
<br>
## Motivation
My journey into computer science didn’t begin with machine learning—it started with curiosity. When I first explored programming through C++, **I was fascinated by how logic and structure could be used to build real, functioning systems**. Around the same time, **I came across an MIT OpenCourseWare lecture on machine learning.** **That was the moment a new question formed in my mind: How does a machine actually learn patterns from data?**
As I dove deeper, I began to recognize how deeply machine learning is embedded in everyday life—from YouTube recommendations to chatbots and virtual assistants. **I started learning the basics through YouTube tutorials and reading online materials**, and eventually discovered Kaggle. **I chose this house price prediction competition as a way to turn what I’d learned into something concrete.**
Because I was also learning data structures in C++, I had just explored how decision trees work. That gave me an idea: **what if I combined this classic tree-based structure with something more modern like neural networks?** **Inspired by how deep learning powers intelligent systems, I decided to implement a hybrid model—a decision tree with a neural network at each leaf node.**
**This project isn’t just about making predictions. It’s my way of connecting foundational computer science skills with real-world machine learning.** More importantly, it marks a turning point—where I moved from learning concepts to building with them. **In the future, I hope to apply machine learning to practical problems, from financial forecasting to healthcare, education, and beyond.**
<br>
## Objective
The objective of this project is to **predict house sale prices using a hybrid model that combines decision trees and neural networks, built entirely from scratch.**
We begin by training this hybrid model on the dataset provided in **train.csv**, then use it to generate predictions on unseen data in **test.csv**. Final predictions are submitted in the format specified by **submission.csv**.
The dataset includes a variety of **housing features**—such as size, shape, location, and roof style—which are used to guide both tree-based splits and neural network learning.
Unlike traditional models, **this project enhances prediction accuracy by incorporating a small neural network at each leaf node of the decision tree.** This allows each terminal node to learn more nuanced, localized patterns rather than relying solely on average values.
**Model performance is evaluated using the Root Mean Squared Error (RMSE)** between the logarithm of predicted and actual sale prices, where a lower RMSE indicates a better-performing model.

<br>
## Training Flow
**This task is a classic supervised regression problem**, a core topic in the field of machine learning.
**The general training flow is:**
```
[x, y]
x → model → y
new x → model → predicted y
```
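As a toy illustration of this flow (not the project's actual model), the snippet below "fits" a trivial predictor that always outputs the mean of y and then applies it to unseen x values:
```python
import numpy as np

# Toy illustration of the flow above (not the project's model): "fit" a trivial model
# that always predicts the mean of y, then apply it to new, unseen x values.
X_train = np.array([[1.0], [2.0], [3.0]])      # x
y_train = np.array([10.0, 20.0, 30.0])         # y

model_mean = y_train.mean()                    # [x, y] -> model

X_new = np.array([[4.0], [5.0]])               # new x
y_pred = np.full(len(X_new), model_mean)       # new x -> model -> predicted y
print(y_pred)                                  # [20. 20.]
```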
<br>
## Introduction to Decision Trees
### 1. Decision Tree: Basic Structure and Logic
* A decision tree is a data structure that **makes predictions by splitting data based on decision rules**.
* The structure resembles a flowchart, made up of:
1. **Root Node** –> The starting point of the tree
2. **Internal Nodes** –> Each tests a condition on a feature
3. **Branches** –> Represent the outcomes of the condition (yes/no, true/false)
4. **Leaf Nodes** –> Final outputs or predictions
***-> In my project, each node splits the data by checking whether a sample's feature value is greater than or less than (or equal to) the node's threshold, so every split is a yes/no decision.***

### 2. How a Decision Tree Learns
* The tree splits the data by finding the best feature and threshold that separates the data into two groups.
* After finding the best split, the node will remember what it used as splitting criteria.
***-> In my case, (feature, threshold) pair.***
* At each node, it evaluates many possible splits and chooses the one that minimizes error, such as:
1. Mean Squared Error (MSE) for regression
2. Gini impurity or entropy for classification
* **This process repeats recursively**, growing the tree deeper layer by layer. Repeating recursively means the function **keeps calling itself** until a stopping condition is met.
* Eventually, the tree stops growing when:
1. It reaches a maximum depth
2. There are too few samples left to split further
***-> In my project, I use the sum of the left and right child nodes' MSE to score each split. I also use the DFS (Depth-First Search) method to grow the tree recursively.***
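As a minimal sketch of this split-scoring idea (illustrative only; the project's actual implementation appears in Step 2), the following scores candidate thresholds for one feature by the summed MSE of the two resulting groups:
```python
import numpy as np

# Minimal sketch (illustrative, not the project's Step 2 code): score one candidate
# split of a single feature by the summed MSE of the two resulting groups.
def split_score(x_column, y, threshold):
    left, right = y[x_column <= threshold], y[x_column > threshold]
    def group_mse(group):
        return np.mean((group - group.mean()) ** 2) if len(group) else 0.0
    return group_mse(left) + group_mse(right)

x_col = np.array([1200, 1500, 1800, 2400, 3000])               # e.g., above-ground living area
prices = np.array([100_000, 120_000, 150_000, 220_000, 300_000])

# try every unique value as a threshold and keep the one with the lowest score
best_threshold = min(np.unique(x_col), key=lambda t: split_score(x_col, prices, t))
print(best_threshold, split_score(x_col, prices, best_threshold))
```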
<br>
## Introduction to Neural Networks
### 1. Neural Network: Structure and Learning Process
* A neural network is made up of multiple layers, and each layer contains one or more neurons.
* A neural network has:
1. Input Layer -> First Layer
2. Hidden Layers -> Between the first and last layer
3. Output Layer -> Last Layer
* Each neuron has:
1. Inputs
2. Weights (W)
3. Bias (b)
4. Activation function
* In most cases (especially in fully connected layers), each neuron performs a transformation using the formula:
**Output = X ⋅ W + b**
***→ This process is called forward propagation.***
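A minimal NumPy sketch of this forward step (illustrative shapes, not the project's trained weights):
```python
import numpy as np

# Minimal sketch of forward propagation through one fully connected layer:
# Output = X . W + b
X = np.random.rand(4, 3)      # 4 samples, 3 input features
W = np.random.rand(3, 5)      # weights: 3 inputs -> 5 neurons
b = np.zeros(5)               # one bias per neuron

output = X @ W + b            # shape (4, 5): every sample passed through the layer
print(output.shape)
```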

### 2. Data Transformation Methods
* There are two main types of transformations:
1. Linear transformation (most common):
Multiply input matrix by weight matrix, add bias.
2. Convolution operation (used in image-related tasks):
A kernel (small matrix) slides over the input matrix and performs element-wise multiplication + summation.
**This is a Linear Transformation:**

In this transformation, D represents the weight matrix and E represents the bias—both are learnable parameters (variables) that define how the input is linearly transformed into the output.
**This is a Convolution Operation:**

### 3. Activation Functions: Introducing Non-Linearity
* Activation functions are **used after each transformation** to add non-linearity.
* An activation function decides how much of a neuron's output should be passed to the next layer, **allowing the network to learn complex, non-linear patterns**.
* Common activation functions:
1. ReLU (Rectified Linear Unit):
* **Used in hidden layers.**
* Fast and efficient, avoids vanishing gradient.
* Formula: f(x) = x if x > 0, else 0
2. Sigmoid:
* **Used in binary outputs** (e.g., 0/1, yes/no).
* Outputs values between 0 and 1 → useful as probabilities.
* Smooth gradient for learning.
**-> *In my project, a linear activation function is used, meaning it just passes on the value without applying any non-linear transformation.***
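For reference, a minimal sketch of the activation functions mentioned above (illustrative only):
```python
import numpy as np

# Minimal sketch of the activation functions described above (illustrative only).
def relu(x):
    return np.where(x > 0, x, 0.0)      # f(x) = x if x > 0, else 0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # squashes values into (0, 1)

def linear(x):
    return x                            # used in this project: passes values through unchanged

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), sigmoid(x), linear(x))
```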

### 4. How Neural Networks Learn: Backpropagation
* After making a prediction, **the model calculates the error using a loss function.**
***-> In my project, MSE(Mean-Squared-Error) is used as the loss function***
* The model then runs **backpropagation**:
1. Compare prediction with actual value.
2. **Adjust each weight based on how much error that neuron caused**.
3. This update is based on the gradient (slope) of the loss with respect to each weight.
* Components involved:
1. **Loss function**: Measures error (how wrong the model was).
2. **Gradient descent**: Finds the direction to update weights to reduce error.
3. **Learning rate**: Controls step size during weight updates.
* Too small → slow learning
* Too large → unstable training
***-> This whole adjustment process is called backpropagation.***
* **One full pass of forward propagation + backpropagation over the training data = 1 epoch.** (A single forward and backward pass on one batch is usually called an iteration.)
**We run multiple epochs to gradually reduce the error and improve predictions.**

***-> In my project, each leaf-node network is trained for a total of 20 epochs.***
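To make the weight-update idea concrete, here is a minimal, self-contained sketch of gradient descent on a single weight (illustrative only, not the project's training code):
```python
import numpy as np

# Minimal sketch of gradient descent on one weight: fit y = w * x by repeatedly
# nudging w against the gradient of the MSE loss (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])        # true relationship: y = 2x

w, learning_rate = 0.0, 0.05
for epoch in range(20):                    # one pass over the data per epoch
    y_pred = w * x                         # forward pass
    grad = np.mean(2 * (y_pred - y) * x)   # d(MSE)/dw
    w -= learning_rate * grad              # weight update (step size = learning rate)
print(round(w, 3))                         # approaches 2.0
```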
<br>
## What is DFS(Depth-First-Search) vs BFS(Breadth-First-Search)?
### 1. DFS(Depth-First-Search)
* DFS explores as far as possible along each branch before backtracking.
* It builds one full path before moving to the next.
### 2. BFS(Breadth-First-Search)
* BFS explores all nodes at the current level before moving to the next level.
* Less commonly used for tree construction, but often used in search algorithms or shortest path problems.

***-> In my project, we use DFS to recursively build the decision tree.***
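A minimal sketch contrasting the two traversal orders on a tiny hand-built tree (illustrative only):
```python
from collections import deque

# Minimal sketch contrasting DFS and BFS on a tiny binary tree (illustrative only).
class TreeNode:
    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right = name, left, right

def dfs(node, visited):
    if node is None:
        return
    visited.append(node.name)           # go as deep as possible along one branch first
    dfs(node.left, visited)
    dfs(node.right, visited)

def bfs(root):
    visited, queue = [], deque([root])
    while queue:
        node = queue.popleft()           # visit nodes level by level
        visited.append(node.name)
        if node.left:
            queue.append(node.left)
        if node.right:
            queue.append(node.right)
    return visited

root = TreeNode("A", TreeNode("B", TreeNode("D"), TreeNode("E")), TreeNode("C"))
order_dfs = []
dfs(root, order_dfs)
print(order_dfs)     # ['A', 'B', 'D', 'E', 'C']  (depth first)
print(bfs(root))     # ['A', 'B', 'C', 'D', 'E']  (breadth first)
```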
<br>
## What is MSE(Mean-Squared-Error)?
* **MSE is the key loss function used in this regression problem.**
* MSE measures how far off our predicted values are from the actual values.
* It squares the difference, so larger errors are punished more heavily.
* **A lower MSE means better predictions.**

***-> In my project, MSE is used to evaluate each split in the decision tree. It is also used as the loss function for each individual leaf node.***
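A minimal sketch of the calculation (illustrative numbers):
```python
import numpy as np

# Minimal sketch of MSE: the average of squared differences between predictions and actual values.
y_true = np.array([200_000, 150_000, 320_000])
y_pred = np.array([210_000, 140_000, 300_000])

mse = np.mean((y_pred - y_true) ** 2)
print(mse)   # larger errors contribute disproportionately because they are squared
```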
<br>
## Label Prediction Combined with Neural Network
* To make our final prediction (label) more accurate, we don't just stop at the basic mean value at each leaf node.
* Instead, **we use a small neural network at each leaf node** to predict the house prices more precisely.
* This way, **every leaf node learns a local pattern based on the group of houses it receives, rather than just averaging them.**
* We call this function **label_network()**, and it performs:
X (features) → Neural Network → y (predicted label)
```mermaid
graph TD
Start[Start: Call _grow_tree X, y, depth=0] --> N1[1: Create Node 1 Split by Feature f1]
N1 --> N2[2: Create Left Node 2]
N2 --> L1[3: Leaf Node-label]
N2 --> L2[4: Leaf Node-label ]
N1 --> N3[5: Create Right Node 3]
N3 --> L3[6: Leaf Node-label ]
N3 --> L4[7: Leaf Node-label]
subgraph via Neural Network
L1
L2
L3
L4
end
```
<br>
## Neural Network in This Project (Leaf Node Model)
* **Each leaf node in the decision tree contains its own neural network.**
* Structure:
1. Input: **56 features (after label encoding)**
2. Layers:
* Layer 1: 5 neurons
* Layer 2: 10 neurons
* Layer 3: 10 neurons
* Layer 4 (output): 1 neuron
* Training process:
1. A single leaf node typically contains a group of houses (for example, around 50).
2. Each house’s 56 **features are passed through the network**.
3. At each neuron:
* **Inputs are multiplied by weights, added to a bias, passed to the next layer.**
4. **The network's outputs are averaged into one predicted sale price for that group of houses.**
* After predictions:
1. The network compares its output to the actual prices **using MSE**.
2. Then it **performs backpropagation to adjust all weights**.
3. This loop (predict → compare → adjust) runs for **20 epochs**(in this project) **per leaf node**.
* End result:
1. **Each leaf node ends up with a prediction learned by its own trained neural network for the group of houses it received.**
2. **After training, that leaf value is fixed, and any new house that lands in the leaf during prediction receives it.**
<br>
## Step 1 - Data Preprocessing
### 1. Load the Dataset into DataFrames.
We use the read_csv() function from the pandas library to load the training and testing data from CSV files into DataFrames.
```python
import pandas as pd

train_df = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
test_df = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")
```
### 2. Handle missing values appropriately to ensure data integrity.
**We check for missing values from the training data using:**
```python
print(train_df.isnull().sum())
```
**To maintain data quality:**
* Drop columns with fewer than 700 non-missing values.
```python
drop_columns=train_df.columns[train_df.count()<700]
traindf=train_df.drop(drop_columns,axis=1,inplace=False)
testdf=test_df.drop(drop_columns,axis=1,inplace=False)
```
* Drop columns that don't contribute to the final prediction.
```python
train_df=train_df.drop('Id',axis=1)
test_df=test_df.drop('Id',axis=1)
```
* Separate SalePrice from the training data and store it in train_Y, which will be our prediction target; train_X (with SalePrice removed) will be our training features.
```python
train_Y=traindf['SalePrice']
train_X=traindf.drop('SalePrice', axis=1,inplace=False)
```
* Write a function that fills missing numerical values with the mean (average value) of each column and drops string columns that still contain missing data. Run the function and store the cleaned versions in new DataFrames trainx and testx.
```python
# assumed earlier selection of the numeric / string column names (e.g., via select_dtypes)
numerical_columns = train_X.select_dtypes(include=['int64', 'float64']).columns
object_columns = train_X.select_dtypes(include=['object']).columns

def pandas_fillna(data, num_col, obj_col):
    # fill numeric NaNs with the column mean, drop string columns that still contain NaNs
    data[num_col] = data[num_col].fillna(data[num_col].mean())
    obj_no_na = data[obj_col].dropna(axis=1)
    data = pd.concat([data[num_col], obj_no_na], axis=1)
    return data

trainx = pandas_fillna(train_X, numerical_columns, object_columns)
testx = pandas_fillna(testdf, numerical_columns, object_columns)
```
* Make sure trainx and testx correspond by keeping the common columns.
```python
final_col=list(set(trainx.columns) & set(testx.columns))
trainx=trainx[final_col]
testx=testx[final_col]
```
***→ Now we have a total of 1460 rows(houses), and 56 columns(features).***
<br>
#### Reflections:
**Managing missing values was more challenging than expected. I initially assumed I could simply drop all rows with missing values, but that approach would have discarded valuable data. Through experimentation, I learned to drop columns with excessive missing values and to fill others with the mean value. This step taught me how crucial it is to treat each column differently, and how thoughtful preprocessing directly impacts final model performance.**
<br>
### 3. Use Label Encoder to convert string-type (categorical) columns into numeric codes.
To handle non-numerical (categorical) features, we first identify columns with data type "object". We then import the LabelEncoder class from scikit-learn, which is used to convert text labels into numeric codes.
A label_encoder object is created, and a loop is used to apply fit_transform on each column in the training data.
***→ This allows the model to learn how to encode each string label into a corresponding number.***
```python
object_columns = trainx.select_dtypes(include=['object']).columns
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
for column in object_columns:
    # fit the encoder on the training column, then apply the same mapping to the test column
    trainx[column] = label_encoder.fit_transform(trainx[column])
    testx[column] = label_encoder.transform(testx[column])
```
<br>
#### Reflections:
**While encoding seemed straightforward at first, I ran into a key issue: applying fit_transform() on both training and test data caused inconsistencies when test data had unseen labels. I had to revise my method and ensure that the encoding was learned only from the training data, and then applied to the test set using transform(). This experience helped me understand how easily data leakage can occur if preprocessing steps aren’t carefully handled.**
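One possible way to guard against that leakage is sketched below, assuming one encoder per column (fit only on training data) and a fallback code of -1 for unseen test labels; this is an illustration, not the exact code used in this notebook:
```python
from sklearn.preprocessing import LabelEncoder

# Illustrative sketch (not the exact notebook code): fit one encoder per column on the
# training data only, then map unseen test labels to a fallback code instead of failing.
encoders = {}
for column in object_columns:
    enc = LabelEncoder().fit(trainx[column])
    encoders[column] = enc
    trainx[column] = enc.transform(trainx[column])

    known = set(enc.classes_)
    testx[column] = [
        enc.transform([value])[0] if value in known else -1   # -1 marks an unseen label
        for value in testx[column]
    ]
```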
<br>
### 4. Visualize the data to observe the distribution of numerical features, using the matplotlib and seaborn modules. In the plots below, blue represents training data and green represents testing data.
```python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Observe numerical feature distribution
def manifest(data1, data2):
    num_columns = data1.columns
    num_features = len(num_columns)
    n_cols = 4
    n_rows = int(np.ceil(num_features / n_cols))
    plt.figure(figsize=(n_cols * 4, n_rows * 3))  # one base figure for all subplots
    for i, column in enumerate(num_columns, 1):
        plt.subplot(n_rows, n_cols, i)  # subplot(rows, columns, current plot index)
        sns.histplot(data1[column], color='blue', label='train')   # distribution of training values
        sns.histplot(data2[column], color='green', label='test')   # distribution of testing values
        plt.title(f'{column}')
        plt.xlabel(column)
        plt.ylabel('Frequency')
    plt.suptitle('Feature distribution', fontsize=16, weight='bold', y=1.005)
    plt.tight_layout()
    plt.show()

manifest(trainx, testx)
```

<br>
#### Reflections:
**Visualizing feature distributions helped me uncover patterns I hadn’t noticed in raw numbers. However, I struggled at first with the overlapping parts that were hard to read. I learned to adjust plot settings like transparency (alpha) and layout sizing to make the visualizations clearer. This step reinforced the importance of visualization in identifying data imbalances and understanding feature behavior.**
<br>
### 5. Standardize the training and testing datasets to ensure that all numerical features are on the same scale, which improves model performance and stability.
We use StandardScaler from scikit-learn to standardize both the training and testing datasets, then store them in new DataFrames trainx_scaled and testx_scaled.
```python
from sklearn.preprocessing import StandardScaler

# initialize the scaler
scaler = StandardScaler()
# fit on trainx and transform it
trainx_scaled = pd.DataFrame(scaler.fit_transform(trainx), columns=trainx.columns)
# transform testx with the same scaler
testx_scaled = pd.DataFrame(scaler.transform(testx), columns=testx.columns)
```
**Note: I later removed this part of the code (see the reflection below).**
<br>
#### Reflections:
**At first, I thought standardizing the features using StandardScaler would help the model perform better, especially since neural networks usually benefit from features being on the same scale. However, after experimenting, I realized it didn’t improve my predictions. This is likely because decision trees don’t care about feature scale, and my neural networks were small and trained only within leaf nodes. So in the end, I removed this step. This experience taught me an important lesson: just because something is commonly used in machine learning doesn’t always mean it’s necessary. It’s important to test each part and keep only what truly helps.**
<br>
### 6. Visualize Feature Correlation with Target.
```python
import matplotlib.pyplot as plt
import seaborn as sns

numeric_features = train_X.select_dtypes(include=['int64', 'float64']).columns
n_cols = 5  # show five plots per row
n_rows = (len(train_X.columns)) // n_cols + 1  # calculate how many rows are required
plt.figure(figsize=(20, 2 * n_rows))
for i, feature in enumerate(numeric_features):
    plt.subplot(n_rows, n_cols, i + 1)
    plt.scatter(x=train_X[feature], y=train_Y)  # each feature against SalePrice
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
plt.tight_layout()  # adjust spacing
plt.show()
```
This visualization helps illustrate which features in the dataset are more predictive of house prices. Features such as GrLivArea, TotalBsmtSF, and OverallQual show strong, visible upward trends with respect to SalePrice, indicating high predictive power. In contrast, others like PoolArea, MiscVal, and LowQualFinSF show weak or no visible correlation.
It visually supports the idea that **some features are more predictive than others**, guiding both feature selection in preprocessing and splitting behavior during decision tree construction.
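To quantify these visual trends, one could also compute Pearson correlations between each numeric feature and SalePrice; the sketch below reuses the train_X, train_Y, and numeric_features variables from the steps above (an illustration, not part of the original notebook):
```python
# Illustrative sketch: quantify the visual trends above with Pearson correlations
# between each numeric feature and SalePrice (reuses train_X / train_Y / numeric_features).
correlations = (
    train_X[numeric_features]
    .corrwith(train_Y)
    .sort_values(ascending=False)
)
print(correlations.head(10))   # e.g., OverallQual and GrLivArea rank near the top
```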

#### Reflections:
**At this point in the project, I wanted to get a clearer sense of which features might actually be important for predicting house prices. Writing code is one thing, but being able to visually spot strong or weak relationships really helped me understand my data better. Some features showed a strong upward trend, while others looked completely scattered. These visualizations helped me build intuition about what mattered and later confirmed that the tree’s feature importance aligned with those insights. As someone still learning, this step gave me a real sense of achievement—like I was starting to see the logic behind the predictions.**
<br>
## Step 2 - Build a Decision Tree and a Neural Network From Scratch

### 1. Define Node Class to Structure the Tree.
A decision tree is made up of many nodes, and each node stores important information—to be precise, the (feature, threshold) pair. The Node class below defines what each node contains: the feature to split on, the threshold value, and the left and right child nodes. If it is a leaf node (the end of the tree), it stores a prediction value instead.
```python
class Node:
    def __init__(
        self, feature=None, threshold=None, left=None, right=None, *, value=None
    ):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value
```
The function is_leaf_node() checks if the current node is a leaf node by checking if it stores a prediction value.
```python
    def is_leaf_node(self):  # a leaf node is the only kind of node that stores a value
        return self.value is not None
```
<br>
#### Reflections:
**When I started writing the Node class, I didn’t fully understand how each part—feature, threshold, left/right children come together. But once I saw how these components represent decisions and paths in the tree, it all made sense. This small class became the foundation for everything that followed, and writing it helped me understand how abstract logic turns into real prediction flow. It felt like building a skeleton for the model, one piece at a time.**
<br>
### 2. Define the Structure and Training Logic for Decision Tree + Neural Network
The model is built from the class DecisionNetwork_regression. This class controls how the tree grows, which (feature, threshold) pair is selected, how the neural network is trained at the leaf nodes, and also how predictions are made.
```python
class DecisionNetwork_regression:
    def __init__(self, n_trees=None, min_samples_split=2, max_depth=20, n_feats=None):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.n_feats = n_feats
        self.root = None  # root of the tree will be assigned later when fit() is called
        self.mse_dict = {}
        self.mse = []
        self.feature_dict = {}
```
When we initialize the model, we specify max_depth (the maximum depth the tree can grow) and min_samples_split (the minimum number of samples needed for a split) to control when the tree should stop growing. Later during training, the tree will grow recursively based on these limits. The constructor __init__ runs when the model is created.
* **self.mse_dict**: A dictionary used to store the MSE value at each depth of the tree
* **self.mse**: To store the average MSE at each depth level after the whole tree has been built.
* **self.feature_dict**: To track how many times each feature is used for splitting across the entire tree.
* **self.min_samples_split**: Minimum number of samples needed for a split. If a node receives fewer samples than this value, it becomes a leaf node.
* **self.max_depth**: Maximum depth the tree can grow. Once the maximum depth is reached, the node becomes a leaf.
* **self.n_feats**: Controls how many features are randomly selected at each node.
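As a usage illustration (the hyperparameter values below are examples only, and fit() and predict() are defined in the sections that follow, assuming trainx / train_Y / testx from Step 1 converted to NumPy arrays):
```python
# Illustrative usage only — hyperparameter values here are examples, not tuned settings.
model = DecisionNetwork_regression(min_samples_split=5, max_depth=10)
mse_per_depth, tree_depth, feature_counts = model.fit(trainx.values, train_Y.values)
test_predictions = model.predict(testx.values)
```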
<br>
#### Reflections:
**To make this section work, I had to learn and apply important concepts like:**
* **Recursion –> for how the tree grows layer by layer**
* **Depth tracking –> to stop the tree from growing infinitely**
* **Feature selection and MSE calculation –> to decide the best split at each node**
* **Dictionary tracking –> to store results like feature usage and MSE at each depth**
**As I implemented each part, I started to realize how a model is more than just predictions. It’s a carefully structured system where every component plays a crucial role. Writing it out line by line helped me understand what usually stays hidden behind the scenes.**
<br>
### 3. Train the Tree Using fit() and Recursively Grow It
Once the class is initialized, we train the model by calling the fit(X, y) function. This function triggers _grow_tree(), which begins building the tree using depth-first search (DFS). Each node tries to find the best feature and threshold to split the data. During training, we also track which features were used and store the MSE at each level of the tree.
```python
def fit(self, X, y):
    self.depth = 0
    self.mse_dict = {}
    self.feature_dict = {}
    self.n_feats = X.shape[1] if not self.n_feats else min(self.n_feats, X.shape[1])
    self.root = self._grow_tree(X, y)
    sorted_dict = dict(sorted(self.mse_dict.items()))
    self.mse = [sum(values) / len(values) for values in sorted_dict.values()]
    return self.mse, self.depth + 1, self.feature_dict
```
* **self.root = self._grow_tree(X, y)**: This is where we begin recursively building the tree from the top.
* **Depth-First-Search(DFS) Explanation**: DFS (Depth-First Search) is a fundamental tree traversal strategy where the algorithm goes as deep as possible down one path before backtracking. In this project, it's used to recursively build the decision tree: each node first grows its left child completely, then proceeds to grow the right child, continuing this process until stopping criteria are met.
<br>
### 4. Recursively Grow the Tree with _grow_tree() function
After calling the fit() function, the core of the training happens in the _grow_tree() function. This function grows the tree recursively using depth-first search. At each node, it checks if the stopping criteria are met. If not, it finds the best (feature, threshold) pair to split the data and creates left and right child nodes.
```python
def _grow_tree(self, X, y, depth=0):
    n_samples, n_features = X.shape
    self.depth = max(self.depth, depth)
    # stopping criteria
    if (
        depth >= self.max_depth
        or n_samples < self.min_samples_split
    ):
        if n_samples == 0:
            leaf_value = 0
        else:
            # train a small neural network on this leaf's houses and average its predictions
            leaf_value = np.mean(self.label_network(X, y))
        return Node(value=leaf_value)
```
* **Stopping Criteria**: If the current depth reaches max_depth or the number of samples is smaller than min_samples_split, the node becomes a leaf.
* **leaf_value**: If this node is a leaf, we calculate the predicted price using a neural network, and assign it to leaf_value.
The function randomly selects a subset of features and tests different threshold values to find the pair with the lowest MSE. Afterwards, it stores the MSE of the current depth in mse_dict and tracks feature usage frequency in feature_dict.
```python
    feat_idxs = np.random.choice(n_features, self.n_feats, replace=False)
    # greedily select the best split according to the lowest summed MSE
    best_feat, best_thresh, min_mse = self._best_criteria(X, y, feat_idxs)
    # ------------------ mse_dict ---------------------------
    # record the best MSE found at this depth
    if depth not in self.mse_dict:
        self.mse_dict[depth] = []
    self.mse_dict[depth].append(min_mse)
    # ------------------ feature importance ------------------
    # count how many times each feature is chosen for a split
    self.feature_dict[best_feat] = self.feature_dict.get(best_feat, 0) + 1
```
Then the node splits the dataset and recursively builds the left and right subtrees:
```python
    left_idxs, right_idxs = self._split(X[:, best_feat], best_thresh)
    left = self._grow_tree(X[left_idxs, :], y[left_idxs], depth + 1)
    right = self._grow_tree(X[right_idxs, :], y[right_idxs], depth + 1)
    return Node(best_feat, best_thresh, left, right)
```
* **feat_idxs**: Which features to test; a random subset is picked rather than all features, to reduce overfitting.
* **left_idxs / right_idxs**: How houses are split based on the selected (feature, threshold).
<br>
#### Reflections:
**When I first tried to begin the training process, I was stuck for days—I couldn’t understand how a tree could "build itself." The concept of recursion was hard to grasp, especially when combined with depth tracking, MSE calculation, and feature importance updates all at once.
What finally helped was learning about Depth-First Search (DFS). Once I understood that the tree grows by fully expanding one branch before returning to the previous level, everything started to make sense. It was still tricky to follow how each part of the function interacted, but after breaking it down step by step, I was able to piece it together.
This was one of the most challenging but also most rewarding moments in the project, because it taught me how abstract ideas like recursion become powerful once I truly understand how they work in practice.**
<br>
### 5. Find the Best Split Using _best_criteria() function
At each node during training, we need to decide which feature and which threshold to use to split the dataset. This decision is made using the _best_criteria() function. It tests many different combinations and picks the one that results in the lowest error (MSE) when the data is split into two groups.
```python
def _best_criteria(self, X, y, feat_idxs):
    min_mse = float('inf')
    min_diff_idx = float('inf')
    split_idx, split_thresh = None, None
    for feat_idx in feat_idxs:
        X_column = X[:, feat_idx]
        if len(np.unique(X_column)) > 13:
            # many distinct values: test only the quartiles as candidate thresholds
            thresholds = self.calculate_quartiles(X_column)
        else:
            # few distinct values: test every unique value
            thresholds = np.unique(X_column)
        for threshold in thresholds:
            mse, diff_idx = self.calculate_mse(y, X_column, threshold)
            # keep the split with the lowest MSE and the most balanced group sizes
            if diff_idx < min_diff_idx and mse < min_mse:
                min_diff_idx = diff_idx
                min_mse = mse
                split_idx = feat_idx
                split_thresh = threshold
    return split_idx, split_thresh, min_mse
```
**Thresholds**:
* If a feature has many different values, calculate 25th, 50th, and 75th percentile (quartiles).
* If the feature doesn't vary much, use all its unique values directly.
**Test all pairs**: Try every (feature, threshold) combination and calculate how well it splits the data.
**Choose the best one**: Keep track of the combination that gives the lowest MSE and most balanced split.
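The helpers calculate_quartiles() and mean_label() are referenced here but not shown in this excerpt; a plausible sketch of the quartile helper, assuming it simply returns the 25th/50th/75th percentiles of a column, would be:
```python
import numpy as np

# Plausible sketch of the quartile helper referenced above (assumed behavior only):
# return the 25th, 50th, and 75th percentiles of a feature column as candidate thresholds.
def calculate_quartiles(X_column):
    return np.percentile(X_column, [25, 50, 75])

print(calculate_quartiles(np.array([1200, 1500, 1800, 2400, 3000])))  # [1500. 1800. 2400.]
```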
<br>
#### Reflections:
**At first, I thought selecting a threshold would be straightforward—just test a few values and pick the best one. But after doing it, I quickly realized how difficult it was to decide which values to test and how to evaluate them effectively. There were too many possibilities, and testing every single unique value was inefficient. I struggled with finding a balance between precision and speed. Eventually, I learned to use quartiles for features with many unique values and exact values for simpler ones. This taught me that threshold selection is a crucial step in defining the prediction outcome.**
<br>
### 6. Calculate the MSE for a Given Split Using calculate_mse()
After determining a candidate threshold for a feature, we must evaluate how well it splits the data. This is done using the calculate_mse() function. It calculates the Mean Squared Error (MSE) by measuring how far the predicted price (mean of the group) is from the actual price for each house.
```python
def calculate_mse(self, y, X_column, split_thresh):
    # generate split
    left_idxs, right_idxs = self._split(X_column, split_thresh)
    # measure how unbalanced the two groups are
    diff_idx = abs((len(left_idxs) - len(right_idxs)) // 4)
    # use the mean value as the label and calculate the error of each group:
    if len(left_idxs) != 0:
        mean_y_left = self.mean_label(y[left_idxs])
        left_mse = (1 / y[left_idxs].shape[0]) * np.sum(np.abs(y[left_idxs] - mean_y_left))
    else:
        left_mse = 0
    if len(right_idxs) != 0:
        mean_y_right = self.mean_label(y[right_idxs])
        right_mse = (1 / y[right_idxs].shape[0]) * np.sum(np.abs(y[right_idxs] - mean_y_right))
    else:
        right_mse = 0
    mse = left_mse + right_mse
    return mse, diff_idx
```
* **left_idxs / right_idxs**: Split the houses into two groups based on the threshold.
* **mean_y_left / mean_y_right**: For each group, calculate the average sale price.
* **Error calculation**: For each house, take the absolute difference between its actual sale price and the group mean, then average these differences over the group; the left and right group errors are summed to score the split.
<br>
#### Reflections:
**While designing the splitting method, I had to figure out how to measure what makes a "good" split. I decided to use the sum of the MSE from the left and right child nodes as my evaluation metric. At first, it seemed too simple, but after testing it, I saw that it guided the tree to better groupings. This made me realize how central the loss function is. Not only in training, but also in how the model makes decisions.**
<br>
### 7. Split the Data and Traverse the Tree
After selecting the best (feature, threshold) pair, we need a way to split the dataset accordingly. The _split() function separates the dataset into left and right groups based on the threshold. Later, during prediction, _traverse_tree() will guide each input house through the tree and return the final prediction.
```python
def _split(self, X_column, split_thresh):
    left_idxs = np.argwhere(X_column <= split_thresh).flatten()
    right_idxs = np.argwhere(X_column > split_thresh).flatten()
    return left_idxs, right_idxs
```
Next, when making predictions, we need to send new houses through the trained tree to find their final predicted price. This is done with _traverse_tree():
```python
def _traverse_tree(self, x, node):
    if node.is_leaf_node():
        return np.squeeze(node.value)
    if x[node.feature] <= node.threshold:
        return self._traverse_tree(x, node.left)
    else:
        return self._traverse_tree(x, node.right)
```
**Base case**: If the node is a leaf, return its predicted value (the output from the trained neural network).
**Recursive case**: Compare the house’s feature value with the node’s threshold and move either left or right.
<br>
#### Reflections:
**This part was easier to understand compared to earlier steps. Once I had the best feature and threshold, splitting the data using NumPy functions felt straightforward. Writing the traversal function also came naturally after I understood the tree’s structure. Seeing how a single sample flows from root to leaf made the prediction process more intuitive and satisfying to implement.**
<br>
### 8. Train a Neural Network at Leaf Nodes using label_network() Function
**We use TensorFlow (Keras) to build and train a neural network for each leaf node to predict the sale price of houses in that specific group.**
This adds flexibility and allows for finer predictions based on the patterns within each leaf node’s dataset.
Architecture of the Neural Network:
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import RMSprop

def label_network(self, X, y):
    model = Sequential()
    # Input layer and first hidden layer with 5 neurons
    model.add(Dense(5, input_dim=X.shape[1], activation='linear'))
    # Two additional hidden layers with 10 neurons each
    model.add(Dense(10, activation='linear'))
    model.add(Dense(10, activation='linear'))
    # Output layer with 1 neuron to predict the sale price
    model.add(Dense(1, activation='linear'))
    # Compile the model using MSE loss and the RMSprop optimizer
    model.compile(loss='mean_squared_error',
                  optimizer=RMSprop(learning_rate=0.01),
                  metrics=['mean_squared_error'])
    # Define a batch size (smaller if fewer samples)
    batch = 1 if (X.shape[0] // 10 == 0) else int(X.shape[0] // 3)
    # Train for 20 epochs
    model.fit(X, y, epochs=20, batch_size=batch, verbose=0)
    # Use the model to predict prices
    y_pred = model.predict(X, verbose=0)
    return y_pred
```
<br>
#### Reflections:
**This was the most difficult and defining part of the entire project. When I first had the idea of placing a neural network at each leaf node, I honestly didn’t know where to start. The concept sounded cool, but I had no idea how to actually make it work. I kept wondering: How do you train a network inside a tree? How do you make sure each one gets the right data?
At that point, I was still learning how neural networks work on their own — backpropagation, layers, weights, activation functions, and now I was trying to integrate all of that into a recursive tree structure. It felt like I was juggling too many things at once. I remember spending hours trying to get the shapes to match, deciding how many layers to use, figuring out how to avoid overfitting with small datasets at each node. It was really frustrating.
But I didn’t want to give up on this idea. I kept reading, testing, and rewriting. Slowly, it started to come together. When I finally saw the networks learning inside the leaf nodes — each one adapting to its local group of houses. It was the most satisfying moment of the whole project. This section became the soul of my hybrid model.**
<br>
### 9. Predict Final House Prices Using predict()
Once the tree is fully trained and each leaf node has its own neural network, we are ready to make predictions on new, unseen data. This is handled by the predict() function. When this function is called, it takes in new house data (with features) and sends each row through the tree using the _traverse_tree() function.
```python
def predict(self, X):
    # send every house (row) through the tree to its leaf and collect the predictions
    y_pred = np.array([self._traverse_tree(x, self.root) for x in X])
    return y_pred
```
***-> At this point, the model no longer uses MSE or backpropagation. It simply flows input through the trained structure and uses the result.***
<br>
#### Reflections:
**This final stretch brought everything together. Although the prediction code was relatively short, it was meaningful — it proved that the entire system I built could now produce real outputs. Watching the model make predictions based on everything I had designed felt like the most incredible feeling out of all the work I'd done before.**
<br>
## Step 3 - Result Analysis
### 1. MSE Prediction
* During training, we use a dictionary called mse_dict to store the MSE values for each tree depth:
```
{1: [node1 MSE, node2 MSE], 2: [node1 MSE, node2 MSE], ...}
```
* After training, we sort the dictionary by depth and calculate the average MSE at each level:
```python
sorted_dict = dict(sorted(self.mse_dict.items()))
self.mse= [sum(values) / len(values) for values in sorted_dict.values()]
```
**We then plot these averaged values to observe how the model error changes with tree depth:**

***-> We can observe that the MSE value decreases as the tree depth increases.***
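One possible way to produce the depth-vs-MSE plot referenced above (assuming mse_per_depth holds the averaged list returned by fit(), as in the usage sketch in Step 2):
```python
import matplotlib.pyplot as plt

# Illustrative sketch of the depth-vs-MSE plot (assumes mse_per_depth is the
# averaged-MSE list returned by fit()).
plt.plot(range(len(mse_per_depth)), mse_per_depth, marker='o')
plt.xlabel('Tree depth')
plt.ylabel('Average MSE')
plt.title('Average split MSE by tree depth')
plt.show()
```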
### 2. Feature Importance
During training, we use a dictionary to count how frequently each feature is used to split nodes in the decision tree.
**This split frequency reflects the relative importance of each feature**—features used more often are likely to have a stronger influence on the model’s predictions.
***After sorting, here is our feature importance plot:***

* We can see that `OverallQual` (overall quality of the house), `GrLivArea` (above-ground living space), and `GarageCars` (garage size) contribute the most in final prediction.
* Other helpful features include `TotalBsmtSF` (basement size), `YearBuilt` (how new the house is), and `1stFlrSF` (first floor area).
* Some features, like `PoolArea`, `KitchenAbvGr` (number of kitchens above ground), and `MiscVal` (miscellaneous extra items), don’t help much. They might only matter for a few houses, so the model doesn’t rely on them much.
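For reference, a bar chart like the one above could be produced with a sketch such as the following (assuming feature_counts is the dictionary returned by fit(), mapping column indices to split counts):
```python
import matplotlib.pyplot as plt

# Illustrative sketch of the feature-importance bar chart (assumes feature_counts maps
# column indices of trainx to split counts, as returned by fit()).
names = [trainx.columns[i] for i in feature_counts]       # map indices back to column names
counts = list(feature_counts.values())
order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)

plt.figure(figsize=(10, 6))
plt.bar([names[i] for i in order], [counts[i] for i in order])
plt.xticks(rotation=90)
plt.ylabel('Number of splits')
plt.title('Feature importance (split frequency)')
plt.tight_layout()
plt.show()
```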
### 3. Final Results
**Predicted vs. Actual Values:**

***-> The predicted and actual values in the validation set show a similar trend, suggesting that our decision tree model performs well.***
<br>
**Prediction Distribution Across the Dataset:**

***->While the distribution is spiky due to the diverse nature of the input features, it shows that our hybrid model is capturing the general range and fluctuations of house prices.***
<br>
## Final Score

***-> We evaluate our model using the Root Mean Squared Error (RMSE) between the logarithm of the predicted values and the logarithm of the observed sales prices. Although the final score is not particularly strong, it is important to note that our model is novel in its approach.***
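A minimal sketch of this metric (illustrative values only):
```python
import numpy as np

# Minimal sketch of the competition metric: RMSE on log-prices (illustrative values).
y_true = np.array([200_000, 150_000, 320_000])
y_pred = np.array([210_000, 140_000, 300_000])

rmse_log = np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))
print(rmse_log)   # lower is better
```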
<br>
## Reflection and Learning
Looking back on this project, **I’m proud of how far I’ve come—not just in terms of code, but in understanding how machine learning actually works behind the scenes.** **What began as a simple curiosity about prediction models turned into one of the most technical and rewarding challenges I’ve ever taken on.**
At the beginning, I struggled with core preprocessing decisions: how to clean missing data, when to drop columns, and how to make sure the train and test sets aligned. These steps seemed straightforward in tutorials, but applying them to a real dataset required judgment and experimentation. I even tried adding feature scaling with StandardScaler, only to later remove it when I realized it didn’t improve predictions. **That taught me the importance of testing ideas rather than assuming they’ll work.**
Building the decision tree from scratch was where things got difficult—but also exciting. I was confused by how a tree could "grow" itself. Recursion felt abstract at first, especially when I had to track depth, feature usage, and MSE across nodes. **What unlocked it for me was learning about Depth-First Search (DFS).** Once I understood that the tree expands one branch fully before backtracking, the logic finally clicked. It was a major breakthrough and made the tree-building process feel tangible.
Picking the threshold to split nodes was another unexpected challenge. I had to figure out how to measure whether a split was “good,” and ended up using the summation of MSE from both left and right child nodes. This wasn’t something I had seen in a tutorial—it was a decision I had to reason out and justify on my own. **The process taught me how real machine learning isn’t about copy-pasting solutions; it’s about making choices based on logic and feedback.**
**But the most difficult and rewarding part of this project was the idea behind the hybrid model itself: combining a decision tree with a neural network.** Implementing a small neural network at each leaf node wasn’t just technically complex—it challenged me conceptually. I had to understand how neural networks process input, how backpropagation works, and how to integrate that into a tree structure. This was the first time I saw how different parts of machine learning could be stitched together into one system. It felt like building something truly my own.
Seeing the final predictions, MSE plots, and feature importance graphs gave me a real sense of closure. These weren’t just numbers—they were confirmation that my model was learning, that my code was working, and that I had made it through every major step. **For the first time, I wasn’t just learning machine learning—I was doing it.**
This project pushed me to understand recursion, evaluation metrics, model architecture, and training flow in a way no lecture ever could. **It taught me to debug, adapt, and most importantly, keep going when things didn’t work the first time.** **More than anything, it gave me the confidence to take on even bigger challenges ahead.**
<br>
## Future Plans
Building this hybrid model is only the beginning. In the future, **I hope to explore more advanced machine learning techniques**—such as ensemble learning, feature selection algorithms, and deep learning architectures. I also plan to improve this project by introducing regularization methods, hyperparameter tuning, and visual explanations to better interpret model decisions.
**Long term, I want to apply machine learning in real-world settings that can make a difference:** optimizing financial tools, improving educational technologies, or contributing to public health models. **I believe that combining solid foundational knowledge with practical implementation will allow me to create models that are not only accurate, but impactful.**
<br>
## References and Image Credits
1. Decision Tree Visualization - Source: https://www.appliedaicourse.com/blog/decision-tree-in-machine-learning/
2. Neural Network Visualization - Source: https://www.cudocompute.com/topics/neural-networks/what-is-a-neural-network
3. Convolution Operation Visualization - Source: https://learnopencv.com/understanding-convolutional-neural-networks-cnn/
4. Activation Functions - Source: https://medium.com/@shrutijadon/survey-on-activation-functions-for-deep-learning-9689331ba092
5. Neural Network Flow - Source: https://wandb.ai/mostafaibrahim17/ml-articles/reports/Decoding-Backpropagation-and-Its-Role-in-Neural-Network-Learning--Vmlldzo3MTY5MzM1
6. DFS/BFS - Source: https://minseok0123.github.io/DFS%EC%99%80%20BFS/
7. MSE Formula - Source: https://suboptimal.wiki/explanation/mse/