
Decision Trees Classification Algorithm in Machine Learning | DevDuniya


A decision tree is a type of supervised machine learning model used for classification and/or regression. It makes predictions by working through a sequence of questions about the features, much like following a flowchart.
A decision tree is a flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome.
A decision tree contains two types of nodes: decision nodes and leaf nodes. Decision nodes test a feature and have multiple branches, whereas leaf nodes hold the output of those decisions and do not branch any further.

Decision Tree in Machine Learning:

Structure:

  • Nodes: Represent features or attributes.
  • Branches: Represent decisions based on feature values (e.g., “age > 30”).
  • Leaves: Represent the predicted outcome (class label for classification).

Entropy in Decision Tree:

Entropy is a measure of impurity or disorder in a dataset. In the context of decision trees, it quantifies the uncertainty about the target variable. Lower entropy indicates that the data is more organized and easier to make decisions from. For a binary target, entropy ranges from 0 (maximum purity) to 1 (maximum impurity).

Formula:

Entropy(S) = - P(yes) · log₂(P(yes)) - P(no) · log₂(P(no))

S = the set of samples at the node
P(yes) = the proportion of samples in S labeled "yes"
P(no) = the proportion of samples in S labeled "no"
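As a quick illustration, here is a minimal sketch of how entropy could be computed for a binary label array using NumPy (the function name binary_entropy is just for illustration):

import numpy as np

def binary_entropy(labels):
    # Entropy of a binary label array (illustrative sketch)
    p_yes = np.mean(labels == 1)   # proportion of class "yes"
    p_no = 1.0 - p_yes             # proportion of class "no"
    entropy = 0.0
    for p in (p_yes, p_no):
        if p > 0:                  # treat 0 * log2(0) as 0
            entropy -= p * np.log2(p)
    return entropy

# Example: a perfectly mixed node has entropy 1.0
print(binary_entropy(np.array([1, 1, 0, 0])))  # 1.0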

Gini Impurity in Decision Tree:

Gini impurity measures the probability of a randomly selected data point being incorrectly classified. A low Gini impurity indicates that the dataset is purer and easier to classify. For a binary target, the Gini index has a maximum impurity of 0.5 and a maximum purity of 0.

Formula:

Gini Impurity = 1 - Σⱼ pⱼ²

pⱼ = the proportion of samples belonging to class j
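Here is a minimal sketch of the same idea in Python, again with an illustrative function name (gini_impurity):

import numpy as np

def gini_impurity(labels):
    # Gini impurity of an array of class labels (illustrative sketch)
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()   # class proportions p_j
    return 1.0 - np.sum(probs ** 2)

# Example: a perfectly mixed binary node has Gini impurity 0.5
print(gini_impurity(np.array([1, 1, 0, 0])))  # 0.5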

Information Gain in Decision Tree:

Information gain is a metric used to decide which feature to split on at each node of the decision tree. It measures the reduction in entropy or Gini impurity achieved by splitting the data based on a particular feature.

Formula:

Information Gain = Entropy(S) - [Weighted Avg × Entropy(each subset after the split)]

where the weighted average uses the fraction of samples in S that fall into each subset created by the split.
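As a rough sketch of how this could look in Python for a categorical feature, the following combines an entropy helper with a weighted average over the subsets (function and variable names are illustrative):

import numpy as np

def entropy(labels):
    # Entropy of a label array (illustrative helper)
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(feature, labels):
    # Information gain from splitting on a categorical feature (sketch)
    parent_entropy = entropy(labels)
    weighted_child_entropy = 0.0
    for value in np.unique(feature):
        subset = labels[feature == value]
        weight = len(subset) / len(labels)   # |S_v| / |S|
        weighted_child_entropy += weight * entropy(subset)
    return parent_entropy - weighted_child_entropy

# Example: a feature that perfectly separates the classes gives gain = 1.0
feature = np.array(["a", "a", "b", "b"])
labels = np.array([1, 1, 0, 0])
print(information_gain(feature, labels))  # 1.0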

Pruning in Decision Tree:

Pruning involves removing branches or nodes from the decision tree to simplify and prevent overfitting. Overfitting occurs when the tree captures noise in the data and performs poorly on new, unseen data.

Overfitting: Decision trees can be prone to overfitting, meaning they perform well on the training data but poorly on unseen data.

Pruning: Techniques to simplify the tree and prevent overfitting.

  • Pre-Pruning: Limits the tree’s growth during the construction phase (e.g., setting maximum depth, minimum samples per leaf).
  • Post-Pruning: Removes branches from a fully grown tree.
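For example, with scikit-learn's DecisionTreeClassifier, pre-pruning can be applied through constructor parameters such as max_depth and min_samples_leaf, and post-pruning through minimal cost-complexity pruning via ccp_alpha (the value 0.01 below is just an illustrative choice, not a recommendation):

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: constrain the tree while it is being grown
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)

# Post-pruning: grow the tree, then prune weak branches with
# minimal cost-complexity pruning (larger ccp_alpha -> more pruning)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42)

Both classifiers are then trained with .fit(X_train, y_train), exactly as in the full example further below.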

Decision Tree Process:

1. Start with the entire dataset.
2. Calculate the entropy (or Gini impurity) of the target variable.

3. For each feature:

  • Calculate the entropy (or Gini impurity) of the target variable for each possible split value of the feature.
  • Calculate the information gain for the feature.

4. Select the feature with the highest information gain (a minimal sketch of this selection step follows the list).
5. Split the dataset based on the selected feature.
6. Repeat steps 2-5 recursively for each subset of the data until a stopping criterion is met (e.g., maximum depth, minimum number of samples per leaf).
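Putting steps 2-4 together, here is a minimal sketch of choosing the best feature to split on, assuming every column of X is categorical (function names and data are illustrative):

import numpy as np

def entropy(labels):
    # Entropy of a label array (same helper as the earlier sketch)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split_feature(X, y):
    # Return the index of the column with the highest information gain
    parent = entropy(y)
    gains = []
    for col in range(X.shape[1]):
        weighted_child = sum(
            (np.sum(X[:, col] == v) / len(y)) * entropy(y[X[:, col] == v])
            for v in np.unique(X[:, col])
        )
        gains.append(parent - weighted_child)
    return int(np.argmax(gains))

# Example: column 0 separates the classes perfectly, column 1 does not
X = np.array([["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"]])
y = np.array([1, 1, 0, 0])
print(best_split_feature(X, y))  # 0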

Advantages of Decision Trees

  • Easy to understand and interpret.
  • Can handle both categorical and numerical data.
  • Can handle missing values.

Disadvantages of Decision Trees:

  • Prone to overfitting.
  • Can be sensitive to small variations in the training data.
  • May not perform well with high-dimensional data.

Decision Tree Python Example:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42) 

# Train the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
