Introduction: From Simple Math to Artificial Intelligence
Neural networks might seem like magic: systems that can recognize faces, translate languages, and generate human-like text. But at their foundation, they're built from something surprisingly simple: individual neurons performing basic mathematical operations.
In this article, we'll build neural networks from the ground up. You'll write and run real Python code in your browser, and by the end, you'll understand:
- What a neuron actually computes and why it matters
- Why we need activation functions to create intelligent behavior
- How multiple neurons work together to solve complex problems
- What backpropagation is and how networks learn from data
- Why deep learning works at a fundamental level
The only prerequisite is basic Python and familiarity with NumPy. If you know what np.dot() does, you're ready to start.
Let's verify your environment is working:
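If NumPy imports cleanly and np.dot() gives the expected answer, you're good to go. A minimal check might look like this:

```python
# Quick environment check: NumPy is the only dependency for this article.
import numpy as np

print("NumPy version:", np.__version__)
print("Dot product test:", np.dot([1, 2], [3, 4]))  # 1*3 + 2*4 = 11
```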
Part 1: Understanding a Single Neuron
What Is a Neuron?
In biological brains, neurons are cells that receive signals from other neurons, process them, and send signals forward. Artificial neurons work similarly, but with pure mathematics.
An artificial neuron performs two fundamental steps:
- Weighted Sum: It takes multiple inputs, multiplies each by a weight (showing importance), and adds them together with a bias term
- Activation: It passes this sum through an activation function to produce the final output
Letβs visualize this structure:
ANATOMY OF AN ARTIFICIAL NEURON

INPUTS         WEIGHTS          WEIGHTED SUM              ACTIVATION   OUTPUT

x₁ = 0.8 ────  × w₁ = 0.5 ──┐
x₂ = 0.6 ────  × w₂ = 0.7 ──┼──  Σ = 1.09 ── +b ── z = 1.19 ──  σ(z)  ──  0.77
x₃ = 0.9 ────  × w₃ = 0.3 ──┘
bias b = 0.1 (added to the sum)

Formula: output = σ(w₁x₁ + w₂x₂ + w₃x₃ + b)
Where:   σ(z) = sigmoid activation = 1 / (1 + e⁻ᶻ)
A Practical Example: Should I Go Running?
Let's make this concrete with a real-world decision: deciding whether to go for a run. We'll use three factors:
- Weather score (0 = terrible, 1 = perfect): 0.8
- Energy level (0 = exhausted, 1 = energized): 0.6
- Available time (0 = no time, 1 = plenty): 0.9
We'll assign weights to show how much each factor matters:
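As a sketch of that calculation (the weights below are the illustrative values from the diagram: 0.5 for weather, 0.7 for energy, 0.3 for time, plus a bias of 0.1):

```python
import numpy as np

# Inputs: weather, energy, time (each scaled to 0..1)
x = np.array([0.8, 0.6, 0.9])
# Weights: how much each factor matters (illustrative values)
w = np.array([0.5, 0.7, 0.3])
b = 0.1  # bias: a baseline tendency to go running

score = np.dot(w, x) + b
print(round(score, 2))  # 0.4 + 0.42 + 0.27 + 0.1 = 1.19
```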
We get a score of 1.19. But there's a problem: what does 1.19 mean? Is that "yes, go running" or "no, stay home"? What if the score were 47.3 or -2.8? We need a way to interpret any number as a probability.
The Activation Function: Sigmoid
To convert any score into a probability between 0 and 1, we use an activation function. The sigmoid function is perfect for this:
σ(z) = 1 / (1 + e^(-z))
The sigmoid function has useful properties:
- Output range: Always between 0 and 1 (perfect for probabilities)
- Smooth curve: Small changes in input produce small changes in output
- Interpretable: 0.5 is the decision boundary (50% confidence)
Let's see it in action:
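A small sigmoid helper, applied to our running score (a sketch, reusing the example numbers):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + np.exp(-z))

print(sigmoid(-5))    # close to 0
print(sigmoid(0))     # exactly 0.5 -- the decision boundary
print(sigmoid(1.19))  # ~0.77: our running score, now a probability
```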
Now 1.19 becomes 77% confidence: a clear "yes, go running!"
Visualizing the Sigmoid Function
Let's plot the sigmoid function to understand its behavior:
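One way to plot it (this sketch assumes matplotlib is available; it renders off-screen with the Agg backend, while in a notebook you would simply call plt.show()):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop this line in a notebook
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 200)
s = sigmoid(z)

plt.plot(z, s)
plt.axhline(0.5, linestyle="--", color="gray")  # the 0.5 decision boundary
plt.xlabel("z (weighted sum)")
plt.ylabel("σ(z)")
plt.title("Sigmoid squashes any input into (0, 1)")
plt.savefig("sigmoid.png")
```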
The Complete Neuron Function
Now let's package everything into a reusable neuron function:
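A minimal version of such a function might look like this (reusing the running example's numbers):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def neuron(inputs, weights, bias):
    """A complete artificial neuron: weighted sum, then activation."""
    z = np.dot(weights, inputs) + bias  # step 1: weighted sum
    return sigmoid(z)                   # step 2: activation

# The running-decision example from above
output = neuron([0.8, 0.6, 0.9], [0.5, 0.7, 0.3], 0.1)
print(round(float(output), 2))  # 0.77
```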
Part 2: The Fundamental Limitation of a Single Neuron
The XOR Problem: A Classic Challenge
We've seen that a single neuron can make simple decisions. But there's a famous problem in machine learning history that exposed a critical limitation: the XOR problem.
XOR (exclusive OR) is a simple logical operation:
- Output 1 if inputs are different
- Output 0 if inputs are the same
Here's the truth table:
XOR TRUTH TABLE
┌─────────┬─────────┬─────────────────┐
│ Input 1 │ Input 2 │     Output      │
├─────────┼─────────┼─────────────────┤
│    0    │    0    │  0 (same)       │
│    0    │    1    │  1 (different)  │
│    1    │    0    │  1 (different)  │
│    1    │    1    │  0 (same)       │
└─────────┴─────────┴─────────────────┘
Let's visualize this problem:
Why Can't a Single Neuron Solve XOR?
A single neuron can only create a linear decision boundary: a straight line (in 2D) or a flat plane (in higher dimensions). But XOR requires a non-linear decision boundary to separate the classes.
Let's see what happens when we try:
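One way to see the failure is brute force: sweep a single neuron's two weights and bias over a grid and record the best accuracy it ever reaches on the four XOR points. The grid below is an illustrative choice:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR targets

# Try many weight/bias combinations for a single neuron.
# Because sigmoid is monotonic, thresholding at 0.5 always
# gives a straight-line boundary -- so some point is always wrong.
best_acc = 0.0
grid = np.linspace(-5, 5, 21)
for w1 in grid:
    for w2 in grid:
        for b in grid:
            preds = (sigmoid(X @ np.array([w1, w2]) + b) > 0.5).astype(int)
            best_acc = max(best_acc, np.mean(preds == y))

print("Best single-neuron accuracy on XOR:", best_acc)  # 0.75, never 1.0
```

No combination classifies all four points correctly; a single neuron tops out at 3 out of 4.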
Visualizing the Linear Boundary Limitation
Let's see exactly why a single neuron fails by plotting its decision boundary:
This limitation comes down to linear separability: a single neuron can only separate classes that a straight line can divide. XOR is a linearly non-separable problem, and such problems are everywhere in real-world data.
This is why we need neural networks with multiple neurons.
Part 3: Neural Networks - Combining Neurons for Intelligence
The Key Insight: Multiple Neurons Create Non-Linear Boundaries
What if we use two neurons in a hidden layer, each creating their own linear boundary, and then combine their outputs with a third neuron? This creates the ability to solve non-linear problems.
Here's the architecture:
TWO-LAYER NEURAL NETWORK ARCHITECTURE

INPUT LAYER            HIDDEN LAYER (2 neurons)      OUTPUT LAYER (1 neuron)

x₁ (Input 1) ──┬──▶ [ Neuron A (learns OR)  ] ──┐
               ╳                                ├──▶ [ Output (combines A, B) ] ──▶ Prediction
x₂ (Input 2) ──┴──▶ [ Neuron B (learns AND) ] ──┘

(Each input feeds both hidden neurons; both hidden outputs feed the output neuron.)

How XOR is solved:
• Neuron A fires when: at least ONE input is 1 (OR logic)
• Neuron B fires when: BOTH inputs are 1 (AND logic)
• Output computes: A is true BUT B is false (XOR logic)
Building a Two-Layer Network
Let's implement this network:
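Here is one sketch of the network with hand-picked weights (the values ±20, -10, -30 are an illustrative choice; many settings work) so that Neuron A approximates OR, Neuron B approximates AND, and the output computes "A but not B":

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hand-picked weights: large magnitudes push sigmoid close to 0 or 1
w_A, b_A = np.array([20.0, 20.0]), -10.0       # OR:  fires if either input is 1
w_B, b_B = np.array([20.0, 20.0]), -30.0       # AND: fires only if both are 1
w_out, b_out = np.array([20.0, -20.0]), -10.0  # A AND NOT B = XOR

def xor_network(x):
    a = sigmoid(np.dot(w_A, x) + b_A)  # hidden neuron A
    b = sigmoid(np.dot(w_B, x) + b_B)  # hidden neuron B
    return sigmoid(w_out[0] * a + w_out[1] * b + b_out)

for x in [[0, 0], [0, 1], [1, 0], [1, 1]]:
    a = sigmoid(np.dot(w_A, x) + b_A)
    b = sigmoid(np.dot(w_B, x) + b_B)
    print(x, "-> A:", round(float(a), 3), "B:", round(float(b), 3),
          "output:", round(float(xor_network(x)), 3))
```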
Understanding What Each Hidden Neuron Learned
Testing the Complete XOR Solution
Visualizing the Non-Linear Decision Boundary
Part 4: Working with Batches - The Matrix View
Before we tackle learning, let's clean up our code. Instead of processing one sample at a time, we can use matrix operations to process entire batches simultaneously.
Layer as a Matrix Operation
Batch Processing
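The two steps above boil down to the same trick. A sketch, reusing the hand-picked OR/AND weights from the XOR network: each row of X is a sample, each column of W is one neuron, and a single matrix multiply runs every sample through every neuron at once:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # batch of 4 samples
W = np.array([[20.0, 20.0],                     # 2 inputs -> 2 hidden neurons
              [20.0, 20.0]])                    # column 0: OR, column 1: AND
b = np.array([-10.0, -30.0])                    # one bias per neuron

hidden = sigmoid(X @ W + b)  # shape (4, 2): every sample through every neuron
print(hidden.shape)          # (4, 2)
print(np.round(hidden, 3))
```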
Part 5: The Loss Function - Measuring How Wrong We Are
So far, we've used hand-picked weights. But how do we find good weights automatically? We need a way to measure how "wrong" our predictions are.
Mean Squared Error (MSE)
The most common loss function for this task is Mean Squared Error:
MSE = (1/n) × Σ(prediction - target)²
MEAN SQUARED ERROR (MSE) CALCULATION

Sample   Target   Prediction    Error     Squared Error
────────────────────────────────────────────────────────
[0,0]      0        0.002      +0.002       0.000004
[0,1]      1        0.998      -0.002       0.000004
[1,0]      1        0.997      -0.003       0.000009
[1,1]      0        0.003      +0.003       0.000009
────────────────────────────────────────────────────────
                                  Sum:      0.000026
                           Average (÷4):    0.0000065  ← MSE

Low MSE = good predictions (errors close to 0)
High MSE = bad predictions (errors far from 0)
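The whole table collapses to one line of NumPy. A sketch with the same numbers:

```python
import numpy as np

targets     = np.array([0, 1, 1, 0])
predictions = np.array([0.002, 0.998, 0.997, 0.003])  # from the table above

mse = np.mean((predictions - targets) ** 2)
print(mse)  # ~6.5e-06
```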
Random Weights = High Loss
The goal of training: start with random weights and iteratively adjust them to minimize the loss function, bringing predictions closer to the targets.
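To make the "random weights = high loss" point concrete, here is a sketch that initializes a small network randomly (the architecture and seed are illustrative) and measures its loss on XOR:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Untrained 2-2-1 network: random weights, zero biases
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)

hidden = sigmoid(X @ W1 + b1)
preds = sigmoid(hidden @ W2 + b2)
loss = np.mean((preds - y) ** 2)
print("Loss with random weights:", round(float(loss), 3))  # sizeable: nothing learned yet
```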
Part 6: Backpropagation - How Networks Learn
The Gradient Descent Intuition
Imagine the loss function as a mountainous landscape. Your current weights place you somewhere on this landscape, and the loss is your altitude. Training is the process of walking downhill to find the lowest point.
But how do you know which direction is downhill? That's where gradients come in.
GRADIENT DESCENT: Walking Downhill to Minimize Loss

Loss
 ▲
 │  ●  Start: random weights, high loss
 │   ╲
 │    ●  Step 1: follow the gradient down
 │     ╲
 │      ●  Step 2: keep descending
 │       ╲
 │        ●  Step 3: getting closer
 │         ╲
 │          ●  Step 4: nearly there
 └───────────●─────────────────────────▶ Weights
             ▲
             Goal: minimum loss (optimal weights)

Gradient = direction of steepest INCREASE in loss.
We move OPPOSITE to the gradient = go DOWNHILL = reduce loss.
Computing Gradients: The Chain Rule
For each weight, we need to know: "If I change this weight slightly, how much does the loss change?"
This is a derivative, and because our network is a chain of functions (input → hidden → output → loss), we use the chain rule to compute it. This process is called backpropagation because we compute gradients by working backward from the output.
Implementation: The Backward Pass
Propagating Back to Hidden Layer
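Assuming the 2-2-1 sigmoid network with MSE loss used throughout, a sketch covering both steps above might look like this (the variable names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv(a):
    # Derivative of sigmoid written in terms of its output: σ'(z) = a(1 - a)
    return a * (1 - a)

rng = np.random.default_rng(0)  # arbitrary seed
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)

# Forward pass, keeping intermediate activations for the backward pass
hidden = sigmoid(X @ W1 + b1)
output = sigmoid(hidden @ W2 + b2)
loss = np.mean((output - y) ** 2)

# Backward pass: chain rule, from the loss back toward the inputs
n = len(X)
d_out = 2 * (output - y) / n * sigmoid_deriv(output)  # dLoss/dz at the output
dW2 = hidden.T @ d_out
db2 = d_out.sum(axis=0)
d_hid = (d_out @ W2.T) * sigmoid_deriv(hidden)        # propagate back through W2
dW1 = X.T @ d_hid
db1 = d_hid.sum(axis=0)

# One small gradient-descent step should reduce the loss
lr = 0.5
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
new_loss = np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2)
print(round(float(loss), 4), "->", round(float(new_loss), 4))
```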
Part 7: Complete Training - Watch the Network Learn
Now let's put everything together: forward pass, loss calculation, backpropagation, and weight updates. We'll train a network from scratch and watch it learn to solve XOR.
The Full Neural Network Class
Training from Scratch
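As a compact sketch (not the only way to structure it), here is an entire training run in one cell. The layer sizes mirror the Keras example later in the article; the seed, learning rate, and epoch count are illustrative choices, and convergence can vary with them:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Architecture: 2 inputs -> 4 hidden neurons -> 1 output
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr, epochs = 1.0, 10_000

initial_loss = np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2)

for epoch in range(epochs):
    # Forward pass
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass (chain rule, as derived above)
    d_out = 2 * (output - y) / len(X) * output * (1 - output)
    d_hid = (d_out @ W2.T) * hidden * (1 - hidden)

    # Gradient-descent updates
    W2 -= lr * hidden.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid
    b1 -= lr * d_hid.sum(axis=0)

output = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
final_loss = np.mean((output - y) ** 2)
print("Loss:", round(float(initial_loss), 4), "->", round(float(final_loss), 4))
print("Predictions:", np.round(output.ravel(), 3))  # should approach 0, 1, 1, 0
```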
Visualizing the Learning Process
Part 8: From NumPy to Real Frameworks
Everything we've built (forward passes, backpropagation, gradient descent) is exactly what modern deep learning frameworks do. They just do it faster, on GPUs, with more features.
The Keras Equivalent
Our 60 lines of NumPy code are equivalent to this in Keras:
from tensorflow import keras

# Define architecture
model = keras.Sequential([
    keras.layers.Dense(4, activation='sigmoid', input_shape=(2,)),
    keras.layers.Dense(1, activation='sigmoid')
])

# Specify optimizer and loss
model.compile(optimizer='sgd', loss='mse')

# Train
model.fit(X, y, epochs=3000, verbose=0)
The concepts are identical:
- Layers: Dense layers = our matrix multiplications + activation
- Loss: MSE = our loss function
- Optimizer: SGD (stochastic gradient descent) = our weight updates
- Fit: Training loop = our forward/backward passes
Conclusion: What You've Learned
Congratulations! You've built a complete neural network from scratch and understand:
Core Concepts
- Neurons compute weighted sums then apply activation functions
- Sigmoid activation converts any value to a probability (0-1)
- Single neurons create linear boundaries (can't solve XOR)
- Multiple neurons create non-linear boundaries (can solve XOR)
- Matrix operations enable efficient batch processing
- Loss functions quantify prediction error
- Backpropagation computes gradients via the chain rule
- Gradient descent iteratively minimizes loss
The Big Picture
THE COMPLETE DEEP LEARNING TRAINING CYCLE

STEP 1: INITIALIZATION
  • Create random weights and biases
  • Network knows nothing yet
        │
        ▼
STEP 2: FORWARD PASS
  • Input → hidden layers → output
  • Compute: activation = σ(weights × input + bias)
  • Generate predictions
        │
        ▼
STEP 3: COMPUTE LOSS
  • Compare predictions to true targets
  • Calculate error: MSE = mean((pred - target)²)
  • Quantify how wrong we are
        │
        ▼
STEP 4: BACKPROPAGATION
  • Compute gradients using the chain rule
  • Find ∂Loss/∂Weight for every weight
  • Determine how to adjust each weight
        │
        ▼
STEP 5: GRADIENT DESCENT (update weights)
  • weight_new = weight_old - learning_rate × gradient
  • Take a small step downhill on the loss landscape
  • Network gets slightly better
        │
        ▼
Repeat steps 2-5 for thousands of epochs  ← the training loop
        │
        ▼
Done! Trained model ready to use
What Comes Next
Everything in modern deep learning builds on these foundations:
- CNNs (Convolutional Neural Networks): Specialized layers for images
- RNNs (Recurrent Neural Networks): Handle sequences (text, time series)
- Transformers: Attention mechanisms for language models (GPT, BERT)
- Regularization: Techniques to prevent overfitting (dropout, batch norm)
- Optimizers: Better than vanilla gradient descent (Adam, RMSprop)
But at their core, they all use the same principles you've learned:
- Neurons computing weighted sums
- Activation functions introducing non-linearity
- Backpropagation computing gradients
- Gradient descent minimizing loss
You now understand the foundation of artificial intelligence.
References
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning representations by back-propagating errors." The original backpropagation paper.
- LeCun, Y., Bengio, Y., & Hinton, G. (2015). "Deep learning." Nature. A comprehensive overview by the pioneers.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. The definitive textbook.
- Nielsen, M. A. (2015). Neural Networks and Deep Learning. An excellent free online book.