A Neuron Is All It Took: Building Neural Networks from Scratch

Introduction: From Simple Math to Artificial Intelligence

Neural networks might seem like magic: systems that can recognize faces, translate languages, and generate human-like text. But at their foundation, they’re built from something surprisingly simple: individual neurons performing basic mathematical operations.

In this article, we’ll build neural networks from the ground up. You’ll write and run real Python code in your browser, and by the end, you’ll understand:

  • What a neuron actually computes and why it matters
  • Why we need activation functions to create intelligent behavior
  • How multiple neurons work together to solve complex problems
  • What backpropagation is and how networks learn from data
  • Why deep learning works at a fundamental level

The only prerequisite is basic Python and familiarity with NumPy. If you know what np.dot() does, you’re ready to start.

Let’s verify your environment is working:
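A minimal check is enough; any recent Python 3 with NumPy installed will do:

```python
import sys

import numpy as np

# Confirm the interpreter and NumPy are available
print(f"Python {sys.version.split()[0]}")
print(f"NumPy  {np.__version__}")

# A tiny dot product: the core operation every neuron performs
print(np.dot([1.0, 2.0], [3.0, 4.0]))  # 1*3 + 2*4 = 11.0
```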


Part 1: Understanding a Single Neuron

What Is a Neuron?

In biological brains, neurons are cells that receive signals from other neurons, process them, and send signals forward. Artificial neurons work similarly, but with pure mathematics.

An artificial neuron performs two fundamental steps:

  1. Weighted Sum: It takes multiple inputs, multiplies each by a weight (showing importance), and adds them together with a bias term
  2. Activation: It passes this sum through an activation function to produce the final output

Let’s visualize this structure:


    ANATOMY OF AN ARTIFICIAL NEURON
    ═══════════════════════════════════════════════════════════════════

     INPUTS          WEIGHTS         WEIGHTED SUM      ACTIVATION    OUTPUT

                                                           ┌─────┐
    x₁ = 0.8 ───→ × w₁ = 0.5 ──┐                          │     │
                                │                          │     │
    x₂ = 0.6 ───→ × w₂ = 0.7 ──┼──→ Σ = 1.09 ──→ z ───→ σ(z) ──→ 0.77
                                │     +0.10         +b     │     │
    x₃ = 0.9 ───→ × w₃ = 0.3 ──┘      ────                │     │
                                       1.19                └─────┘
    bias b = 0.1 ──────────────────────────┘

    ═══════════════════════════════════════════════════════════════════
    Formula:  output = σ(w₁x₁ + w₂x₂ + w₃x₃ + b)
    Where:    σ(z) = sigmoid activation = 1 / (1 + e⁻ᶻ)

A Practical Example: Should I Go Running?

Let’s make this concrete with a real-world decision: deciding whether to go for a run. We’ll use three factors:

  • Weather score (0 = terrible, 1 = perfect): 0.8
  • Energy level (0 = exhausted, 1 = energized): 0.6
  • Available time (0 = no time, 1 = plenty): 0.9

We’ll assign weights to show how much each factor matters:
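A direct translation into NumPy (the weights 0.5, 0.7, 0.3 and bias 0.1 match the diagram above):

```python
import numpy as np

inputs  = np.array([0.8, 0.6, 0.9])   # weather, energy, time
weights = np.array([0.5, 0.7, 0.3])   # how much each factor matters
bias    = 0.1                          # baseline willingness to run

# Step 1 of a neuron: weighted sum plus bias
score = np.dot(inputs, weights) + bias
print(f"Score: {score:.2f}")           # 0.40 + 0.42 + 0.27 + 0.10 = 1.19
```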

We get a score of 1.19. But there’s a problem: what does 1.19 mean? Is that “yes, go running” or “no, stay home”? What if the score was 47.3 or -2.8? We need a way to interpret any number as a probability.

The Activation Function: Sigmoid

To convert any score into a probability between 0 and 1, we use an activation function. The sigmoid function is perfect for this:

σ(z) = 1 / (1 + e^(-z))

The sigmoid function has useful properties:

  • Output range: Always between 0 and 1 (perfect for probabilities)
  • Smooth curve: Small changes in input produce small changes in output
  • Interpretable: 0.5 is the decision boundary (50% confidence)

Let’s see it in action:
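A few lines of NumPy show how it tames any score, including the extreme values mentioned above:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + np.exp(-z))

# The running score from above, plus some extreme values
for z in [1.19, 47.3, -2.8, 0.0]:
    print(f"sigmoid({z:6.2f}) = {sigmoid(z):.4f}")
```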

Now 1.19 becomes 77% confidence: a clear “yes, go running!”

Visualizing the Sigmoid Function

Let’s plot the sigmoid function to understand its behavior:
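A sketch of the plot (matplotlib is assumed; if it is unavailable, a few sample values are printed instead):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 201)
s = sigmoid(z)

try:
    import matplotlib.pyplot as plt
    plt.plot(z, s)
    plt.axhline(0.5, linestyle="--", color="gray")   # the 0.5 decision boundary
    plt.xlabel("z (weighted sum)")
    plt.ylabel("sigmoid(z)")
    plt.title("The sigmoid squashes any input into (0, 1)")
    plt.show()
except ImportError:
    # Fallback: print a few sample points instead of plotting
    for zi in (-10, -2, 0, 2, 10):
        print(f"sigmoid({zi:3d}) = {sigmoid(zi):.4f}")
```

Notice the S-shape: large negative inputs flatten toward 0, large positive inputs toward 1, and the curve is steepest around z = 0.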

The Complete Neuron Function

Now let’s package everything into a reusable neuron function:
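One way to package it, reusing the running example from earlier:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum, then sigmoid activation."""
    z = np.dot(inputs, weights) + bias   # step 1: weighted sum
    return sigmoid(z)                     # step 2: activation

# The "should I go running?" decision from earlier
confidence = neuron(np.array([0.8, 0.6, 0.9]),
                    np.array([0.5, 0.7, 0.3]),
                    0.1)
print(f"Confidence: {confidence:.2f}")   # about 0.77 -> go running
```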


Part 2: The Fundamental Limitation of a Single Neuron

The XOR Problem: A Classic Challenge

We’ve seen that a single neuron can make simple decisions. But there’s a famous problem in machine learning history that exposed a critical limitation: the XOR problem.

XOR (exclusive OR) is a simple logical operation:

  • Output 1 if inputs are different
  • Output 0 if inputs are the same

Here’s the truth table:


     XOR TRUTH TABLE
    ═══════════════════════════════════
     Input 1 │ Input 2 │ Output
    ─────────┼─────────┼─────────────────
        0    │    0    │   0     (same)
        0    │    1    │   1     (different)
        1    │    0    │   1     (different)
        1    │    1    │   0     (same)
    ═══════════════════════════════════

Let’s visualize this problem:
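Plotted (or simply printed, if matplotlib is unavailable), the four points form a pattern that no single straight line can split:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])   # XOR: 1 when the inputs differ

try:
    import matplotlib.pyplot as plt
    for label, marker in [(0, "o"), (1, "s")]:
        pts = X[y == label]
        plt.scatter(pts[:, 0], pts[:, 1], marker=marker, s=200,
                    label=f"XOR = {label}")
    plt.xlabel("Input 1"); plt.ylabel("Input 2"); plt.legend()
    plt.title("XOR: opposite corners share a class")
    plt.show()
except ImportError:
    for (a, b), t in zip(X, y):
        print(f"({a}, {b}) -> {t}")
```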

Why Can’t a Single Neuron Solve XOR?

A single neuron can only create a linear decision boundary: a straight line (in 2D) or a flat plane (in higher dimensions). But XOR requires a non-linear decision boundary to separate the classes.

Let’s see what happens when we try:
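One way to make the failure concrete is a brute-force search: try every weight and bias combination on a small grid (the grid here is an arbitrary but representative sample) and count how many of the four XOR cases each single neuron gets right:

```python
import numpy as np
from itertools import product

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Exhaustively try single-neuron weights (w1, w2) and bias b on a grid
best = 0
grid = np.arange(-2.0, 2.5, 0.5)
for w1, w2, b in product(grid, repeat=3):
    preds = (sigmoid(X @ np.array([w1, w2]) + b) > 0.5).astype(int)
    best = max(best, int((preds == y).sum()))

print(f"Best accuracy any single neuron achieved: {best}/4")
```

No matter which weights we pick, one of the four cases is always wrong: 3 out of 4 is the ceiling for a single neuron.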

Visualizing the Linear Boundary Limitation

Let’s see exactly why a single neuron fails by plotting its decision boundary:
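A coarse text rendering makes the point even without a plotting library. The weights below (a neuron that fires when x₁ + x₂ > 0.5, one of the 3-out-of-4 solutions from the search above) are an illustrative choice; we sample the unit square and mark each side of the boundary:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One reasonably good single neuron: fires when x1 + x2 > 0.5
w, b = np.array([1.0, 1.0]), -0.5

def predict(x1, x2):
    return int(sigmoid(w @ np.array([x1, x2]) + b) > 0.5)

# Sample the unit square: '#' where the neuron predicts 1, '.' where 0
for x2 in np.linspace(1, 0, 11):
    print("".join("#" if predict(x1, x2) else "."
                  for x1 in np.linspace(0, 1, 11)))

# The boundary is one straight diagonal line, so the corners (0,0) and (1,1)
# can never both be separated from (0,1) and (1,0).
```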

This limitation is known as linear separability: a single neuron can only solve problems where a straight line can separate the classes. XOR is a linearly non-separable problem, and such problems are everywhere in real-world data.

This is why we need neural networks with multiple neurons.


Part 3: Neural Networks - Combining Neurons for Intelligence

The Key Insight: Multiple Neurons Create Non-Linear Boundaries

What if we use two neurons in a hidden layer, each creating its own linear boundary, and then combine their outputs with a third neuron? This combination creates the ability to solve non-linear problems.

Here’s the architecture:


    TWO-LAYER NEURAL NETWORK ARCHITECTURE
    ═════════════════════════════════════════════════════════════

    INPUT LAYER         HIDDEN LAYER          OUTPUT LAYER
                        (2 neurons)           (1 neuron)

                      ┌──────────────┐
    x₁ (Input 1) ────→│   Neuron A   │────┐
                  │   │  (learns OR) │    │
                  │   └──────────────┘    │   ┌──────────────┐
                  │                       ├──→│    Output    │──→ Prediction
                  │   ┌──────────────┐    │   │(combines A,B)│
    x₂ (Input 2) ─┼──→│   Neuron B   │────┘   └──────────────┘
                      │ (learns AND) │
                      └──────────────┘

    ═════════════════════════════════════════════════════════════
    How XOR is solved:
    • Neuron A fires when: At least ONE input is 1  (OR logic)
    • Neuron B fires when: BOTH inputs are 1        (AND logic)
    • Output computes:     A is true BUT B is false (XOR logic)

Building a Two-Layer Network

Let’s implement this network:
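A direct implementation with hand-picked weights that realize the OR / AND / "A but not B" logic from the diagram (the specific values, like 20 and -10, are one workable choice; a trained network would find its own). Large weights push the sigmoids close to 0 or 1, making each neuron behave like a crisp logic gate:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hand-picked weights (illustrative; training would discover equivalents)
w_A, b_A = np.array([20.0, 20.0]), -10.0       # Neuron A: OR  (either input)
w_B, b_B = np.array([20.0, 20.0]), -30.0       # Neuron B: AND (both inputs)
w_out, b_out = np.array([20.0, -20.0]), -10.0  # Output: A AND NOT B

def two_layer_network(x):
    a = sigmoid(np.dot(w_A, x) + b_A)   # hidden neuron A
    b = sigmoid(np.dot(w_B, x) + b_B)   # hidden neuron B
    return sigmoid(w_out[0] * a + w_out[1] * b + b_out)

print(" x1 x2 |   A      B    | output  target")
for x, target in zip([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 1, 1, 0]):
    a = sigmoid(np.dot(w_A, x) + b_A)
    b = sigmoid(np.dot(w_B, x) + b_B)
    out = two_layer_network(np.array(x))
    print(f"  {x[0]}  {x[1]} | {a:.3f}  {b:.3f} | {out:.3f}    {target}")
```

The printed table doubles as an inspection of what each hidden neuron computes: A behaves like OR, B like AND, and the output fires only when A is on and B is off, which is exactly XOR.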

Understanding What Each Hidden Neuron Learned

Testing the Complete XOR Solution

Visualizing the Non-Linear Decision Boundary
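Sampling the same hand-picked network over the unit square shows the boundary is now a band, not a line; a text rendering is enough to see the non-linear region (matplotlib would show it more smoothly):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def two_layer_network(x1, x2):
    a = sigmoid(20 * x1 + 20 * x2 - 10)    # OR-like hidden neuron
    b = sigmoid(20 * x1 + 20 * x2 - 30)    # AND-like hidden neuron
    return sigmoid(20 * a - 20 * b - 10)   # A AND NOT B

# Sample the unit square: '#' where the network predicts 1
for x2 in np.linspace(1, 0, 11):
    print("".join("#" if two_layer_network(x1, x2) > 0.5 else "."
                  for x1 in np.linspace(0, 1, 11)))
# The '1' region is a diagonal band: (0,1) and (1,0) fall inside it,
# while (0,0) and (1,1) fall outside. No single line could do this.
```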


Part 4: Working with Batches - The Matrix View

Before we tackle learning, let’s clean up our code. Instead of processing one sample at a time, we can use matrix operations to process entire batches simultaneously.

Layer as a Matrix Operation
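The hidden layer from Part 3 collapses into one matrix-vector product. In the convention sketched here, each row of the weight matrix holds one neuron's weights:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One layer = one matrix multiply. Each row of W is one neuron's weights.
W_hidden = np.array([[20.0, 20.0],     # Neuron A's weights (OR-like)
                     [20.0, 20.0]])    # Neuron B's weights (AND-like)
b_hidden = np.array([-10.0, -30.0])    # one bias per neuron

x = np.array([0.0, 1.0])               # a single sample
hidden = sigmoid(W_hidden @ x + b_hidden)
print(hidden)   # both hidden activations from one operation
```

For input (0, 1), neuron A's activation lands near 1 and neuron B's near 0, matching the per-neuron computation from before.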

Batch Processing
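Stacking all four XOR samples into one matrix processes them in a single forward pass; the array shapes do the bookkeeping:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# All four XOR samples as one matrix: one row per sample
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])

W_hidden = np.array([[20., 20.], [20., 20.]])   # OR-like and AND-like neurons
b_hidden = np.array([-10., -30.])
W_out = np.array([20., -20.])                    # output: A AND NOT B
b_out = -10.

hidden = sigmoid(X @ W_hidden.T + b_hidden)  # shape (4, 2): every sample, every hidden neuron
output = sigmoid(hidden @ W_out + b_out)     # shape (4,): one prediction per sample
print(np.round(output, 3))
```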


Part 5: The Loss Function - Measuring How Wrong We Are

So far, we’ve used hand-picked weights. But how do we find good weights automatically? We need a way to measure how “wrong” our predictions are.

Mean Squared Error (MSE)

The most common loss function for this task is Mean Squared Error:

MSE = (1/n) × Σ(prediction - target)²

    MEAN SQUARED ERROR (MSE) CALCULATION
    ══════════════════════════════════════════════════════════════════════

    Sample    Target    Prediction    Error         Squared Error
    ──────────────────────────────────────────────────────────────────────
     [0,0]      0         0.002       +0.002         0.000004
     [0,1]      1         0.998       -0.002         0.000004
     [1,0]      1         0.997       -0.003         0.000009
     [1,1]      0         0.003       +0.003         0.000009
    ──────────────────────────────────────────────────────────────────────
                                      Sum:           0.000026
                                      Average (÷4):  0.0000065  ← MSE

    ══════════════════════════════════════════════════════════════════════
    Low MSE  = Good predictions (errors close to 0) ✓
    High MSE = Bad predictions  (errors far from 0) ✗
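The table's numbers reduce to one line of NumPy (the predictions are the ones from the table, as produced by a well-trained network):

```python
import numpy as np

targets     = np.array([0, 1, 1, 0])
predictions = np.array([0.002, 0.998, 0.997, 0.003])  # from a trained network

# Mean Squared Error: average of the squared differences
mse = np.mean((predictions - targets) ** 2)
print(f"MSE: {mse}")   # matches the table: 0.0000065
```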

Random Weights = High Loss
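An untrained network, by contrast, produces predictions near 0.5 and a much larger loss. A minimal sketch (the 2-4-1 architecture and the random seed are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# A 2-4-1 network with random, untrained weights
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

output = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel()
mse = np.mean((output - y) ** 2)
print(f"Untrained MSE: {mse:.4f}")   # typically around 0.25: no better than guessing
```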

The goal of training: Start with random weights and iteratively adjust them to minimize the loss function, bringing predictions closer to targets.


Part 6: Backpropagation - How Networks Learn

The Gradient Descent Intuition

Imagine the loss function as a mountainous landscape. Your current weights place you somewhere on this landscape, and the loss is your altitude. Training is the process of walking downhill to find the lowest point.

But how do you know which direction is downhill? That’s where gradients come in.


    GRADIENT DESCENT: Walking Downhill to Minimize Loss
    ═════════════════════════════════════════════════════════════

    Loss
      ▲
      │
      │    ●  Start: Random weights, High loss
      │    │
      │    └──●  Step 1: Follow gradient down
      │        │
      │        └──●  Step 2: Keep descending
      │            │
      │            └──●  Step 3: Getting closer
      │                │
      │                └──●  Step 4: Nearly there
      │                    │
      └────────────────────●────────────────────────────────→ Weights
                          ▲
                          Goal: Minimum loss (optimal weights)

    ═════════════════════════════════════════════════════════════
    Gradient = Direction of steepest INCREASE in loss
    We move OPPOSITE to gradient = Go DOWNHILL = Reduce loss

Computing Gradients: The Chain Rule

For each weight, we need to know: “If I change this weight slightly, how much does the loss change?”

This is a derivative, and because our network is a chain of functions (input β†’ hidden β†’ output β†’ loss), we use the chain rule to compute it. This process is called backpropagation because we compute gradients by working backward from the output.

Implementation: The Backward Pass

Propagating Back to Hidden Layer
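A sketch of the full backward pass for a small 2-2-1 network, covering both the output layer and the propagation back to the hidden layer. The variable names (dZ2, dW1, and so on) are illustrative conventions, and a finite-difference check is added at the end to verify that the chain-rule gradients are correct:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(42)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)

def forward(W1, b1, W2, b2):
    A1 = sigmoid(X @ W1 + b1)          # hidden activations
    A2 = sigmoid(A1 @ W2 + b2)         # output predictions
    return A1, A2

def loss(A2):
    return np.mean((A2 - y) ** 2)

# ---- Backward pass: chain rule, working backward from the loss ----
A1, A2 = forward(W1, b1, W2, b2)
n = len(X)

dA2 = 2 * (A2 - y) / n                 # dLoss/dA2, from the MSE formula
dZ2 = dA2 * A2 * (1 - A2)              # chain through sigmoid: s' = s(1 - s)
dW2 = A1.T @ dZ2                       # dLoss/dW2
db2 = dZ2.sum(axis=0)                  # dLoss/db2
dA1 = dZ2 @ W2.T                       # propagate error back to the hidden layer
dZ1 = dA1 * A1 * (1 - A1)              # chain through the hidden sigmoid
dW1 = X.T @ dZ1                        # dLoss/dW1
db1 = dZ1.sum(axis=0)                  # dLoss/db1

# ---- Verify one gradient numerically with finite differences ----
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss(forward(W1_plus, b1, W2, b2)[1])
           - loss(forward(W1_minus, b1, W2, b2)[1])) / (2 * eps)

print(f"analytic dLoss/dW1[0,0]: {dW1[0, 0]:.8f}")
print(f"numeric  dLoss/dW1[0,0]: {numeric:.8f}")
```

Nudging a weight by a tiny amount and re-measuring the loss gives (almost exactly) the same number as the chain rule, which is the whole point: backpropagation computes all these derivatives in one backward sweep instead of one forward pass per weight.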


Part 7: Complete Training - Watch the Network Learn

Now let’s put everything together: forward pass, loss calculation, backpropagation, and weight updates. We’ll train a network from scratch and watch it learn to solve XOR.

The Full Neural Network Class

Training from Scratch
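One possible assembly of the class and training loop (the hidden size of 4, learning rate of 1.0, 5000 epochs, and the random seed are all illustrative choices, and gradient descent is not guaranteed to escape every local minimum):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

class NeuralNetwork:
    """A 2-hidden-1 network trained with plain gradient descent."""

    def __init__(self, hidden_size=4, seed=42):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(size=(2, hidden_size))
        self.b1 = np.zeros(hidden_size)
        self.W2 = rng.normal(size=(hidden_size, 1))
        self.b2 = np.zeros(1)

    def forward(self, X):
        self.A1 = sigmoid(X @ self.W1 + self.b1)   # hidden layer
        self.A2 = sigmoid(self.A1 @ self.W2 + self.b2)  # output layer
        return self.A2

    def train(self, X, y, epochs=5000, lr=1.0):
        losses = []
        for _ in range(epochs):
            out = self.forward(X)
            losses.append(np.mean((out - y) ** 2))

            # Backward pass: chain rule through both sigmoid layers
            n = len(X)
            dZ2 = 2 * (out - y) / n * out * (1 - out)
            dZ1 = (dZ2 @ self.W2.T) * self.A1 * (1 - self.A1)

            # Gradient descent step: move opposite to the gradient
            self.W2 -= lr * self.A1.T @ dZ2
            self.b2 -= lr * dZ2.sum(axis=0)
            self.W1 -= lr * X.T @ dZ1
            self.b1 -= lr * dZ1.sum(axis=0)
        return losses

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

net = NeuralNetwork()
losses = net.train(X, y)
print(f"Loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
print(np.round(net.forward(X).ravel(), 3))
```

The recorded losses can then be plotted against epoch number to watch the descent; the curve drops steadily as the weights walk downhill on the loss landscape.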

Visualizing the Learning Process


Part 8: From NumPy to Real Frameworks

Everything we’ve built (forward passes, backpropagation, gradient descent) is exactly what modern deep learning frameworks do. They just do it faster, on GPUs, with more features.

The Keras Equivalent

Our 60 lines of NumPy code is equivalent to this in Keras:

from tensorflow import keras

# Define architecture
model = keras.Sequential([
    keras.layers.Dense(4, activation='sigmoid', input_shape=(2,)),
    keras.layers.Dense(1, activation='sigmoid')
])

# Specify optimizer and loss
model.compile(optimizer='sgd', loss='mse')

# Train
model.fit(X, y, epochs=3000, verbose=0)

The concepts are identical:

  • Layers: Dense layers = our matrix multiplications + activation
  • Loss: MSE = our loss function
  • Optimizer: SGD (stochastic gradient descent) = our weight updates
  • Fit: Training loop = our forward/backward passes

Conclusion: What You’ve Learned

Congratulations! You’ve built a complete neural network from scratch and understand:

Core Concepts

  1. Neurons compute weighted sums then apply activation functions
  2. Sigmoid activation converts any value to a probability (0-1)
  3. Single neurons create linear boundaries (can’t solve XOR)
  4. Multiple neurons create non-linear boundaries (can solve XOR)
  5. Matrix operations enable efficient batch processing
  6. Loss functions quantify prediction error
  7. Backpropagation computes gradients via the chain rule
  8. Gradient descent iteratively minimizes loss

The Big Picture


    THE COMPLETE DEEP LEARNING TRAINING CYCLE
    ═══════════════════════════════════════════════════════════════════

          ┌────────────────────────────────────────────────────┐
          │  STEP 1: INITIALIZATION                            │
          │  • Create random weights and biases                │
          │  • Network knows nothing yet                       │
          └────────────────────┬───────────────────────────────┘
                               │
                               ↓
          ┌────────────────────────────────────────────────────┐
          │  STEP 2: FORWARD PASS                              │
          │  • Input → Hidden layers → Output                  │
          │  • Compute: activation = σ(weights × input + bias) │
          │  • Generate predictions                            │
          └────────────────────┬───────────────────────────────┘
                               │
                               ↓
          ┌────────────────────────────────────────────────────┐
          │  STEP 3: COMPUTE LOSS                              │
          │  • Compare predictions to true targets             │
          │  • Calculate error: MSE = mean((pred - target)²)   │
          │  • Quantify how wrong we are                       │
          └────────────────────┬───────────────────────────────┘
                               │
                               ↓
          ┌────────────────────────────────────────────────────┐
          │  STEP 4: BACKPROPAGATION                           │
          │  • Compute gradients using chain rule              │
          │  • Find ∂Loss/∂Weight for every weight             │
          │  • Determine how to adjust each weight             │
          └────────────────────┬───────────────────────────────┘
                               │
                               ↓
          ┌────────────────────────────────────────────────────┐
          │  STEP 5: GRADIENT DESCENT (Update Weights)         │
          │  • weight_new = weight_old - learning_rate × grad  │
          │  • Take small step downhill on loss landscape      │
          │  • Network gets slightly better                    │
          └────────────────────┬───────────────────────────────┘
                               │
                               ↓
                       ╔═══════════════╗
                       ║  Repeat 2-5   ║
                       ║  for 1000s of ║  ← Training loop
                       ║    epochs     ║
                       ╚═══════╦═══════╝
                               │
                               ↓
                       ┌───────────────┐
                       │   ✓ DONE!     │
                       │ Trained model │
                       │  ready to use │
                       └───────────────┘

    ═══════════════════════════════════════════════════════════════════

What Comes Next

Everything in modern deep learning builds on these foundations:

  • CNNs (Convolutional Neural Networks): Specialized layers for images
  • RNNs (Recurrent Neural Networks): Handle sequences (text, time series)
  • Transformers: Attention mechanisms for language models (GPT, BERT)
  • Regularization: Techniques to prevent overfitting (dropout, batch norm)
  • Optimizers: Better than vanilla gradient descent (Adam, RMSprop)

But at their core, they all use the same principles you’ve learned:

  • Neurons computing weighted sums
  • Activation functions introducing non-linearity
  • Backpropagation computing gradients
  • Gradient descent minimizing loss

You now understand the foundation of artificial intelligence.



References

  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). “Learning representations by back-propagating errors” - The original backpropagation paper
  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). “Deep learning” (Nature) - Comprehensive overview by the pioneers
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). “Deep Learning” (MIT Press) - The definitive textbook
  • Nielsen, M. A. (2015). “Neural Networks and Deep Learning” - Excellent free online book