When people first meet convolution, it often appears as a strange formula:

$$y[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k]$$

Then later, in deep learning (explained in part 2), they see Conv1d implemented more like:

$$y[n] = \sum_{k} w[k]\,x[n+k]$$
and hear that this is actually cross-correlation, not strict convolution.
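The difference is easy to see numerically. Here is a minimal NumPy sketch with toy values: true convolution flips the kernel, while cross-correlation is the same sliding dot product with the kernel un-flipped, which is equivalent to convolving with the reversed kernel.

```python
import numpy as np

# Toy input and a deliberately asymmetric kernel, so the flip is visible.
x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, 2.0, 0.5])

# True convolution: NumPy applies the kernel flip internally.
conv = np.convolve(x, h)

# Cross-correlation (up to index conventions): the same sliding dot
# product but with h un-flipped, i.e. convolution with the reversed kernel.
corr = np.convolve(x, h[::-1])

print(conv)  # convolution output
print(corr)  # differs, because h is asymmetric
```

For a symmetric kernel the two coincide, which is why the distinction is easy to overlook in practice.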
So what is really going on?
This post builds the idea carefully from the ground up.
We will go through:
- What a signal is
- What a 1D discrete signal is
- What a system is
- Linearity
- Time invariance
- The unit impulse
- Signals as a sum of scaled shifted impulses
- The impulse response
- Convolution (sum of scaled impulse responses)
- The flip — where it comes from
- Real world example (Audio reverb)
A signal
A signal is any function that conveys information about the behavior of a phenomenon. In signal processing, it is modeled as a quantity that varies with respect to time, space, or another independent variable. It can be mathematically represented as x(t), where t represents the independent variable (usually time), and x(t), the value of the function at any time t, represents the amplitude or intensity of the signal at that time.
Signals can be continuous or discrete. Here x(t) is a continuous signal: time is continuous and there is a value at every instant of time.
A discrete signal is obtained by sampling the continuous signal at a fixed interval: x[n] = x(nT), where T is a fixed sampling period and n is an integer. This defines the discrete signal as a mapping from the integers to the reals, so we can represent the function as x[n].
You can think of signals as carriers of information. Anything that changes over time or space to tell us something about the world is a signal. A simple example is an acoustic signal, where the signal is the variation of pressure caused by the vocal cords or an instrument. When you record speech, a microphone converts these pressure waves into an analog voltage (a continuous-time signal, x(t)), which is then sampled by an ADC (Analog-to-Digital Converter) into a digital sequence (a discrete-time signal, x[n]). In this blog, we will mostly work with discrete signals.
A concrete example of a discrete signal (values chosen here for illustration) is one with four nonzero samples: x[0] = 1, x[1] = 2, x[2] = 0.5, x[3] = −1.
We can write this as the sequence: x = {1, 2, 0.5, −1}
and can be represented visually on the graph below
We can see from the graph that while we only list the values of interest, the signal is technically defined for all integers n. For any index not shown in our sequence (such as n = −1 or n = 4), we can assume the amplitude is zero. Mathematically, we say this signal has finite support (i.e. outside these four values, the signal is zero everywhere).
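A finite-support signal is easy to model in code. A minimal sketch (with illustrative placeholder values): store only the nonzero samples and return zero for every other index.

```python
# Sketch of a finite-support discrete signal: keep the nonzero samples
# in a dict and return 0.0 everywhere else. Values are placeholders.
samples = {0: 1.0, 1: 2.0, 2: 0.5, 3: -1.0}

def x(n):
    """Amplitude of the signal at integer index n (zero outside support)."""
    return samples.get(n, 0.0)

print([x(n) for n in range(-2, 6)])  # zeros outside the four-sample support
```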
Shifting a Discrete Signal
If we have a signal x[n], we can apply the shift operator to it, defined as y[n] = x[n − k]; then y[n] is just the signal x[n] shifted by k. If k > 0, the signal is shifted to the right (delayed); otherwise, it is shifted to the left (advanced).
Visual Representation
Why the sign feels backwards
When you see x[n − k] with k > 0, instinct says "subtracting means moving left." But the opposite happens: the signal shifts right.
Here is why. The expression x[n − k] asks: "give me the value that used to live at position n − k." At n = 5 with k = 2, you are reading x[3], a value from earlier in the sequence. So the value that lived at n = 3 is now being read at n = 5. It moved two steps to the right.
Think of it this way: you are not moving the value, you are moving the reading head left by k, which makes the waveform appear to slide right. Subtracting k from the input index delays the signal, and delay is rightward on the n-axis.
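The shift operator can be sketched on a finite array, zero-padding the samples that slide off the edge (a common convention when working with arrays rather than infinite sequences):

```python
import numpy as np

# Sketch of the shift operator y[n] = x[n - k] on a finite array.
# Positive k delays (shifts right); vacated positions are zero-padded.
def shift(x, k):
    y = np.zeros_like(x)
    if k >= 0:
        y[k:] = x[:len(x) - k]
    else:
        y[:k] = x[-k:]
    return y

x = np.array([1.0, 2.0, 3.0, 0.0, 0.0])
print(shift(x, 2))   # the waveform slides right by two steps
print(shift(x, -1))  # negative k slides it left
```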
System
A system is a transformational rule that takes an input signal x[n] and produces an output signal y[n].
We write it as:

$$y[n] = T\{x[n]\}$$

or equivalently x[n] → y[n].
A common example of a system is an amplifier, where your audio signal gets amplified by a factor α (which can be thought of as volume). Mathematically this can be represented like this:

$$y[n] = \alpha\,x[n]$$

Assuming you record audio represented by a signal x[n], a linear amplifier (the system) with α = 2 can produce an output that is twice as loud; changing α is just like increasing or decreasing the loudness of your speech.
Some examples of systems applied to x[n]:
1. Scaling: y[n] = 3x[n], multiply every sample by 3; amplitudes scale up but positions stay the same.
2. Delay: y[n] = x[n − 2], shift the entire signal two steps to the right.
3. Compression: y[n] = x[2n], read every second sample, squashing the signal in time.
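These three systems can be sketched in a few lines of NumPy (toy input values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# 1. Scaling: y[n] = 3 * x[n]
scale = 3 * x

# 2. Delay: y[n] = x[n - 2], zero-padding the first two samples
delay = np.concatenate(([0.0, 0.0], x[:-2]))

# 3. Compression: y[n] = x[2n], keep every second sample
compress = x[::2]

print(scale)
print(delay)
print(compress)
```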
Linearity in a System
When we say a system is linear, it must respect two conditions: additivity and homogeneity:
1. Additivity (The Summation Rule)
If you have two different signals, x1[n] and x2[n], and you feed them into the system separately, they produce y1[n] and y2[n].
If the system is additive, feeding the sum of those signals into the system will produce the sum of the individual outputs:

$$T\{x_1[n] + x_2[n]\} = y_1[n] + y_2[n]$$
This can be shown visually below — first with a system that passes the test, then one that fails:
Additive system — for example y[n] = 2x[n]
Non-additive system — for example y[n] = x[n]², since (a + b)² ≠ a² + b²
2. Homogeneity (The Scaling Rule)
This rule states that if you scale (multiply) the input by a constant factor a, the output is scaled by that same factor. There are no "surprises" or exponential jumps:

$$T\{a\,x[n]\} = a\,y[n]$$

In other words, if a system passes both tests, it is a Linear System.
Linearity allows us to use Decomposition. Since we know how a linear system treats a single point, we can break any complex signal down into those simple parts, see how the system handles them, and then add them back together at the end.
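Both conditions can be probed numerically. A minimal sketch (the two example systems here, a two-point moving average and a squarer, are my own illustrations): feed random signals through a system and check additivity and homogeneity.

```python
import numpy as np

rng = np.random.default_rng(0)

def moving_average(x):
    # A linear system: y[n] = (x[n] + x[n-1]) / 2
    return (x + np.concatenate(([0.0], x[:-1]))) / 2

def square(x):
    # A nonlinear system: y[n] = x[n] ** 2
    return x ** 2

def is_linear(system, trials=5, n=16, tol=1e-9):
    # Numerically probe additivity and homogeneity on random signals.
    for _ in range(trials):
        x1, x2 = rng.normal(size=n), rng.normal(size=n)
        a = rng.normal()
        additive = np.allclose(system(x1 + x2), system(x1) + system(x2), atol=tol)
        homogeneous = np.allclose(system(a * x1), a * system(x1), atol=tol)
        if not (additive and homogeneous):
            return False
    return True

print(is_linear(moving_average))  # True
print(is_linear(square))          # False
```

Passing a finite number of random probes is evidence, not proof, but it catches nonlinear systems immediately: the squarer fails because (a + b)² has cross-terms.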
Time Invariance
In plain English, a system is Time-Invariant if its behavior doesn’t change over time. If you give the system an input today, or you give it that same input tomorrow, the output should be identical, just shifted in time.
Imagine you are testing an audio system (like an echo effect):
- Step 1: You feed the system a signal x[n] and record the output y1[n].
- Step 2: You wait a few seconds (a shift of k) and feed it the exact same signal, x[n − k].
- Step 3: You check the output. If the system is Time-Invariant, the new output will be exactly the same as the first one, just delayed by k: y2[n] = y1[n − k].
Mathematically, a system is time-invariant if shifting the input causes the output to shift in the same way:

$$T\{x[n-k]\} = y[n-k]$$

- Left side: Shift the input first, then put it through the system.
- Right side: Put the input through the system first, then shift the output.
If these two are equal, then we say the system is Time-Invariant. For example:
The Echo Effect: If you clap your hands in a large hall at 2:00 PM, you hear an echo. If you clap your hands exactly the same way at 5:00 PM, the hall produces the exact same echo. The room’s “rules” for bouncing sound didn’t change between 2:00 and 5:00.
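The shift-then-apply versus apply-then-shift test is easy to run in code. A minimal sketch (the echo and the index-dependent gain are my own example systems): the echo passes, and a system whose gain depends on the absolute time index fails.

```python
import numpy as np

def shift(x, k):
    # Shift right by k >= 0, zero-padding on the left.
    y = np.zeros_like(x)
    y[k:] = x[:len(x) - k]
    return y

def echo(x):
    # Time-invariant system: the signal plus a quieter copy 3 samples later.
    return x + 0.5 * shift(x, 3)

def time_stamp(x):
    # Time-varying system: the gain depends on the absolute index n.
    return np.arange(len(x)) * x

x = np.zeros(12)
x[1] = 1.0  # a single clap

k = 4
print(np.allclose(echo(shift(x, k)), shift(echo(x), k)))              # True
print(np.allclose(time_stamp(shift(x, k)), shift(time_stamp(x), k)))  # False
```

The echo gives the same answer whether the clap happens at index 1 or index 5: the room's rules didn't change. The time-stamped gain does not, because clapping later changes the multiplier.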
Another way to see how time invariance works is shown visually below.
The Unit Impulse
Now we introduce one of the most important signals in this blog: the impulse.
The discrete impulse is given by the formula:

$$\delta[n] = \begin{cases} 1, & n = 0 \\ 0, & n \neq 0 \end{cases}$$

We can similarly define a shifted impulse δ[n − k]:

$$\delta[n-k] = \begin{cases} 1, & n = k \\ 0, & n \neq k \end{cases}$$
Signals as a sum of scaled shifted impulses
This is one of the most foundational identities in signal processing. By representing a complete signal as a sum of shifted impulses δ[n − k], each scaled by its amplitude x[k], we can completely decompose a signal into its simplest parts and then reconstruct it perfectly.
Mathematically, this can be represented as:

$$x[n] = \sum_{k=-\infty}^{\infty} x[k]\,\delta[n-k]$$

Because of this identity, we don’t need to analyze how a system reacts to every possible signal that could ever exist. We only need to analyze how it reacts to δ[n].
- If the system is Linear, we can scale and add the responses.
- If the system is Time-Invariant, the response to a shifted impulse is just a shifted version of the original response.
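The decomposition identity can be checked numerically: rebuild a signal (toy values) as a sum of scaled, shifted impulses and confirm nothing was lost.

```python
import numpy as np

def delta(n, k=0):
    # Shifted unit impulse: 1 at index k, 0 elsewhere, over indices 0..n-1.
    d = np.zeros(n)
    d[k] = 1.0
    return d

x = np.array([3.0, -1.0, 0.0, 2.0])

# Rebuild x as the sum of scaled, shifted impulses: sum_k x[k] * delta[n - k]
rebuilt = sum(x[k] * delta(len(x), k) for k in range(len(x)))

print(np.array_equal(x, rebuilt))  # True
```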
The impulse response
The Impulse Response h[n] of a system is the output produced when the input is a single unit impulse, δ[n].
You can think of it as the system’s fingerprint. Imagine striking a bell with a hammer. The sharp, instantaneous hit is like the impulse δ[n]. The ringing sound that follows is the impulse response h[n]. Every bell has its own characteristic ring: some are deep, some are bright, some sustain for a long time, others decay quickly. That ringing pattern reveals the essential physical behavior of the bell.
In the same way, the impulse response reveals the essential behavior of any system.
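The bell metaphor translates directly to code. A minimal sketch, using a hypothetical "bell" that I model as a simple decaying feedback echo: strike it with a unit impulse and record the ringing.

```python
import numpy as np

def bell(x):
    # Hypothetical "bell": a decaying echo, y[n] = x[n] + 0.6 * y[n-1].
    # It is linear and time-invariant, so it has a fixed fingerprint.
    y = np.zeros_like(x)
    prev = 0.0
    for n in range(len(x)):
        prev = x[n] + 0.6 * prev
        y[n] = prev
    return y

# Strike it with a unit impulse and record the ringing: that is h[n].
impulse = np.zeros(6)
impulse[0] = 1.0
h = bell(impulse)

print(h)  # a geometric decay: 1, 0.6, 0.36, ...
```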
The LTI Contract
We have already seen that any discrete-time signal can be written as a sum of shifted and scaled impulses:

$$x[n] = \sum_{k} x[k]\,\delta[n-k]$$
This decomposition becomes extremely powerful when the system is Linear and Time-Invariant (LTI).
- Time-invariance means the system responds the same way no matter when the impulse occurs. If the impulse is shifted to index k, the response is simply a shifted copy of the original impulse response:

$$T\{\delta[n-k]\} = h[n-k]$$

We can see from the above animation that as we shift the impulse by k (δ[n] → δ[n − k]), the impulse response shifts by exactly the same k (h[n] → h[n − k]); nothing else changes, because the system is LTI.
- Linearity means scaling and summing inputs scales and sums the outputs in exactly the same way. If an impulse is multiplied by an amplitude a, its response is multiplied by that same amplitude: T{a δ[n]} = a h[n].
Together, these two properties mean we do not need to test the system with every possible signal. Since every signal is built from impulses, and since we know exactly how the system responds to one impulse, we can construct the response to any signal by summing the responses to all of its component impulses.
Convolution (Sum of scaled shifted impulse responses)
We now derive the convolution sum from first principles. The key idea is simple:
- Any signal can be written as a sum of scaled, shifted impulses.
- A linear system lets us handle each impulse contribution separately, then add the results together.
- A time-invariant system responds to a shifted impulse by shifting the impulse response.
Putting these facts together forces the convolution formula. Let’s derive it step by step:
Decompose the input into shifted impulses
Any discrete-time signal can be written as:

$$x[n] = \sum_{k=-\infty}^{\infty} x[k]\,\delta[n-k]$$

This means the signal is built from an impulse at every location k, scaled by the amplitude x[k]. So each sample can be viewed as a scaled impulse x[k] δ[n − k].
Apply the system to the whole signal
Let the system be T, and let the output be y[n]. Then:

$$y[n] = T\{x[n]\}$$

Substitute the impulse decomposition of x[n]:

$$y[n] = T\left\{\sum_{k} x[k]\,\delta[n-k]\right\}$$
Use Linearity
Because the system is linear, it preserves sums and scalar multiplication. So we can move the system operator inside the summation:

$$y[n] = \sum_{k} T\{x[k]\,\delta[n-k]\}$$

And since the amplitude x[k] is just a scalar constant with respect to the system, we can pull it out:

$$y[n] = \sum_{k} x[k]\,T\{\delta[n-k]\}$$

So each input sample x[k] scales the system’s response to an impulse at k.
Define the Impulse Response
Now define the impulse response of the system as:

$$h[n] = T\{\delta[n]\}$$
This is the output produced when the input is a unit impulse at index 0.
Use Time-Invariance
Because the system is time-invariant, shifting the input impulse by k shifts the output by the exact same amount. Since δ[n − k] is just the impulse shifted to index k:

$$T\{\delta[n-k]\} = h[n-k]$$

So the response to an impulse at k is just a shifted copy of the impulse response.
Combine Both Results
Substitute this time-invariant property into the expression:

$$y[n] = \sum_{k} x[k]\,h[n-k]$$

We obtain the final result:

$$y[n] = (x * h)[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k]$$
This is the Convolution Sum.
The Intuition Behind the Formula
Each input sample x[k]:
- Behaves like a scaled impulse at location k.
- Launches a shifted copy of the impulse response, x[k] h[n − k].
- Contributes that shifted response to the total output.
The final output is the sum of all these shifted, scaled copies:

$$y[n] = \sum_{k} x[k]\,h[n-k]$$
So, convolution can be understood simply as the sum of scaled, shifted impulse responses.
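That sentence translates directly into code. A minimal sketch: build the output by literally summing scaled, shifted copies of the impulse response, then compare against NumPy's built-in convolution.

```python
import numpy as np

def convolve_as_echoes(x, h):
    # Build y by summing scaled, shifted copies of h: each input sample
    # x[k] launches x[k] * h starting at position k.
    y = np.zeros(len(x) + len(h) - 1)
    for k in range(len(x)):
        y[k:k + len(h)] += x[k] * h
    return y

x = np.array([1.0, 2.0, 0.5])
h = np.array([1.0, -1.0, 0.25])

print(convolve_as_echoes(x, h))
print(np.allclose(convolve_as_echoes(x, h), np.convolve(x, h)))  # True
```

Note that this loop never explicitly flips anything; shifting each copy of h to start at position k produces exactly the same numbers as the flipped-kernel formula.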
Full Proof
Decompose the signal:

$$x[n] = \sum_{k} x[k]\,\delta[n-k]$$

Apply the system:

$$y[n] = T\left\{\sum_{k} x[k]\,\delta[n-k]\right\}$$

By linearity:

$$y[n] = \sum_{k} x[k]\,T\{\delta[n-k]\}$$

By time invariance, if T{δ[n]} = h[n], then T{δ[n − k]} = h[n − k]. Therefore:

$$y[n] = \sum_{k} x[k]\,h[n-k]$$

Which is the convolution sum.
In one sentence: Convolution is just the sum of all the shifted impulse responses generated by each input sample, scaled by that sample’s amplitude.
Visualization of Convolution
Each colored row is one term of the sum: the input value x[k] scales a copy of h, and that copy is time-shifted to start at position k. The output at any index n is the vertical sum of every row’s value at that n; in the animation, the rows literally collapse downward into y[n].
The Flip: Where It Comes From
Most explanations present the flipped kernel as a rule you have to memorize. It is not. It falls directly out of the convolution sum, and once you see why, it never feels arbitrary again.
Recall the convolution sum:

$$y[n] = \sum_{k} x[k]\,h[n-k]$$

Fix n at some value and read the argument of h as k increases from 0.
The argument n − k runs backwards as k increases. You are reading h at indices n, n − 1, n − 2, …, that is, in reverse order, sitting over the input window x[0], …, x[n].
That reversed reading is the flipped kernel. Nobody chose to flip it. The subtraction flips it automatically.
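You can watch the flip fall out of the sum at a single output index. A minimal sketch (toy values, and n = 3 chosen arbitrarily): evaluate y[3] straight from the formula, then as a dot product with the reversed kernel.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, 0.5, 0.25, 0.125])

n = 3
# Evaluate y[3] straight from the sum: sum_k x[k] * h[n - k].
# As k runs 0, 1, 2, 3 the kernel index n - k runs 3, 2, 1, 0:
# h is read backwards without anyone asking for a flip.
y3 = sum(x[k] * h[n - k] for k in range(n + 1))

# The same number falls out of a dot product with the reversed kernel.
y3_flipped = np.dot(x[:n + 1], h[:n + 1][::-1])

print(y3, y3_flipped)              # equal
print(np.convolve(x, h)[n] == y3)  # matches NumPy's convolution too
```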
The vertical slice picture
Think of laying out all the shifted impulse responses as rows in a table:

| row | values at each n |
|---|---|
| k = 0 | x[0] h[n] |
| k = 1 | x[1] h[n − 1] |
| k = 2 | x[2] h[n − 2] |
| ⋮ | ⋮ |

Now slice vertically at a fixed n. Reading downward through column n:

x[0] h[n], x[1] h[n − 1], x[2] h[n − 2], …

The second argument of h steps backwards: n, n − 1, n − 2, … You are reading h from its latest value back to its earliest, exactly a reversed h aligned over the input at position n.
Why the sliding view needs the flip
In the sliding kernel view, the kernel window moves left to right over the input. At output position n, the rightmost box of the window touches x[n], the newest input, and the leftmost touches x[n − (K − 1)], the oldest input in the window.
For the products to match the convolution sum, h[0] must multiply the newest input (x[n]) and h[K − 1] must multiply the oldest. But if you lay h out left to right without flipping, h[0] ends up on the oldest input. Flipping h corrects the alignment so the products are identical to what the impulse sum computes.
One sentence: the flip is the geometric consequence of the minus sign in h[n − k], and that minus sign is what makes later inputs subtract from the current time index, which is the definition of a causal delay.
Visual Representation
How to Read This Animation
The cyan bracket in the top half and the kernel boxes in the bottom half are always showing you the same thing — just from two different angles. Every time you press Next, both move together by one step. The products listed on the right of the table are identical to the products shown inside the kernel boxes. The output at the bottom right of the sliding view matches the green cell in the sum row. They are not two different calculations. They are two ways of reading the same table.
Two Ways to Think About Convolution
Way 1 — The Impulse Response View (read by row)
This is the original physical intuition. You feed the system one input sample at a time. Each sample fires the system, and the system responds with a scaled, shifted copy of its impulse response h[n]:
So x[0] fires a copy of h at position 0 with height x[0].
Then x[1] fires a copy at position 1 with height x[1].
Then x[2] fires a copy at position 2 with height x[2].
And so on.
The output is what you get when all those echoes land on top of each other and add. In the table, each row is one echo. Reading across a row you see where that echo travels. Reading down a column you see every echo that has landed at that moment in time.
This view answers the question: “which input samples contributed to this output?”
Way 2 — The Sliding Kernel View (read by column)
This is the computational view. Instead of thinking about firing echoes, you slide a small window across the input and compute a dot product at each position.
At output position n, the flipped kernel sits over K consecutive input samples. You multiply each kernel weight by the input value directly beneath it, then sum. That sum is y[n].
This view answers the question: “what input values is the kernel currently looking at?”
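The sliding view is exactly a windowed dot product. A minimal sketch (toy values, "valid" positions only, i.e. where the kernel fits entirely over the input): flip the kernel once, then slide and dot.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = np.array([0.5, 0.25, 0.125])

# Sliding-kernel view: at each position, take the dot product of the
# flipped kernel with the input window directly beneath it.
flipped = h[::-1]
valid = np.array([np.dot(flipped, x[i:i + len(h)])
                  for i in range(len(x) - len(h) + 1)])

print(valid)
print(np.allclose(valid, np.convolve(x, h, mode="valid")))  # True
```

This column-by-column computation produces the same numbers as the row-by-row echo view; they are two readings of the same table.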
The Flip — One Last Time
Notice that the kernel is read backwards: h[K − 1], …, h[1], h[0]. Why?
In the impulse view, row k contributes x[k] h[n − k] to column n. As k increases from 0 to n down the column, the argument n − k decreases from n to 0. You are reading h backwards through the column.
In the sliding view, the leftmost kernel box touches the oldest input in the window, and the rightmost touches the newest. For the products to match the impulse view, h[0] must be paired with the newest input, which means it must sit on the right of the kernel window. Flipping h puts h[0] on the right.
The flip is not an arbitrary convention. It is the geometric consequence of the minus sign in h[n − k].