Signal Processing

Convolution: from first principles to modern deep learning - Part 1

Bello Tukur · April 29, 2026 · 18 min read
Starting from the definition of Signal Processing and linear shift-invariant systems, we derive the discrete convolution operation from scratch — with a live interactive simulation.

When people first meet convolution, it often feels like a strange formula:

y[n] = \sum_{k} x[k] \cdot h[n - k]

Then later, in deep learning (explained in part 2), they see Conv1d implemented more like:

y[n] = \sum_{r=0}^{K-1} w[r] \cdot x[n + r]

and hear that this is actually cross-correlation, not strict convolution.
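You can see the mismatch numerically. Here is a minimal NumPy sketch (the arrays are made-up examples, not from any particular library default):

```python
import numpy as np

x = np.array([1.0, 2.0, 0.5, -1.0])   # made-up input signal
w = np.array([0.25, 0.5, 1.0])        # made-up kernel

# Strict convolution: y[n] = sum_k x[k] * w[n - k]
conv = np.convolve(x, w, mode="full")

# Cross-correlation: slide w over x WITHOUT flipping it
xcorr = np.correlate(x, w, mode="full")

print(np.allclose(conv, xcorr))   # False: the two formulas disagree
# Correlating with the flipped kernel reproduces convolution exactly
print(np.allclose(conv, np.correlate(x, w[::-1], mode="full")))   # True
```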

So what is really going on?

This post builds the idea carefully from the ground up.

We will go through:

  • What a signal is
  • What a 1D discrete signal is
  • What a system is
  • Linearity
  • Time invariance
  • The unit impulse
  • Signals as a sum of scaled shifted impulses
  • The impulse response
  • Convolution (sum of scaled impulse responses)
  • The flip — where it comes from
  • Real world example (Audio reverb)

A signal

A signal is any function that conveys information about the behavior of some phenomenon. In signal processing, it is modeled as the variation of a quantity with respect to time, space, or another independent variable. We write it as x(t), where t represents the independent variable (usually time); the value of the function at any time represents the amplitude or intensity of the signal at that time.

Signals can be continuous or discrete. x(t) is a continuous signal: time is continuous, and the signal has a value at every instant.

A discrete signal is obtained by sampling the continuous signal at instants nT, where T is a fixed sampling period and n \in \mathbb{Z}. This defines the discrete signal as a mapping x: \mathbb{Z} \to \mathbb{R}, so we can write x[n] = x(nT).
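To make x[n] = x(nT) concrete, here is a small NumPy sketch that samples a sine tone; the 5 Hz frequency and 1 kHz sampling rate are arbitrary choices for illustration:

```python
import numpy as np

f = 5.0                # signal frequency in Hz (arbitrary)
fs = 1000.0            # sampling rate in Hz (arbitrary)
T = 1.0 / fs           # fixed sampling period

n = np.arange(200)                   # sample indices n = 0, ..., 199
x = np.sin(2 * np.pi * f * n * T)    # x[n] = x(nT), a discrete signal
```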

You can think of signals as carriers of information: anything that changes over time or space to tell us something about the world is a signal. A simple example is an acoustic signal, where the signal is the variation of pressure caused by vocal cords or an instrument. When you record speech, a microphone converts these pressure waves into an analog voltage (a continuous-time signal, x(t)), which is then sampled by an ADC (Analog-to-Digital Converter) into a digital sequence (a discrete-time signal, x[n]). In this blog, we will mostly work with discrete signals.

Here is a concrete example of a discrete signal:

x[0] = 2, \quad x[1] = 5, \quad x[2] = 1, \quad x[3] = 0

We can write this as the sequence:

x = [2, 5, 1, 0], which can be represented visually on the graph below.

[Interactive plot: x[n] = [2, 5, 1, 0]; support n = 0…3, length 4]

We can see from the graph that while we only list the values of interest, the signal is technically defined for all n \in \mathbb{Z}. For any index not shown in our sequence (like n = -1 or n = 4), we can assume the amplitude is zero. Mathematically, we say this signal has finite support (i.e., outside these four values, the signal is zero everywhere).

Shifting a Discrete Signal

If we have a signal x[n], we can apply the shift operator S_m to it, defined as (S_m x)[n] = x[n-m] = y[n]; then y is just the signal x shifted by m. If m > 0, the signal is shifted to the right; otherwise, it is shifted to the left.

Visual Representation

[Interactive demo: apply S_m to x[n], replacing n with (n − m); m > 0 shifts right, m < 0 shifts left, m = 0 leaves the signal unchanged]

Why the sign feels backwards

When you see x[n - m] with m > 0, instinct says “subtracting means moving left.” But the opposite happens — the signal shifts right.

Here is why. The expression x[n - m] asks: “give me the value that used to live at position n - m.” At n = 3 with m = 2, you are reading x[1] — a value from earlier in the sequence. So the same value that lived at n = 1 is now being read at n = 3. It moved two steps to the right.

Think of it this way: you are not moving the value — you are moving the reading head left by m, which makes the waveform appear to slide right. Subtracting from the input index delays the signal, and delay is rightward on the n-axis.

m > 0 \implies \text{shift right} \qquad m < 0 \implies \text{shift left}
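Here is a minimal sketch of the shift operator for finite-support signals stored as arrays; the zero-padding at the edges is an assumption that matches the finite-support convention above:

```python
import numpy as np

def shift(x: np.ndarray, m: int) -> np.ndarray:
    """(S_m x)[n] = x[n - m]: right shift for m > 0, left shift for m < 0.

    Samples shifted in from outside the array are set to zero
    (the finite-support convention).
    """
    y = np.zeros_like(x)
    if m > 0:
        y[m:] = x[:-m]     # values move toward higher indices (right)
    elif m < 0:
        y[:m] = x[-m:]     # values move toward lower indices (left)
    else:
        y[:] = x
    return y

x = np.array([2, 5, 1, 0])
print(shift(x, 2))   # [0 0 2 5]: the value that lived at n = 1 is now at n = 3
```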

System

A system is a transformation rule that takes an input signal x[n] and produces an output signal y[n].

We write it as:

y = T(x)

or equivalently y[n] = (Tx)[n].

A common example of a system is an amplifier, where your audio signal gets amplified by some factor (think of it as volume). Mathematically: if you record audio represented by a signal x[n], a linear amplifier (the system) produces an output (Tx)[n] = kx[n] that is k times as loud, just like increasing or decreasing the loudness of your speech.

Some examples of systems applied to x[n]:

1. Scaling: (Tx)[n] = 3x[n] — multiply every sample by 3; amplitudes scale up but positions stay the same.

[Interactive demo: scaling system (Tx)[n] = 3·x[n]; input length 6, output length 6]

2. Delay: (Tx)[n] = x[n-2] — shift the entire signal two steps to the right.

[Interactive demo: delay system (Tx)[n] = x[n − 2]; input length 6, output length 6]

3. Compression: (Tx)[n] = x[2n] — read every second sample, squashing the signal in time.

[Interactive demo: compression system (Tx)[n] = x[2n]; input length 6, output length 3, only even-indexed samples survive]
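All three example systems are one-liners in NumPy. A sketch with a made-up length-6 input:

```python
import numpy as np

x = np.array([1.0, 2.0, 1.0, 0.0, 3.0, 2.0])    # made-up length-6 input

# 1. Scaling: (Tx)[n] = 3·x[n]
scaled = 3 * x                                   # length 6

# 2. Delay: (Tx)[n] = x[n - 2], zeros fill the vacated slots
delayed = np.concatenate([np.zeros(2), x[:-2]])  # length 6

# 3. Compression: (Tx)[n] = x[2n], only even-indexed samples survive
compressed = x[::2]                              # length 3
```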

Linearity in a System

When we say a system is linear, it must respect two conditions: additivity and homogeneity.

1. Additivity (The Summation Rule)

If you have two different signals, x_1(t) and x_2(t), and you feed them into the system separately, they produce y_1(t) and y_2(t).

If the system is additive, feeding the sum of those signals into the system will produce the sum of the individual outputs.

T\{x_1(t) + x_2(t)\} = y_1(t) + y_2(t)

This can be shown visually below — first with a system that passes the test, then one that fails:

[Interactive demo: additive system T{x[n]} = 2·x[n] fed x₁ = [1, 2, 1, 0], x₂ = [0, 1, 2, 1], and x₁ + x₂ = [1, 3, 3, 1]; the output of the sum equals the sum of the outputs]

[Interactive demo: non-additive system T{x[n]} = x[n]² fed the same inputs; squaring the sum does not equal the sum of the squares]

2. Homogeneity (The Scaling Rule)

This rule states that if you scale (multiply) the input by a constant factor a, the output is scaled by that same factor. There are no “surprises” or exponential jumps.

T\{a \cdot x(t)\} = a \cdot T\{x(t)\}

[Interactive demo: linear system T{x[n]} = 2·x[n] with x[n] = [1, 2, 1, 0]; scaling the input by a and then applying T equals applying T and then scaling, for all a]

[Interactive demo: non-linear system T{x[n]} = x[n]² with the same input; the two paths diverge as a grows, so T{a·x[n]} ≠ a·T{x[n]}]

In other words, if a system passes this combined test, it is a Linear System:

T\{a \cdot x_1(t) + b \cdot x_2(t)\} = a \cdot y_1(t) + b \cdot y_2(t)


Linearity allows us to use Decomposition. Since we know how a linear system treats a single point, we can break any complex signal down into those simple parts, see how the system handles them, and then add them back together at the end.
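The combined test is easy to check numerically. A sketch using the same two systems as the demos above (the scalars a and b are arbitrary choices):

```python
import numpy as np

def T_linear(x):
    return 2 * x      # T{x[n]} = 2·x[n]

def T_square(x):
    return x ** 2     # T{x[n]} = x[n]²

x1 = np.array([1.0, 2.0, 1.0, 0.0])
x2 = np.array([0.0, 1.0, 2.0, 1.0])
a, b = 3.0, -2.0      # arbitrary scalars

def is_linear(T):
    # One combined superposition check: T{a·x1 + b·x2} == a·T{x1} + b·T{x2}
    return np.allclose(T(a * x1 + b * x2), a * T(x1) + b * T(x2))

print(is_linear(T_linear))   # True
print(is_linear(T_square))   # False: squaring breaks superposition
```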

Time Invariance

In plain English, a system is Time-Invariant if its behavior doesn’t change over time. If you give the system an input today, or you give it that same input tomorrow, the output should be identical, just shifted in time.

Imagine you are testing an audio system (like an echo effect):

  1. Step 1: You feed the system a signal x[n] and record the output y[n].

  2. Step 2: You wait a few seconds (a shift of m) and feed it the exact same signal x[n-m].

  3. Step 3: You check the output. If the system is Time-Invariant, the new output will be exactly the same as the first one, just delayed by m: y[n-m].

Mathematically, a system H is time-invariant if shifting the input causes the output to shift in the same way.

H(S_m x) = S_m(H x)

  • Left side: Shift the input first, then put it through the system.

  • Right side: Put the input through the system first, then shift the output.

If these two are equal, we say the system is Time-Invariant. For example:

The Echo Effect: If you clap your hands in a large hall at 2:00 PM, you hear an echo. If you clap your hands exactly the same way at 5:00 PM, the hall produces the exact same echo. The room’s “rules” for bouncing sound didn’t change between 2:00 and 5:00.

Another way to see how time invariance works is shown visually below:

[Interactive demo: shifting the input by m = 2 shifts the output by the same m = 2]
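A numeric sketch of the check H(S_m x) = S_m(H x), reusing the shift operator from earlier. The delay system passes; the compression system (downsampling) fails. Note that padding the compressed output back to the input length is an assumption made just so the two sides are comparable:

```python
import numpy as np

def shift(x, m):
    # The same zero-padded shift operator S_m as before
    y = np.zeros_like(x)
    if m > 0:
        y[m:] = x[:-m]
    elif m < 0:
        y[:m] = x[-m:]
    else:
        y[:] = x
    return y

def H_delay(x):
    return shift(x, 2)            # (Hx)[n] = x[n - 2]

def H_compress(x):
    # (Hx)[n] = x[2n], zero-padded back to the input length for comparison
    y = np.zeros_like(x)
    y[: len(x[::2])] = x[::2]
    return y

x = np.array([1.0, 2.0, 1.0, 0.0, 3.0, 2.0])
m = 2

print(np.allclose(H_delay(shift(x, m)), shift(H_delay(x), m)))        # True
print(np.allclose(H_compress(shift(x, m)), shift(H_compress(x), m)))  # False
```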

The Unit Impulse

Now we introduce one of the most important signals in this post: the impulse.

The discrete impulse is given by:

\delta[n] = \begin{cases} 1, & n = 0 \\ 0, & n \neq 0 \end{cases}

We can similarly define a shifted impulse:

\delta[n-k] = \begin{cases} 1, & n = k \\ 0, & n \neq k \end{cases}

Signals as a sum of scaled shifted impulses

This is one of the most foundational identities in signal processing. By representing a complete signal as a sum of shifted impulses, \delta[n-k], each scaled by its amplitude x[k], we can completely decompose a signal into its simplest parts and then reconstruct it perfectly.

[Interactive demo: x[n] = [1, 0.5, 2.4] decomposed into 1·δ[n−0], 0.5·δ[n−1], and 2.4·δ[n−2]]

Mathematically, this can be represented as:

x[n] = \sum_{k=-\infty}^{\infty} x[k] \delta[n-k]

Because of this identity, we don’t need to analyze how a system reacts to every possible signal x[n] that could ever exist. We only need to analyze how it reacts to \delta[n].

  1. If the system is Linear, we can scale and add the responses.

  2. If the system is Time-Invariant, the response to a shifted impulse \delta[n-k] is just a shifted version of the original response.
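Before moving on, a quick numeric check of the identity itself; this sketch uses the three-sample signal from the demo above:

```python
import numpy as np

def delta(n: np.ndarray, k: int = 0) -> np.ndarray:
    # δ[n - k]: 1 where n == k, 0 everywhere else
    return (n == k).astype(float)

x = np.array([1.0, 0.5, 2.4])    # the three-sample signal from the demo
n = np.arange(len(x))

# Rebuild x[n] as the sum of scaled, shifted impulses x[k]·δ[n − k]
rebuilt = sum(x[k] * delta(n, k) for k in range(len(x)))
print(np.allclose(rebuilt, x))   # True: perfect reconstruction
```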

The impulse response

The Impulse Response of a system is the output produced when the input is a single unit impulse, \delta[n].

h[n] = \mathcal{T}\{\delta[n]\}

You can think of it as the system’s fingerprint. Imagine striking a bell with a hammer. The sharp, instantaneous hit is like the impulse \delta[n]. The ringing sound that follows is the impulse response h[n]. Every bell has its own characteristic ring: some are deep, some are bright, some sustain for a long time, others decay quickly. That ringing pattern reveals the essential physical behavior of the bell.

In the same way, the impulse response reveals the essential behavior of any system.
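A sketch of “striking the bell” in code: feed a unit impulse into a toy echo system (the coefficients are made up for illustration) and record what rings out:

```python
import numpy as np

def T(x: np.ndarray) -> np.ndarray:
    # A toy "bell": the input plus a quieter copy delayed by 3 samples,
    # (Tx)[n] = x[n] + 0.5·x[n - 3]. The coefficients are made up.
    y = x.copy()
    y[3:] += 0.5 * x[:-3]
    return y

# Strike the system with a unit impulse and record the output
d = np.zeros(8)
d[0] = 1.0
h = T(d)
print(h)   # [1.  0.  0.  0.5 0.  0.  0.  0. ]  <- the system's fingerprint
```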

The LTI Contract

We have already seen that any discrete-time signal can be written as a sum of shifted and scaled impulses:

x[n] = \sum_{k=-\infty}^{\infty} x[k] \delta[n-k]

This decomposition becomes extremely powerful when the system is Linear and Time-Invariant (LTI).

  1. Time-invariance means the system responds the same way no matter when the impulse occurs. If the impulse is shifted to index k, the response is simply a shifted copy of the original impulse response:

\mathcal{T}\{\delta[n-k]\} = h[n-k]

[Interactive demo: h[n] = [1, 0.7, 0.45, 0.25, 0.12, 0.05]; dragging the impulse to k makes δ[n−k] and h[n−k] slide together]

We can see from the demo above that as we shift the impulse \delta[n] by k (giving \delta[n-k]), the impulse response h[n] shifts by exactly the same k (giving h[n-k]). Nothing else changes, because the system is LTI.

  2. Linearity means scaling and summing inputs scales and sums the outputs in exactly the same way. If an impulse is multiplied by an amplitude x[k], its response is multiplied by that same amplitude.

Together, these two properties mean we do not need to test the system with every possible signal. Since every signal is built from impulses, and since we know exactly how the system responds to one impulse, we can construct the response to any signal by summing the responses to all of its component impulses.

Convolution (Sum of scaled shifted impulse responses)

We now derive the convolution sum from first principles. The key idea is simple:

  • Any signal can be written as a sum of scaled, shifted impulses.

  • A linear system lets us handle each impulse contribution separately, then add the results together.

  • A time-invariant system responds to a shifted impulse by shifting the impulse response.

Putting these facts together forces the convolution formula. Let’s derive it step by step:

Decompose the input into shifted impulses

Any discrete-time signal can be written as:

x[n] = \sum_{k=-\infty}^{\infty} x[k]\delta[n-k]

This means the signal is built from an impulse at every location k, scaled by the amplitude x[k]. So each sample x[k] can be viewed as a scaled impulse x[k]\delta[n-k].

Apply the system to the whole signal

Let the system be \mathcal{T}, and let the output be y[n]. Then:

y[n] = \mathcal{T}\{x[n]\}

Substitute the impulse decomposition of x[n]:

y[n] = \mathcal{T}\left\{ \sum_{k=-\infty}^{\infty} x[k]\delta[n-k] \right\}

Use Linearity

Because the system is linear, it preserves sums and scalar multiplication. So we can move the system operator \mathcal{T} inside the summation:

y[n] = \sum_{k=-\infty}^{\infty} \mathcal{T}\{x[k]\delta[n-k]\}

And since the amplitude x[k] is just a scalar constant with respect to the system, we can pull it out:

y[n] = \sum_{k=-\infty}^{\infty} x[k]\mathcal{T}\{\delta[n-k]\}

So this means each input sample x[k] scales the system’s response to an impulse at k.

Define the Impulse Response

Now define the impulse response of the system as:

h[n] = \mathcal{T}\{\delta[n]\}

This is the output produced when the input is a unit impulse at index 0.

Use Time-Invariance

Because the system is time-invariant, shifting the input impulse by k shifts the output by the exact same amount. Since \delta[n-k] is just the impulse shifted to index k:

\mathcal{T}\{\delta[n-k]\} = h[n-k]

So the response to an impulse at k is just a shifted copy of the impulse response.

Combine Both Results

Substitute this time-invariant property into the expression:

y[n] = \sum_{k=-\infty}^{\infty} x[k]\mathcal{T}\{\delta[n-k]\}

y[n] = \sum_{k=-\infty}^{\infty} x[k]h[n-k]

We obtain the final result:

\boxed{ y[n] = \sum_{k=-\infty}^{\infty} x[k]h[n-k] }

This is the Convolution Sum.
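Here is the convolution sum translated almost line for line into Python, restricted to finite-support signals. The example reuses the x and h from the visualization further below, and the result matches NumPy’s built-in np.convolve:

```python
import numpy as np

def convolve(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Direct convolution sum: y[n] = Σ_k x[k]·h[n − k] (finite support)."""
    N, M = len(x), len(h)
    y = np.zeros(N + M - 1)
    for n in range(len(y)):
        for k in range(N):
            if 0 <= n - k < M:     # h[n - k] is zero outside its support
                y[n] += x[k] * h[n - k]
    return y

x = np.array([1.0, 2.0, 0.6, 0.9])
h = np.array([1.5, 0.5])
print(convolve(x, h))              # [1.5  3.5  1.9  1.65 0.45]
print(np.allclose(convolve(x, h), np.convolve(x, h)))   # True
```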

The Intuition Behind the Formula

Each input sample x[k]:

  • Behaves like a scaled impulse at location k.

  • Launches a shifted copy of the impulse response h[n-k].

  • Contributes that shifted response to the total output.

The final output is the sum of all these shifted, scaled copies:

y[n] = \sum_k x[k]h[n-k]

So, convolution can be understood simply as the sum of scaled, shifted impulse responses.

Full Proof

Decompose the signal:

x[n] = \sum_k x[k]\delta[n-k]

Apply the system:

\mathcal{T}\{x[n]\} = \mathcal{T}\left\{\sum_k x[k]\delta[n-k]\right\}

By linearity:

y[n] = \sum_k x[k]\mathcal{T}\{\delta[n-k]\}

By time invariance, if h[n] = \mathcal{T}\{\delta[n]\}, then \mathcal{T}\{\delta[n-k]\} = h[n-k]. Therefore:

y[n] = \sum_k x[k]h[n-k]

Which is the convolution sum.

In one sentence: Convolution is just the sum of all the shifted impulse responses generated by each input sample, scaled by that sample’s amplitude.

Visualization of Convolution

[Interactive demo: x[n] = [1, 2, 0.6, 0.9] convolved with h[n] = [1.5, 0.5] gives y[n] = [1.5, 3.5, 1.9, 1.65, 0.45]]

Each colored row is one term of the sum: the input value x[k] scales a copy of h[n], and that copy is time-shifted to start at position k. The output y[n] at any index n is the vertical sum of every row’s value at that n — in the demo, the rows literally collapse downward into y[n].

The Flip: Where It Comes From

Most explanations present the flipped kernel as a rule you have to memorize. It is not. It falls directly out of the convolution sum, and once you see why, it never feels arbitrary again.

Recall the convolution sum:

y[n] = \sum_{k=0}^{N-1} x[k] \cdot h[n - k]

Fix n at some value — say n = 2 — and read the argument h[n - k] as k increases through 0, 1, 2, \ldots

k = 0 \;\Rightarrow\; h[2 - 0] = h[2]
k = 1 \;\Rightarrow\; h[2 - 1] = h[1]
k = 2 \;\Rightarrow\; h[2 - 2] = h[0]

The argument n - k runs backwards as k increases. You are reading h at indices 2, 1, 0 — that is h in reverse order, sitting over the input window x[0], x[1], x[2].

That reversed reading is the flipped kernel. Nobody chose to flip it. The subtraction n - k flips it automatically.

The vertical slice picture

Think of laying out all the shifted impulse responses as rows in a table:

row k     values at each n
k = 0     x[0] · h[n]
k = 1     x[1] · h[n−1]
k = 2     x[2] · h[n−2]

Now slice vertically at a fixed n. Reading downward through column n:

\text{row 0}: \; h[n-0], \quad \text{row 1}: \; h[n-1], \quad \text{row 2}: \; h[n-2]

The second argument steps backwards: n, \; n-1, \; n-2, \ldots You are reading h from its latest value back to its earliest — exactly a reversed h aligned over the input at position n.

Why the sliding view needs the flip

In the sliding kernel view, the kernel window moves left to right over the input. At position n, the rightmost box of the window touches x[n] and the leftmost touches x[n - M + 1] — the oldest input in the window.

For the products to match the convolution sum, h[0] must multiply the newest input (x[n]) and h[M-1] must multiply the oldest. But if you lay h left to right without flipping, h[0] ends up on the oldest input. Flipping h corrects the alignment so the products are identical to what the impulse sum computes.

One sentence: the flip is the geometric consequence of the minus sign in h[n - k], and that minus sign is what makes later inputs subtract from the current time index — which is the definition of a causal delay.

Visual Representation

[Interactive demo: x = [1, 2, 1, 3, 2], h = [1, 0.5, 0.25], h_flip = [0.25, 0.5, 1], output length 7; at n = 0, y[0] = 1.00]

How to Read This Animation

The cyan bracket in the top half and the kernel boxes in the bottom half are always showing you the same thing — just from two different angles. Every time you press Next, both move together by one step. The products listed on the right of the table are identical to the products shown inside the kernel boxes. The output y[n] at the bottom right of the sliding view matches the green cell in the sum row. They are not two different calculations. They are two ways of reading the same table.

Two Ways to Think About Convolution

Way 1 — The Impulse Response View (read by row)

This is the original physical intuition. You feed the system one input sample at a time. Each sample x[k] fires the system, and the system responds with a scaled, shifted copy of its impulse response h:

x[k] \;\longrightarrow\; x[k] \cdot h[n - k]

So x[0] fires h at position 0 with height x[0].
Then x[1] fires h at position 1 with height x[1].
Then x[2] fires h at position 2 with height x[2].
And so on.

The output y[n] is what you get when all those echoes land on top of each other and add. In the table, each row is one echo. Reading across a row you see where that echo travels. Reading down a column you see every echo that has landed at that moment in time.

y[n] = \sum_{k=0}^{N-1} x[k] \cdot h[n - k]

This view answers the question: “which input samples contributed to this output?”
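The impulse response view maps directly onto a “scatter” loop: each input sample launches its echo, and the echoes accumulate. A sketch with the same signals as the earlier visualization:

```python
import numpy as np

x = np.array([1.0, 2.0, 0.6, 0.9])
h = np.array([1.5, 0.5])

# Way 1: each input sample x[k] fires a scaled, shifted echo of h,
# and all the echoes add up into the output.
y = np.zeros(len(x) + len(h) - 1)
for k in range(len(x)):
    y[k : k + len(h)] += x[k] * h     # the echo launched by x[k]

print(np.allclose(y, np.convolve(x, h)))   # True
```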

Way 2 — The Sliding Kernel View (read by column)

This is the computational view. Instead of thinking about firing echoes, you slide a small window across the input and compute a dot product at each position.

At output position n, the flipped kernel h_{\text{flip}} sits over M consecutive input samples. You multiply each kernel weight by the input value directly beneath it, then sum. That sum is y[n].

y[n] = \sum_{j=0}^{M-1} h_{\text{flip}}[j] \cdot x[n + j - (M-1)]

This view answers the question: “what input values is the kernel currently looking at?”
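And the sliding kernel view as a “gather” loop: zero-pad the input, flip h, and take a dot product at each position. This sketch uses the same signals as above (the zero-padding is the finite-support assumption again):

```python
import numpy as np

x = np.array([1.0, 2.0, 0.6, 0.9])
h = np.array([1.5, 0.5])
M = len(h)

# Way 2: slide the flipped kernel over the zero-padded input and
# take a dot product at each output position.
h_flip = h[::-1]
x_pad = np.concatenate([np.zeros(M - 1), x, np.zeros(M - 1)])
y = np.array([x_pad[n : n + M] @ h_flip for n in range(len(x) + M - 1)])

print(np.allclose(y, np.convolve(x, h)))   # True
```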

The Flip — One Last Time

Notice that h_{\text{flip}} = [h[M-1], \ldots, h[1], h[0]] — h read backwards. Why?

In the impulse view, row k contributes h[n-k] to column n. As k increases from 0 to M-1 down the column, the argument n - k decreases from n to n - M + 1. You are reading h backwards through the column.

In the sliding view, the leftmost kernel box touches the oldest input in the window, and the rightmost touches the newest. For the products to match the impulse view, h[0] must be paired with the newest input — which means it must sit on the right of the kernel window. Flipping h puts h[0] on the right.

The flip is not an arbitrary convention. It is the geometric consequence of the minus sign in h[n - k].