When people first meet convolution, it often appears as a strange formula:

$$y[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k]$$

Then later, in deep learning (explained in part 2), they see Conv1d implemented more like:

$$y[n] = \sum_{k} w[k]\,x[n+k]$$
and hear that this is actually cross-correlation, not strict convolution.
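The difference is easy to see numerically. Here is a minimal NumPy sketch with toy values: true convolution flips the kernel, while cross-correlation is the same sliding dot product with the kernel un-flipped, which is equivalent to convolving with the reversed kernel.

```python
import numpy as np

# Toy input and a deliberately asymmetric kernel, so the flip is visible.
x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, 2.0, 0.5])

# True convolution: NumPy applies the kernel flip internally.
conv = np.convolve(x, h)

# Cross-correlation (up to index conventions): the same sliding dot
# product but with h un-flipped, i.e. convolution with the reversed kernel.
corr = np.convolve(x, h[::-1])

print(conv)  # convolution output
print(corr)  # differs, because h is asymmetric
```

For a symmetric kernel the two coincide, which is why the distinction is easy to overlook in practice.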
So what is really going on?
This post builds the idea carefully from the ground up.
We will go through:
- What a signal is
- What a 1D discrete signal is
- What a system is
- Linearity
- Time invariance
- The unit impulse
- Signals as a sum of scaled shifted impulses
- The impulse response
- Convolution (sum of scaled impulse responses)
- The flip — where it comes from
- Real world example (Audio reverb)
A signal
A signal is any function that conveys information about the behavior of a phenomenon. In signal processing, it is modeled as a quantity that varies with respect to time, space, or another independent variable. It can be mathematically represented as x(t), where t represents the independent variable (usually time), and x(t), the value of the function at any time t, represents the amplitude or intensity of the signal at that time.
Signals can be continuous or discrete. Here x(t) is a continuous signal: time is continuous and there is a value at every instant of time.
A discrete signal is obtained by sampling the continuous signal at a fixed interval: x[n] = x(nT), where T is a fixed sampling period and n is an integer. This defines the discrete signal as a mapping from the integers to the reals, so we can represent the function as x[n].
You can think of signals as carriers of information. Anything that changes over time or space to tell us something about the world is a signal. A simple example is an acoustic signal, where the signal is the variation of pressure caused by the vocal cords or an instrument. When you record speech, a microphone converts these pressure waves into an analog voltage (a continuous-time signal, x(t)), which is then sampled by an ADC (Analog-to-Digital Converter) into a digital sequence (a discrete-time signal, x[n]). In this blog, we will mostly work with discrete signals.
A concrete example of a discrete signal (values chosen here for illustration) is one with four nonzero samples: x[0] = 1, x[1] = 2, x[2] = 0.5, x[3] = −1.
We can write this as the sequence: x = {1, 2, 0.5, −1}
and can be represented visually on the graph below
We can see from the graph that while we only list the values of interest, the signal is technically defined for all integers n. For any index not shown in our sequence (such as n = −1 or n = 4), we can assume the amplitude is zero. Mathematically, we say this signal has finite support (i.e. outside these four values, the signal is zero everywhere).
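A finite-support signal is easy to model in code. A minimal sketch (with illustrative placeholder values): store only the nonzero samples and return zero for every other index.

```python
# Sketch of a finite-support discrete signal: keep the nonzero samples
# in a dict and return 0.0 everywhere else. Values are placeholders.
samples = {0: 1.0, 1: 2.0, 2: 0.5, 3: -1.0}

def x(n):
    """Amplitude of the signal at integer index n (zero outside support)."""
    return samples.get(n, 0.0)

print([x(n) for n in range(-2, 6)])  # zeros outside the four-sample support
```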
Shifting a Discrete Signal
If we have a signal x[n], we can apply the shift operator to it, defined as y[n] = x[n − k]; then y[n] is just the signal x[n] shifted by k. If k > 0, the signal is shifted to the right (delayed); otherwise, it is shifted to the left (advanced).
Visual Representation
Why the sign feels backwards
When you see x[n − k] with k > 0, instinct says "subtracting means moving left." But the opposite happens: the signal shifts right.
Here is why. The expression x[n − k] asks: "give me the value that used to live at position n − k." At n = 5 with k = 2, you are reading x[3], a value from earlier in the sequence. So the value that lived at n = 3 is now being read at n = 5. It moved two steps to the right.
Think of it this way: you are not moving the value, you are moving the reading head left by k, which makes the waveform appear to slide right. Subtracting k from the input index delays the signal, and delay is rightward on the n-axis.
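The shift operator can be sketched on a finite array, zero-padding the samples that slide off the edge (a common convention when working with arrays rather than infinite sequences):

```python
import numpy as np

# Sketch of the shift operator y[n] = x[n - k] on a finite array.
# Positive k delays (shifts right); vacated positions are zero-padded.
def shift(x, k):
    y = np.zeros_like(x)
    if k >= 0:
        y[k:] = x[:len(x) - k]
    else:
        y[:k] = x[-k:]
    return y

x = np.array([1.0, 2.0, 3.0, 0.0, 0.0])
print(shift(x, 2))   # the waveform slides right by two steps
print(shift(x, -1))  # negative k slides it left
```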
System
A system is a transformational rule that takes an input signal x[n] and produces an output signal y[n].
We write it as:

$$y[n] = T\{x[n]\}$$

or equivalently x[n] → y[n].
A common example of a system is an amplifier, where your audio signal gets amplified by a factor α (which can be thought of as volume). Mathematically this can be represented like this:

$$y[n] = \alpha\,x[n]$$

Assuming you record audio represented by a signal x[n], a linear amplifier (the system) with α = 2 can produce an output that is twice as loud; changing α is just like increasing or decreasing the loudness of your speech.
Some examples of systems applied to x[n]:
1. Scaling: y[n] = 3x[n], multiply every sample by 3; amplitudes scale up but positions stay the same.
2. Delay: y[n] = x[n − 2], shift the entire signal two steps to the right.
3. Compression: y[n] = x[2n], read every second sample, squashing the signal in time.
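These three systems can be sketched in a few lines of NumPy (toy input values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# 1. Scaling: y[n] = 3 * x[n]
scale = 3 * x

# 2. Delay: y[n] = x[n - 2], zero-padding the first two samples
delay = np.concatenate(([0.0, 0.0], x[:-2]))

# 3. Compression: y[n] = x[2n], keep every second sample
compress = x[::2]

print(scale)
print(delay)
print(compress)
```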
Linearity in a System
When we say a system is linear, it must respect two conditions: additivity and homogeneity:
1. Additivity (The Summation Rule)
If you have two different signals, x1[n] and x2[n], and you feed them into the system separately, they produce y1[n] and y2[n].
If the system is additive, feeding the sum of those signals into the system will produce the sum of the individual outputs:

$$T\{x_1[n] + x_2[n]\} = y_1[n] + y_2[n]$$
This can be shown visually below — first with a system that passes the test, then one that fails:
Additive system — for example y[n] = 2x[n]
Non-additive system — for example y[n] = x[n]², since (a + b)² ≠ a² + b²
2. Homogeneity (The Scaling Rule)
This rule states that if you scale (multiply) the input by a constant factor a, the output is scaled by that same factor. There are no "surprises" or exponential jumps:

$$T\{a\,x[n]\} = a\,y[n]$$

In other words, if a system passes both tests, it is a Linear System.
Linearity allows us to use Decomposition. Since we know how a linear system treats a single point, we can break any complex signal down into those simple parts, see how the system handles them, and then add them back together at the end.
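Both conditions can be probed numerically. A minimal sketch (the two example systems here, a two-point moving average and a squarer, are my own illustrations): feed random signals through a system and check additivity and homogeneity.

```python
import numpy as np

rng = np.random.default_rng(0)

def moving_average(x):
    # A linear system: y[n] = (x[n] + x[n-1]) / 2
    return (x + np.concatenate(([0.0], x[:-1]))) / 2

def square(x):
    # A nonlinear system: y[n] = x[n] ** 2
    return x ** 2

def is_linear(system, trials=5, n=16, tol=1e-9):
    # Numerically probe additivity and homogeneity on random signals.
    for _ in range(trials):
        x1, x2 = rng.normal(size=n), rng.normal(size=n)
        a = rng.normal()
        additive = np.allclose(system(x1 + x2), system(x1) + system(x2), atol=tol)
        homogeneous = np.allclose(system(a * x1), a * system(x1), atol=tol)
        if not (additive and homogeneous):
            return False
    return True

print(is_linear(moving_average))  # True
print(is_linear(square))          # False
```

Passing a finite number of random probes is evidence, not proof, but it catches nonlinear systems immediately: the squarer fails because (a + b)² has cross-terms.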
Time Invariance
In plain English, a system is Time-Invariant if its behavior doesn’t change over time. If you give the system an input today, or you give it that same input tomorrow, the output should be identical, just shifted in time.
Imagine you are testing an audio system (like an echo effect):
- Step 1: You feed the system a signal x[n] and record the output y1[n].
- Step 2: You wait a few seconds (a shift of k) and feed it the exact same signal, x[n − k].
- Step 3: You check the output. If the system is Time-Invariant, the new output will be exactly the same as the first one, just delayed by k: y2[n] = y1[n − k].
Mathematically, a system is time-invariant if shifting the input causes the output to shift in the same way:

$$T\{x[n-k]\} = y[n-k]$$

- Left side: Shift the input first, then put it through the system.
- Right side: Put the input through the system first, then shift the output.
If these two are equal, then we say the system is Time-Invariant. For example:
The Echo Effect: If you clap your hands in a large hall at 2:00 PM, you hear an echo. If you clap your hands exactly the same way at 5:00 PM, the hall produces the exact same echo. The room’s “rules” for bouncing sound didn’t change between 2:00 and 5:00.
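The shift-then-apply versus apply-then-shift test is easy to run in code. A minimal sketch (the echo and the index-dependent gain are my own example systems): the echo passes, and a system whose gain depends on the absolute time index fails.

```python
import numpy as np

def shift(x, k):
    # Shift right by k >= 0, zero-padding on the left.
    y = np.zeros_like(x)
    y[k:] = x[:len(x) - k]
    return y

def echo(x):
    # Time-invariant system: the signal plus a quieter copy 3 samples later.
    return x + 0.5 * shift(x, 3)

def time_stamp(x):
    # Time-varying system: the gain depends on the absolute index n.
    return np.arange(len(x)) * x

x = np.zeros(12)
x[1] = 1.0  # a single clap

k = 4
print(np.allclose(echo(shift(x, k)), shift(echo(x), k)))              # True
print(np.allclose(time_stamp(shift(x, k)), shift(time_stamp(x), k)))  # False
```

The echo gives the same answer whether the clap happens at index 1 or index 5: the room's rules didn't change. The time-stamped gain does not, because clapping later changes the multiplier.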
Another way to see how time invariance works is shown visually below.
The Unit Impulse
Now we introduce one of the most important signals in this blog: the impulse.
The discrete impulse is given by the formula:

$$\delta[n] = \begin{cases} 1, & n = 0 \\ 0, & n \neq 0 \end{cases}$$

We can similarly define a shifted impulse δ[n − k]:

$$\delta[n-k] = \begin{cases} 1, & n = k \\ 0, & n \neq k \end{cases}$$
Signals as a sum of scaled shifted impulses
This is one of the most foundational identities in signal processing. By representing a complete signal as a sum of shifted impulses δ[n − k], each scaled by its amplitude x[k], we can completely decompose a signal into its simplest parts and then reconstruct it perfectly.
Mathematically, this can be represented as:

$$x[n] = \sum_{k=-\infty}^{\infty} x[k]\,\delta[n-k]$$

Because of this identity, we don’t need to analyze how a system reacts to every possible signal that could ever exist. We only need to analyze how it reacts to δ[n].
- If the system is Linear, we can scale and add the responses.
- If the system is Time-Invariant, the response to a shifted impulse is just a shifted version of the original response.
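The decomposition identity can be checked numerically: rebuild a signal (toy values) as a sum of scaled, shifted impulses and confirm nothing was lost.

```python
import numpy as np

def delta(n, k=0):
    # Shifted unit impulse: 1 at index k, 0 elsewhere, over indices 0..n-1.
    d = np.zeros(n)
    d[k] = 1.0
    return d

x = np.array([3.0, -1.0, 0.0, 2.0])

# Rebuild x as the sum of scaled, shifted impulses: sum_k x[k] * delta[n - k]
rebuilt = sum(x[k] * delta(len(x), k) for k in range(len(x)))

print(np.array_equal(x, rebuilt))  # True
```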
The impulse response
The Impulse Response h[n] of a system is the output produced when the input is a single unit impulse, δ[n].
You can think of it as the system’s fingerprint. Imagine striking a bell with a hammer. The sharp, instantaneous hit is like the impulse δ[n]. The ringing sound that follows is the impulse response h[n]. Every bell has its own characteristic ring: some are deep, some are bright, some sustain for a long time, others decay quickly. That ringing pattern reveals the essential physical behavior of the bell.
In the same way, the impulse response reveals the essential behavior of any system.
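The bell metaphor translates directly to code. A minimal sketch, using a hypothetical "bell" that I model as a simple decaying feedback echo: strike it with a unit impulse and record the ringing.

```python
import numpy as np

def bell(x):
    # Hypothetical "bell": a decaying echo, y[n] = x[n] + 0.6 * y[n-1].
    # It is linear and time-invariant, so it has a fixed fingerprint.
    y = np.zeros_like(x)
    prev = 0.0
    for n in range(len(x)):
        prev = x[n] + 0.6 * prev
        y[n] = prev
    return y

# Strike it with a unit impulse and record the ringing: that is h[n].
impulse = np.zeros(6)
impulse[0] = 1.0
h = bell(impulse)

print(h)  # a geometric decay: 1, 0.6, 0.36, ...
```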
The LTI Contract
We have already seen that any discrete-time signal can be written as a sum of shifted and scaled impulses:

$$x[n] = \sum_{k} x[k]\,\delta[n-k]$$
This decomposition becomes extremely powerful when the system is Linear and Time-Invariant (LTI).
- Time-invariance means the system responds the same way no matter when the impulse occurs. If the impulse is shifted to index k, the response is simply a shifted copy of the original impulse response:

$$T\{\delta[n-k]\} = h[n-k]$$

We can see from the above animation that as we shift the impulse by k (δ[n] → δ[n − k]), the impulse response shifts by exactly the same k (h[n] → h[n − k]); nothing else changes, because the system is LTI.
- Linearity means scaling and summing inputs scales and sums the outputs in exactly the same way. If an impulse is multiplied by an amplitude a, its response is multiplied by that same amplitude: T{a δ[n]} = a h[n].
Together, these two properties mean we do not need to test the system with every possible signal. Since every signal is built from impulses, and since we know exactly how the system responds to one impulse, we can construct the response to any signal by summing the responses to all of its component impulses.
Convolution (Sum of scaled shifted impulse responses)
We now derive the convolution sum from first principles. The key idea is simple:
- Any signal can be written as a sum of scaled, shifted impulses.
- A linear system lets us handle each impulse contribution separately, then add the results together.
- A time-invariant system responds to a shifted impulse by shifting the impulse response.
Putting these facts together forces the convolution formula. Let’s derive it step by step:
Decompose the input into shifted impulses
Any discrete-time signal can be written as:

$$x[n] = \sum_{k=-\infty}^{\infty} x[k]\,\delta[n-k]$$

This means the signal is built from an impulse at every location k, scaled by the amplitude x[k]. So each sample can be viewed as a scaled impulse x[k] δ[n − k].
Apply the system to the whole signal
Let the system be T, and let the output be y[n]. Then:

$$y[n] = T\{x[n]\}$$

Substitute the impulse decomposition of x[n]:

$$y[n] = T\left\{\sum_{k} x[k]\,\delta[n-k]\right\}$$
Use Linearity
Because the system is linear, it preserves sums and scalar multiplication. So we can move the system operator inside the summation:

$$y[n] = \sum_{k} T\{x[k]\,\delta[n-k]\}$$

And since the amplitude x[k] is just a scalar constant with respect to the system, we can pull it out:

$$y[n] = \sum_{k} x[k]\,T\{\delta[n-k]\}$$

So each input sample x[k] scales the system’s response to an impulse at k.
Define the Impulse Response
Now define the impulse response of the system as:

$$h[n] = T\{\delta[n]\}$$
This is the output produced when the input is a unit impulse at index 0.
Use Time-Invariance
Because the system is time-invariant, shifting the input impulse by k shifts the output by the exact same amount. Since δ[n − k] is just the impulse shifted to index k:

$$T\{\delta[n-k]\} = h[n-k]$$

So the response to an impulse at k is just a shifted copy of the impulse response.
Combine Both Results
Substitute this time-invariant property into the expression:

$$y[n] = \sum_{k} x[k]\,h[n-k]$$

We obtain the final result:

$$y[n] = (x * h)[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k]$$
This is the Convolution Sum.
The Intuition Behind the Formula
Each input sample x[k]:
- Behaves like a scaled impulse at location k.
- Launches a shifted copy of the impulse response, x[k] h[n − k].
- Contributes that shifted response to the total output.
The final output is the sum of all these shifted, scaled copies:

$$y[n] = \sum_{k} x[k]\,h[n-k]$$
So, convolution can be understood simply as the sum of scaled, shifted impulse responses.
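That sentence translates directly into code. A minimal sketch: build the output by literally summing scaled, shifted copies of the impulse response, then compare against NumPy's built-in convolution.

```python
import numpy as np

def convolve_as_echoes(x, h):
    # Build y by summing scaled, shifted copies of h: each input sample
    # x[k] launches x[k] * h starting at position k.
    y = np.zeros(len(x) + len(h) - 1)
    for k in range(len(x)):
        y[k:k + len(h)] += x[k] * h
    return y

x = np.array([1.0, 2.0, 0.5])
h = np.array([1.0, -1.0, 0.25])

print(convolve_as_echoes(x, h))
print(np.allclose(convolve_as_echoes(x, h), np.convolve(x, h)))  # True
```

Note that this loop never explicitly flips anything; shifting each copy of h to start at position k produces exactly the same numbers as the flipped-kernel formula.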
Full Proof
Decompose the signal:

$$x[n] = \sum_{k} x[k]\,\delta[n-k]$$

Apply the system:

$$y[n] = T\left\{\sum_{k} x[k]\,\delta[n-k]\right\}$$

By linearity:

$$y[n] = \sum_{k} x[k]\,T\{\delta[n-k]\}$$

By time invariance, if T{δ[n]} = h[n], then T{δ[n − k]} = h[n − k]. Therefore:

$$y[n] = \sum_{k} x[k]\,h[n-k]$$

Which is the convolution sum.
In one sentence: Convolution is just the sum of all the shifted impulse responses generated by each input sample, scaled by that sample’s amplitude.
Visualization of Convolution
Each colored row is one term of the sum: the input value x[k] scales a copy of h, and that copy is time-shifted to start at position k. The output at any index n is the vertical sum of every row’s value at that n; in the animation, the rows literally collapse downward into y[n].
The Flip: Where It Comes From
Most explanations present the flipped kernel as a rule you have to memorize. It is not. It falls directly out of the convolution sum, and once you see why, it never feels arbitrary again.
Recall the convolution sum:

$$y[n] = \sum_{k} x[k]\,h[n-k]$$

Fix n at some value and read the argument of h as k increases from 0.
The argument n − k runs backwards as k increases. You are reading h at indices n, n − 1, n − 2, …, that is, in reverse order, sitting over the input window x[0], …, x[n].
That reversed reading is the flipped kernel. Nobody chose to flip it. The subtraction flips it automatically.
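You can watch the flip fall out of the sum at a single output index. A minimal sketch (toy values, and n = 3 chosen arbitrarily): evaluate y[3] straight from the formula, then as a dot product with the reversed kernel.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, 0.5, 0.25, 0.125])

n = 3
# Evaluate y[3] straight from the sum: sum_k x[k] * h[n - k].
# As k runs 0, 1, 2, 3 the kernel index n - k runs 3, 2, 1, 0:
# h is read backwards without anyone asking for a flip.
y3 = sum(x[k] * h[n - k] for k in range(n + 1))

# The same number falls out of a dot product with the reversed kernel.
y3_flipped = np.dot(x[:n + 1], h[:n + 1][::-1])

print(y3, y3_flipped)              # equal
print(np.convolve(x, h)[n] == y3)  # matches NumPy's convolution too
```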
The vertical slice picture
Think of laying out all the shifted impulse responses as rows in a table:

| row | values at each n |
|---|---|
| k = 0 | x[0] h[n] |
| k = 1 | x[1] h[n − 1] |
| k = 2 | x[2] h[n − 2] |
| ⋮ | ⋮ |

Now slice vertically at a fixed n. Reading downward through column n:

x[0] h[n], x[1] h[n − 1], x[2] h[n − 2], …

The second argument of h steps backwards: n, n − 1, n − 2, … You are reading h from its latest value back to its earliest, exactly a reversed h aligned over the input at position n.
Why the sliding view needs the flip
In the sliding kernel view, the kernel window moves left to right over the input. At output position n, the rightmost box of the window touches x[n], the newest input, and the leftmost touches x[n − (K − 1)], the oldest input in the window.
For the products to match the convolution sum, h[0] must multiply the newest input (x[n]) and h[K − 1] must multiply the oldest. But if you lay h out left to right without flipping, h[0] ends up on the oldest input. Flipping h corrects the alignment so the products are identical to what the impulse sum computes.
One sentence: the flip is the geometric consequence of the minus sign in h[n − k], and that minus sign is what makes later inputs subtract from the current time index, which is the definition of a causal delay.
Visual Representation
How to Read This Animation
The cyan bracket in the top half and the kernel boxes in the bottom half are always showing you the same thing — just from two different angles. Every time you press Next, both move together by one step. The products listed on the right of the table are identical to the products shown inside the kernel boxes. The output at the bottom right of the sliding view matches the green cell in the sum row. They are not two different calculations. They are two ways of reading the same table.
Two Ways to Think About Convolution
Way 1 — The Impulse Response View (read by row)
This is the original physical intuition. You feed the system one input sample at a time. Each sample fires the system, and the system responds with a scaled, shifted copy of its impulse response h[n]:
So x[0] fires a copy of h at position 0 with height x[0].
Then x[1] fires a copy at position 1 with height x[1].
Then x[2] fires a copy at position 2 with height x[2].
And so on.
The output is what you get when all those echoes land on top of each other and add. In the table, each row is one echo. Reading across a row you see where that echo travels. Reading down a column you see every echo that has landed at that moment in time.
This view answers the question: “which input samples contributed to this output?”
Way 2 — The Sliding Kernel View (read by column)
This is the computational view. Instead of thinking about firing echoes, you slide a small window across the input and compute a dot product at each position.
At output position n, the flipped kernel sits over K consecutive input samples. You multiply each kernel weight by the input value directly beneath it, then sum. That sum is y[n].
This view answers the question: “what input values is the kernel currently looking at?”
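The sliding view is exactly a windowed dot product. A minimal sketch (toy values, "valid" positions only, i.e. where the kernel fits entirely over the input): flip the kernel once, then slide and dot.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = np.array([0.5, 0.25, 0.125])

# Sliding-kernel view: at each position, take the dot product of the
# flipped kernel with the input window directly beneath it.
flipped = h[::-1]
valid = np.array([np.dot(flipped, x[i:i + len(h)])
                  for i in range(len(x) - len(h) + 1)])

print(valid)
print(np.allclose(valid, np.convolve(x, h, mode="valid")))  # True
```

This column-by-column computation produces the same numbers as the row-by-row echo view; they are two readings of the same table.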
The Flip — One Last Time
Notice that the kernel is read backwards: h[K − 1], …, h[1], h[0]. Why?
In the impulse view, row k contributes x[k] h[n − k] to column n. As k increases from 0 to n down the column, the argument n − k decreases from n to 0. You are reading h backwards through the column.
In the sliding view, the leftmost kernel box touches the oldest input in the window, and the rightmost touches the newest. For the products to match the impulse view, h[0] must be paired with the newest input, which means it must sit on the right of the kernel window. Flipping h puts h[0] on the right.
The flip is not an arbitrary convention. It is the geometric consequence of the minus sign in h[n − k].