What is the correct cost formula for the average case of the insertion sort algorithm?
I have been trying to learn about the cost of the insertion sort algorithm lately. I already understand the best case, worst case and average case formulas (namely $n-1$, $\frac{n(n-1)}{2}$ and $\frac{n(n-1)}{4}$, respectively). But I noticed that these formulas don't seem to work for small values of $n$. If we take, for example, an array of length $n=3$, with $T(n)$ denoting the number of comparisons needed, we get $T_{\text{best}}(3)=3-1=2$, $T_{\text{worst}}(3)=\frac{3(2)}{2}=3$, and $T_{\text{avg}}(3)=\frac{3(2)}{4}=\frac{3}{2}=1.5$. This is very strange, since the average case comes out more efficient than the best case, which is impossible. That means that $\frac{n(n-1)}{4}$ doesn't hold in this situation. So I looked at the graphs of these functions:
It seems like these formulas only work for $n\ge4$. Is that correct? Does that basically mean that, whenever we have a cost formula for an algorithm, it becomes more accurate as $n$ grows larger?
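To sanity-check the formulas, here is a small brute-force sketch (assuming "cost" means the number of key comparisons made by the textbook insertion sort; the page I link below may count slightly differently):

```python
from itertools import permutations

def insertion_sort_comparisons(values):
    """Count key comparisons made by the textbook insertion sort."""
    a = list(values)
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1        # one comparison of a[j] against key
            if a[j] > key:
                a[j + 1] = a[j]     # shift the larger element right
                j -= 1
            else:
                break
        a[j + 1] = key
    return comparisons

for n in range(2, 7):
    counts = [insertion_sort_comparisons(p) for p in permutations(range(n))]
    average = sum(counts) / len(counts)
    print(f"n={n}: best={min(counts)}, worst={max(counts)}, "
          f"exact average={average:.3f}, n(n-1)/4={n * (n - 1) / 4:.3f}")
```

For every $n$ the exact average lands between the best and the worst case (for $n=3$ it is about $2.67$, not $1.5$), which makes me suspect that $\frac{n(n-1)}{4}$ is only an approximation.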
I also came across this page, which says that the formula $\frac{n(n-1)}{4}$ is actually wrong. To start with, I didn't understand the part where he came up with the first formula:
$\frac{1}{2}(1_{\text{element is in place}} + i_{\text{element is smallest yet}})$
According to the answer it should have been $$\sum_{i=1}^{n-1} \frac{i+1}{2} = \frac{(n-1)n}{4} + \frac{n-1}{2} = \frac{(n-1)(n+2)}{4}$$ instead. Can someone please explain to me how he came up with this? And why is the original one, $\frac{n(n-1)}{4}$, wrong?
1 answer
The difference (and the page that you cite does mention it, if only broadly) is that algorithmic analysis has interests and (arguably) traditions that most people find counter-intuitive. Specifically, an algorithm's complexity describes how its resource usage (usually time, but not always) grows asymptotically, which is why a lot of textbooks call it "asymptotic notation" as a reminder.
This means a few things.
- We only care about the fastest-growing expression. This has theoretical reasons in terms of simplifying the analysis to whatever part of the algorithm dominates the conversation, but also practical applications, in that optimizing the piddly logarithmic part of an algorithm doesn't matter when another part runs in factorial time. I'll call this out at the end, because this answers your "bigger" question.
- The complexity only "matters" where it's monotonic, only moving in one direction. If adding another input element sometimes increases the resource usage and sometimes decreases it, none of the analytical methods can deal with that.
- We pretend that we have infinite input, because (a) that's where the function is most likely to be monotonic, and (b) algorithmic analysis (and optimization) matters most in streaming situations, where you can either keep up with the input or fail miserably. Also, for small input sizes, you could just preprocess everything and not care at all, no matter how long it takes. This is so ingrained that weirdness like what you noticed for small values gets called (and dismissed as) a "startup anomaly": something mildly interesting if you only ever deal with those small amounts, but forgotten quickly as the inputs get bigger.
- Finally, we mostly ignore constants (addends and factors), because we can resolve those by spending money on better hardware.
When somebody says that an analysis is "wrong" because it overlooks a term in the sum (as the answer that you pointed to does), they mean that they'd rather be more precise. But as that answer indicates, it probably doesn't matter, because for most purposes in theory and practice we only care about the dominant term, the quadratic-growth part.
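To make the "dominant term" point concrete, here is a throwaway comparison (nothing algorithm-specific, just the two closed-form formulas from your question):

```python
# Compare the formula from the question with the corrected one from the
# linked answer. Both grow like n^2/4, so the relative gap vanishes.
for n in (3, 10, 100, 10_000, 1_000_000):
    rough = n * (n - 1) / 4           # n(n-1)/4, the formula in question
    fuller = (n - 1) * (n + 2) / 4    # (n-1)(n+2)/4, with the linear correction
    print(f"n={n:>9}: rough={rough:.1f}, fuller={fuller:.1f}, "
          f"relative gap={(fuller - rough) / fuller:.2%}")
```

At $n=3$ the gap is a hefty 40%, which is exactly the small-input weirdness you noticed; by $n=100$ it is already under 2%, and it keeps shrinking like $\frac{2}{n+2}$.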
As for where that extra term comes from, think of summation like integration, as in calculus.
$$\sum_{i=1}^{n-1} \frac{i + 1}{2} \approx \int \frac{i + 1}{2}\,di = \int \frac{i}{2}\,di + \int \frac{1}{2}\,di = \frac{i^2}{4} + \frac{i}{2}$$

Evaluating that at roughly $i = n$ gives $\frac{n^2}{4} + \frac{n}{2}$, which matches the shape of the corrected sum from your question. Don't actually do this on an exam, by the way. It will give you the right answer, but summation and integration are technically distinct: calculus only works on continuous functions, while we treat computers as discrete and cost measurements as isolated data points, so mixing the operations will make any serious instructor extremely uncomfortable...
And technically, we should add $+\,c$ to the end of that result, too, an "arbitrary constant," because if we take the derivative of the result, any added constant (anything from $-\infty$ to $\infty$) disappears and we get back the same function that we integrated/summed. But again, we drop that, because (a) we didn't really want to integrate in the first place, and (b) a constant-time term wouldn't matter for any serious algorithmic analysis.
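And if you would rather avoid the calculus hand-waving entirely, splitting the sum term by term gives exactly the expression quoted in your question, with no arbitrary constant to argue about:

$$\sum_{i=1}^{n-1} \frac{i+1}{2} = \frac{1}{2}\sum_{i=1}^{n-1} i + \frac{1}{2}\sum_{i=1}^{n-1} 1 = \frac{1}{2}\cdot\frac{(n-1)n}{2} + \frac{n-1}{2} = \frac{(n-1)n}{4} + \frac{n-1}{2} = \frac{(n-1)(n+2)}{4}$$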