There are two widely known laws in parallel computing that are counterpoints to each other, but with thought they can be applied together to achieve dramatic improvements in parallel computing. They two laws are Amdahl's Law and Gustafson's Law. Roughly speaking, Amdahl's says you cannot speed a system up by parallelization to be faster than the the slowest portion. Using a simple metaphor, if you have a car, a bicycle, and a turtle, when traveling in a pack, you can only move as fast as the turtle. The counterpoint to this is Gustafson's law, which roughly states that for parallel problems, adding more resources can increase the net work completed. Using are above metaphor, if we add more turtles (or cars or bicyclists) we are still increase the amount distance travelled per unit time. I do apologize to Computer Scientists as this is a very rough approximation, but this way of thinking about the situation can help us understand how to use some trickery and slight of hand to make things "impossibly" faster/better.
Using these concepts, let's talk about a real problem: Suppose we have a web application that needs to do 15 seconds of computing in order to present a visual representation of some data to a user. We happen to know that the slowest portion takes 5 seconds and must be performed serially. furthermore, let's suppose that the we have a system, that based on a user's name, can compute the following:
 Compute user's favorite number  1 second
 Compute user's favorite food  2 seconds
 Compute user's favorite pet  3 seconds
 Compute user's Birthday  4 seconds
 Compute user's favorite color  5 seconds
Applying Amdahl's law, we can conclude that the fastest parallel implementation should take 5 seconds to finish computing (and require 3 processors). That is, if we started all 5 computations at the same time, we would not be fully complete until 5 seconds had passed. Optimally, with three processors: 1 processor could compute favorite color, one processor could compute Birthday + favorite number, and one processor could compute favorite food + favorite pet. This would shorted the entire transaction time to 5 seconds.
Thinking about the problem in terms of Gustavson's Law, however, we also discover that adding another 2 processors gives us an opportunity to do more work (and/or support more capacity). Put another way, suppose we also know we can also do the following:
 Compute user's favorite song  1 second
 Compute user's shoe size  2 seconds
 Compute user's favorite flavor ice cream  3 seconds
 Compute user's favorite drink  4 seconds
Knowing this, by simply adding two more processors, we can return almost double the data for our users. Furthermore, applying some web trickery, pipelining, and user experience magic, we can potentially fool the user into thinking that every operation takes 1 second. For example, we know that a user will typically only look at two data items at a time and, they will typically look at each screen for 2 seconds, we can display data is up to 15 seconds old, caching the data on the client is possible, and there is no strong preference for which data item they may want to look at first (yes, these are quite a few suppositions, but not unusual), we could do the following:
 break the display into 5 sections (in order of display):

 favorite song + favorite number: Reachable after 1 second
 shoe size + favorite food: Reachable after 3 seconds
 favorite flavor ice cream + favorite pet: Reachable after 5 seconds
 Birthday + favorite drink: Reachable after 7 seconds
 favorite color: Reachable after 9 seconds
 Add more processors and schedule the work as follows (there are other combinations that also work):
 Processor 1: favorite number, then favorite food (number is done after 1 second, food after 3 seconds...processor is idle for 4 seconds)
 Processor 2: favorite song, then shoe size (song is done after 1 second, shoe size after 3 seconds...processor is idle for 4 seconds)
 Processor 3: favorite pet, then Birthday (pet is done after 3 seconds, birthday after 7 seconds...processor is fully utilized
 Processor 4: favorite ice cream, then favorite drink (ice cream is done after 3 seconds, favorite drink after 7 seconds, processor is fully utilized)
 Processor 5: favorite color (color is done after 5 seconds, processor is idle for 2 seconds)
By fiddling with the user experience, we've stretched our window of time and amount of compute resources available for the longer running serial operations. This means we could even optimize a bit further if we wanted by moving things around in the experience. What we've effectively done is given ourselves an opportunity to give the appearance that a 15 second operation actually takes no longer than 1 second. Note, this DOES require some compromise, but the general approach is one that can yield practical benefits even when traditional logic (and thinking purely from a sound theoretical perspective) might say it isn't possible.