This article assumes some degree of familiarity with data structures such as arrays, lists, and maps (also called dictionaries or associative arrays). If you’re not already familiar with these, the Wikipedia page on data structures might be a good start.
tl;dr: A data structure with a worse time complexity might be faster for a given operation than another one with a better time complexity, under the right circumstances. It’s often the case with small collections, where a list or an array might outperform a hash table for the “contains” operation despite their time complexities being respectively O(n) and O(1). Many abstractions provided by programming languages already do some form of optimization, such as the Set class in Scala. Redis does something similar for hashes, sets and sorted sets.
Does this collection contain this item?
Let’s assume we have some integers in memory, and we want to check if a value entered by the user is contained within these integers. A potential solution in JS would look like this:
let numbers = [1, 2, 3, 4, 5]
numbers.includes(2) // => true
numbers.includes(6) // => false
The Array.includes function does exactly what we want, great. You’re happy with your solution, you commit this, open a PR, and receive the following comment:
We should use a Set instead of an Array here, and use the has method instead of includes, since it has an O(1) complexity (includes is O(n)).
The second half of the comment, about time complexity, is true; the first part, though, is debatable. Should we use the one with the better time complexity? If so, why?
Let’s look at some numbers with the benchmark npm package:
var Benchmark = require('benchmark');
var suite = new Benchmark.Suite;
var set = new Set();
set.add(1);
set.add(2);
set.add(3);
var array = [];
array.push(1,2,3);
suite.add('Set.has (small)', function() {
set.has(1)
set.has(2)
set.has(3)
set.has(0)
})
.add('Array.includes (small)', function() {
array.includes(1)
array.includes(2)
array.includes(3)
array.includes(0)
})
.on('cycle', function(event) {
console.log(String(event.target));
})
.on('complete', function() {
console.log('Fastest is ' + this.filter('fastest').map('name'));
})
.run({ 'async': true });
The following is the result I got on my machine (a 2020 Mac mini):
Set.has (small) x 25,580,981 ops/sec ±0.67% (93 runs sampled)
Array.includes (small) x 1,014,367,099 ops/sec ±0.28% (99 runs sampled)
Fastest is Array.includes (small)
The Array version is faster, by a lot; we’re looking at a ~40x difference here. Another takeaway: both versions are fast enough for many use cases. Even the Set version, the slowest here, ran ~25 million operations per second. On average each iteration (four has calls) took ~0.00000004s, that’s 0.00004ms, 0.04μs, or 40ns. For the includes version it is 0.000000001s, 0.000001ms, 0.001μs, or 1ns.
I ran the same benchmark on the cheapest Digital Ocean droplet on Ubuntu and got similar relative results, everything being slower: ~10 million ops/s for Set.has and ~310 million ops/s for Array.includes.
The conclusion here is that for a teeny tiny collection, using Node.js 15, an Array is way faster than a Set. Things might be different in a browser; I don’t know how the different engines have implemented Arrays and Sets.
So why do we care about time complexity if it’s not an indicator of speed? It’s because it tells us how different the execution time will be depending on how many “things” we’re dealing with. That’s the n piece, in this case, the number of items in the Array or Set.
Let’s run the same benchmark, but with a larger collection this time. I rewrote it so we can parameterize the size of the collection. The items we search for in the collections are always the same, to try to make it an apples-to-apples comparison. The code is available in this Gist.
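The Gist isn’t reproduced here, but a minimal sketch of such a parameterized benchmark, with illustrative sizes and names rather than the exact Gist code, could look like this:
var Benchmark = require('benchmark');

function benchmarkForSize(size) {
  // Build an Array and a Set holding the same `size` integers.
  var array = [];
  for (var i = 1; i <= size; i++) array.push(i);
  var set = new Set(array);

  // Always look up the same values: three present, one absent (0),
  // so each run does comparable work regardless of the collection size.
  new Benchmark.Suite()
    .add('Set.has (' + size + ')', function() {
      set.has(1); set.has(2); set.has(3); set.has(0);
    })
    .add('Array.includes (' + size + ')', function() {
      array.includes(1); array.includes(2); array.includes(3); array.includes(0);
    })
    .on('cycle', function(event) { console.log(String(event.target)); })
    .run();
}

[3, 10000, 40000, 50000].forEach(benchmarkForSize);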
Up until 40,000 items, the Array lookup is faster, ~1 billion ops/s vs ~20 million ops/s. With 50,000 items, the Set version still runs about the same number of operations per second, ~20 million, but the Array one drops to ~500,000 ops/s.
That’s why we care about time complexity: not because it tells us if one version is faster than another, but because it tells us how the running time changes depending on the input size. O(1), aka “constant time complexity”, tells us that the Set version will take roughly the same time regardless of how many items we throw at it. On the other hand, O(n), aka “linear time complexity”, gets slower and slower as the input size grows. In the Array case, this makes sense: to return a response for array.includes(0), we need to iterate through the entire array before returning false, so the number of iterations is directly correlated to the size of the array.
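To make that concrete, here is roughly what a hand-rolled linear includes looks like. This is a simplified sketch; the real Array.prototype.includes also handles details like NaN and sparse arrays:
function linearIncludes(array, value) {
  // Worst case (value absent): we visit all n elements before giving up.
  for (var i = 0; i < array.length; i++) {
    if (array[i] === value) return true; // found it, exit early
  }
  return false; // compared against every element, no match
}

linearIncludes([1, 2, 3, 4, 5], 0) // => false, after five comparisons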
So with this newfound knowledge, should we change the implementation? If it was up to a vote, I’d vote for using a Set, but not because it has a better time complexity; let’s be honest, in many, many cases the time difference observed here doesn’t make a difference. I’d vote for it because it conveys the intent a little more explicitly: by using a Set you’re saying that you’re dealing with a unique collection of integers, whereas an Array is not as specific.
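For the example from the beginning, the Set version would look something like this (the variable name is mine):
let numbers = new Set([1, 2, 3, 4, 5]) // duplicates are impossible by construction
numbers.has(2) // => true
numbers.has(6) // => false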
Is this specific to JavaScript?
No.
I ran a similar test in Ruby, comparing an Array, a Hash and a Set, and obtained the following results with, respectively, three items in the collections, and then 1,000,000:
Warming up --------------------------------------
Set (3) 592.205k i/100ms
Array (3) 782.067k i/100ms
Hash (3) 863.547k i/100ms
Calculating -------------------------------------
Set (3) 5.915M (± 1.1%) i/s - 29.610M in 5.006554s
Array (3) 7.771M (± 1.4%) i/s - 39.103M in 5.032784s
Hash (3) 8.627M (± 0.7%) i/s - 43.177M in 5.004990s
Comparison:
Hash (3): 8627331.0 i/s
Array (3): 7771408.3 i/s - 1.11x (± 0.00) slower
Set (3): 5915026.1 i/s - 1.46x (± 0.00) slower
Warming up --------------------------------------
Set (1000000) 578.593k i/100ms
Array (1000000) 3.159k i/100ms
Hash (1000000) 794.715k i/100ms
Calculating -------------------------------------
Set (1000000) 5.702M (± 1.0%) i/s - 28.930M in 5.074141s
Array (1000000) 31.556k (± 1.4%) i/s - 157.950k in 5.006397s
Hash (1000000) 7.955M (± 1.7%) i/s - 40.530M in 5.096204s
Comparison:
Hash (1000000): 7955389.3 i/s
Set (1000000): 5701948.8 i/s - 1.40x (± 0.00) slower
Array (1000000): 31555.9 i/s - 252.10x (± 0.00) slower
With only three elements, the Set is the slowest option, and, surprisingly, the Hash is the fastest. I did not dig into the C source of Ruby to figure out why, but I have to assume that there are a lot of internal optimizations that skip over some of the unnecessary steps. This is likely similar to what Scala does, which we will take a look at in the next section.
With a larger collection, 1,000,000 items, the Array becomes the slowest option, as expected, since it has an O(n) time complexity: the larger the array, the slower it is to iterate through it. The Hash and Set have similar numbers, thanks to their constant time complexity.
The code is available in this Gist.
Sets and Maps in Scala
I did not run any benchmarks in Scala, but I thought it was really interesting to look at the source of Set and Map in the standard library.
Scala’s standard library provides Set. What makes it interesting is that Set is a trait (if you’re not familiar with Scala, it’s very similar to a Java interface), and the actual class that you get depends on the size of the set:
scala> Set(1).getClass
val res0: ... scala.collection.immutable.Set$Set1
scala> Set(1, 2).getClass
val res1: ... scala.collection.immutable.Set$Set2
scala> Set(1, 2, 3).getClass
val res2: ... scala.collection.immutable.Set$Set3
scala> Set(1, 2, 3, 4).getClass
val res3: ... scala.collection.immutable.Set$Set4
scala> Set(1, 2, 3, 4, 5).getClass
val res4: ... scala.collection.immutable.HashSet
scala> Set(1, 2, 3, 4, 5, 6).getClass
val res5: ... scala.collection.immutable.HashSet
Scala has five different concrete classes here, all implementing the Set trait: Set1, Set2, Set3, Set4 and HashSet. The first four are very similar: they are written as classes with, respectively, one, two, three and four members. This makes the implementation of the contains method (similar to has in JS) pretty concise:
def contains(elem: A): Boolean = elem == elem1
elem1 is the single member of the Set1 class.
And for Set2, Set3 and Set4:
// Set2:
def contains(elem: A): Boolean = elem == elem1 || elem == elem2
// Set3:
def contains(elem: A): Boolean =
elem == elem1 || elem == elem2 || elem == elem3
// Set4:
def contains(elem: A): Boolean =
elem == elem1 || elem == elem2 || elem == elem3 || elem == elem4
You can see the source of all four in the Set.scala file.
The final class, HashSet, is the “real” one, where the size of the Set can be anything. It uses a hashing algorithm to implement the O(1) contains method. The source is available in the HashSet.scala file.
Scala does the same thing with Map, where Map1 has two members, a key and a value, Map2 has four members, two keys and two values, and so on. The source is available in the Map.scala file. For Map instances containing more than four pairs, it uses a HashMap.
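Scala’s sources aside, the general trick, picking a representation based on size, is easy to sketch in any language. Here’s a JavaScript illustration of the idea; this is a sketch of the principle, not how Scala (or any JS engine) actually implements it:
// In the spirit of Scala's Set1..Set4 vs HashSet: small collections use
// a plain array and a linear scan, larger ones delegate to a hash-based Set.
var SMALL_LIMIT = 4; // Scala switches to HashSet above four elements

function makeSet(items) {
  var unique = Array.from(new Set(items)); // deduplicate the input
  if (unique.length <= SMALL_LIMIT) {
    // Small case: linear scan, like Set1..Set4's chained comparisons.
    return { contains: function(x) { return unique.includes(x); } };
  }
  // General case: hash lookup, like HashSet.
  var set = new Set(unique);
  return { contains: function(x) { return set.has(x); } };
}

makeSet([1, 2, 3]).contains(2)       // => true, via the linear scan path
makeSet([1, 2, 3, 4, 5]).contains(9) // => false, via the hash lookup path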
Hashes, Sets and Sorted Sets in Redis
Redis provides Hashes, Sorted Sets and Sets, among other data types. For each of these, it uses two different underlying data structures, one for small collections, and one for larger collections.
For Hashes, if the number of items is below the hash-max-ziplist-entries config value (which defaults to 128 in recent versions of Redis), Redis uses a ziplist, which is essentially an optimized linked list: no hashing algorithm, no collisions, none of that. If you send the HEXISTS command to Redis for a hash below that threshold, it will iterate through the list of keys and values until it finds the key, or doesn’t. You can see it for yourself in the source.
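You can check which encoding Redis picked with the OBJECT ENCODING command. A quick example from redis-cli, run against a Redis 6 instance (Redis 7 renamed the small encoding to listpack):
127.0.0.1:6379> HSET small-hash field1 value1
(integer) 1
127.0.0.1:6379> OBJECT ENCODING small-hash
"ziplist"
127.0.0.1:6379> CONFIG SET hash-max-ziplist-entries 1
OK
127.0.0.1:6379> HSET small-hash field2 value2
(integer) 1
127.0.0.1:6379> OBJECT ENCODING small-hash
"hashtable"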
For Sorted Sets, Redis uses a similar approach: it also stores small collections in a ziplist. This is controlled by the zset-max-ziplist-entries config, which has a default value of 128.
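The same check works for Sorted Sets:
127.0.0.1:6379> ZADD small-zset 1 member-1
(integer) 1
127.0.0.1:6379> OBJECT ENCODING small-zset
"ziplist"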
Redis Sets use a hash table under the hood, but only when the set contains non-integer values, or when the number of entries is greater than set-max-intset-entries, which has a default value of 512. Below that, a set of integers is stored as an intset, which is essentially a sorted array of integers.
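And for Sets, where adding a single non-integer member is enough to trigger the conversion:
127.0.0.1:6379> SADD small-set 1 2 3
(integer) 3
127.0.0.1:6379> OBJECT ENCODING small-set
"intset"
127.0.0.1:6379> SADD small-set not-a-number
(integer) 1
127.0.0.1:6379> OBJECT ENCODING small-set
"hashtable"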
Conclusion
If I have one piece of advice to wrap this post with: please do not use this as a justification to replace a Set in JavaScript or Ruby with something else and justify it with “but it is faster!”. Unless it matters for your application that the has/contains operation takes 1ns instead of 40ns, and I think it’s safe to say that in many cases, it doesn’t.
I personally try to avoid the terms “fast” and “slow” as much as possible. They tend to be meaningless without the appropriate context. Instead, I like to ask “is it fast enough?”, or “this is not the fastest implementation, but does it matter?”.
And if you want to learn more about Redis and how it actually works, I’m writing a book about it: Rebuilding Redis in Ruby.
The pedantic corner
I took some liberties with the terms “data structures” and “abstractions” in this article. They’re not the same thing, and it can be important to differentiate them. According to the Wikipedia page for Abstract Data Types (ADTs):
An abstract data type is defined by its behavior (semantics) from the point of view of a user, of the data, specifically in terms of possible values, possible operations on data of this type, and the behavior of these operations. This mathematical model contrasts with data structures, which are concrete representations of data, and are the point of view of an implementer, not a user.
With this definition, if we were to label the elements we looked at in the Scala example, the Set and Map traits are ADTs, and HashSet and HashMap are the data structures providing the actual implementations. Set1, Map1 and the other similar ones would be categorized as Record data structures.