Daily Coding Problem 4
This problem was asked by Stripe.
Given an array of integers, find the first missing positive integer in linear time and constant space. In other words, find the lowest positive integer that does not exist in the array. The array can contain duplicates and negative numbers as well.
For example, the input [3, 4, 1, 1]
should give 2
. The input [1, 2,
0]
should give 3
.
You can modify the input array inplace.
Solution overview
Constraints and assumptions

0
is not considered a positive integer. In the first example, the answer is expected to be2
. If0
were considered a positive integer, the answer would have been0
. 
Linear time complexity, also known as
O(n)
. To quote Wikipedia... there is a constant
c
such that the running time is at mostcn
for every input of sizen
. 
Constant space complexity, also known as
O(1)
. To quote StackOverflow...
O(1)
denotes constant (10, 100 and 1000 are constant) space and DOES NOT vary depending on the input size (sayN
).
What do these constraints mean?
Linear time complexity
Linear time complexity limits the algorithms that can be used to solve this problem. For example, a comparisonbased sort cannot
perform better than O(n log n)
in the worst case, which means that we cannot use any
comparisonbased sorting algorithm to solve this problem.
Having said that, there are noncomparison sorting algorithms that, at least on paper, can have a linear worstcase time complexity.
Constant space complexity
This means that we cannot change the space used by the algorithm as a function of the input size. Or, in other words, I can't simply create a minheap, for example, and ask for the minimum value because the heap's size will change each time depending on the size of the input array. In case someone is thinking about creating a massive data structure to use for each execution of the algorithm (e.g. always create an Array of size MAX_INT for execution, regardless of the size of the input array)  while I guess that this is theoretically constant space, we run into the following problems:
 This is a hugely wasteful approach to solving the problem.
 This solution will fail if we can use a data structure larger than addressable memory, for example a data structure that can be stored on disk.
 I don't believe this is in the spirit of the problem.
On a side note, the final solution will not be pretty because:
 Most of F#'s data structures, by default, are immutable. Luckily, Arrays are not.
 F# encourages a functional programming style and, I believe, the final code will have to be more procedural.
So, how do we solve it?
I thought of a lot of solutions to solve this problem that can meet the given constraints.

Construct a minheap / binary treestyle data structure and find the root (while filtering out negative
values). Violates space complexity

After creating the heap / tree, filter the values to remove those entries where
value + 1
is also in the structure.


Use 2 sets and perform a set difference. Violates space complexity
 Create a set using the original values (call it
Set(orig)
)  Create a set of original values + 1 (call it
Set(plusone)
) 
Do
Set(plusone)  Set(orig)
and find the smallest positive integer
 Create a set using the original values (call it

Sort the array and find the first entry where the following entry's value is not
current entry + 1
. Violates space and time complexity
In other words, using a 0indexed array:
1: 2: 3:
for i in 0 .. array.length  2 do if array[i] + 1 <> array[i + 1] then return array[i] + 1

This can work if I can use a linear time and constant space sort. As far as I know, there is no
such sort.
 Comparison sorts violate time constraint
 Noncomparison sorts violate space constraint


Recognize that if an array has length
n
, the answer can be, at most,n + 1
.
Consider an array that is 1indexed.
 If all the values in the array are
0
or negative, the answer is1
.  If all the values in the array are
>= 2
, the answer is1
.  If the value at each index of the array is the same as the index, the answer is
n + 1
.
 If all the values in the array are
 While I haven't done anything close to a formal proof for this, I believe I am right about this.
 It took me a long time to arrive at this. I had been working on this problem in my head for a couple of weeks (mostly while commuting to and from work) before I came to this realization. I also had to force myself to think procedurally, like in C/C++, to get to this answer.

Consider an array that is 1indexed.
Code
The following is the solution I came up with.
Strategy for the code
At a high level, the code adopts the following strategy:

First pass. Places values in their corresponding indices.
1: 2: 3:
for each entry do while the entry's value is not 0 and is not index + 1 do swap values with the index represented by the value in the entry

Second pass. Find the first entry where the value is
0
, returnindex + 1
.1: 2: 3:
Scan the array for the first entry with value of 0 If found, return index + 1 Else, return array length
When reading the code, there are two items to remember. These may seem obvious, but I had to keep reminding
myself courtesy of 0based arrays in a situation where 0
has special meaning:
 Given an
index
, the value there should beindex + 1
, i.e.value = index + 1
 Given a
value
, the index it belongs to should bevalue  1
, i.e.index = value  1
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 

Alternate solutions
In order to test this solution, we need to write a few alternate solutions to ensure that the results are correct.
I suggested three solutions above, and have implemented two of them below.
Setbased solution
The first alternate solution solves the problem by creating two sets and performing set subtraction. Sets will automatically remove duplicate values, so I do not need to handle that situation explicitly.
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 

Sortbased solution
In the second alternate solution, I solve the problem by sorting the array and finding the first pair of values that is not consecutive. Duplicate values are explicitly removed to avoid issues with comparison later.
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 

Utility functions
Some utility functions to work with 3tuples and to collect performance data using Windows Performance Counters.
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: 37: 38: 39: 40: 41: 42: 43: 44: 45: 46: 47: 48: 49: 50: 

Testing
Base tests
Considering the "index math" happening in these solutions, we need to thoroughly test them to find any issues. The first tests are to ensure that the two given test cases work correctly for each of the algorithms.
After much struggling with FsCheck, I'm happy to start using a propertybased testing tool that works with Jupyter and FSharp.Literate, Hedgehog.
I am going to use a stopwatch to help measure runtimes since I am using FSharp.Literate instead of Jupyter
for this post, and I can't figure out how to enable the #time
directive through FSharp.Literate.
1: 2: 3: 4: 5: 6: 

First, let's test the main solver.
1: 2: 3: 4: 5: 


1: 2: 3: 4: 5: 


Next, we test the setbased solver.
1: 2: 3: 4: 5: 


1: 2: 3: 4: 5: 


Finally, we can test the sortbased solver.
1: 2: 3: 4: 5: 


1: 2: 3: 4: 5: 


Additional propertybased tests between the algorithms.
Now that the basic smoke tests are out of the way, we can move on to some more serious propertybased testing.
The first test will exercise all three algorithms over the full range of integers.
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 


The second test tries various combinations of positive integers over larger arrays.
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 


The third test ensures that 1element arrays are handled correctly.
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 


And a final edgecase test with empty arrays to ensure that the algorithms are working correctly.
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 


And the runtime for the base tests is given below.
1:


1: 2: 3: 4: 

Base Test Runtime was:

Performance Testing
Now that the algorithms seem to be working correctly, the next step is to capture some performance metrics.
For this portion, I will be using System.Diagnostics.Stopwatch
to measure each algorithm's speed.
I believe my last post proved that, as of now, I cannot benchmark memory usage reliably.
First, we can setup a generator to generate random values for testing. Similar to the base testing, let's benchmark the runtime of the test data generation.
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 

1: 2: 3: 4: 

Data Generation Runtime was:

Now that we have our profiling data, the next step is to measure performance.
First, let's create a bare minimum set of utility 'tools' to enable the performance monitoring.
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: 37: 38: 

Solver
Let's start by benchmarking the base solver
algorithm.
1:


1: 2: 3: 4: 


Set Solver
The next benchmark is with the setSolver
algorithm.
1:


1: 2: 3: 4: 


Sort Solver
And finally, the benchmark for the sortSolver
algorithm.
1:


1: 2: 3: 4: 


Results
So, what were the results? First, let's chart the runtimes as measured by the stopwatches (algorithm
Alg
, garbage collection GC
, and total time TT
) and some comparative
percentages.
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 

1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 

1: 2: 3: 4: 5: 

Performance Counter statistics
While executing the workbook, I collected a number of values using Performance Counters on Windows. These were collected throughout the life of the book's execution. For descriptions of these counters, please refer to the .NET Performance Counters page.

.NET CLR LocksAndThreads
 # of current logical Threads
 # of current physical Threads

.NET CLR Memory
 # Gen 0 Collections
 # Gen 1 Collections
 # Gen 2 Collections
 Gen 0 heap size
 Gen 1 heap size
 Gen 2 heap size
 Large Object Heap size

Process
 Private Bytes
 % Processor Time
I thought that it would be interesting to chart these values out.
However, the first thing we need to do is to stop collecting the data.
1:


.NET CLR LocksAndThreads, # of Current Threads
1: 2: 3: 

1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 
