This post is the first of a series of 3 posts about natural sorting and an open source library for Xojo® I wrote recently and named
NaturalSortLib. The library is available as a Github repository and licensed under the very permissive MIT License.
What is natural sorting?
Sorting multiple strings means comparing them 2 by 2 and decide which precedes the other. Standard comparison algorithm compare the strings character by character based on their numerical value defined by the string encoding until they find a difference. Then the precedence is defined by the lowest numerical value coming first. Unfortunately, this has a side effect when it comes to sorting numeric values.
Let's consider this array of strings:
If you use a standard comparison to sort them, you'll get this
The problem is obvious. The simplest way to work around it is to add padding zeros before the digits, to enforce a correct sorting as each digit will be placed in its correct significance column: thousands, hundreds, decades, unities.
What if you can't modify the strings? One of the solution is to copy the string with the correct padding in another array and use the
SortWith() method provided by Xojo® to sort the original array. Now, how to handle strings like "foo123bar45"? We need a more generic way to achieve this natural sorting. Unfortunately, natural sorting is not part of Xojo®'s API. So we'll have to devise our own custom solution.
Back to basics
If you're manually sorting the strings above, you consider two kinds of data in the string: the text part(s) and the numeric part(s) for each string and then you compare the different part in the following ways:
- Numeric against numeric
- This is the simplest case. You just compare the numeric value of each side and signal that the the lowest value precedes the other one.
- Text against text
- For this one, character by character comparison and alphabetical order is the reference. For computer strings, it's the order of characters as defined by the encoding standard that prevails.
- Text against numeric
- Since the dawn of computing, when ASCII, with its 95 printable characters, was the only text encoding, digits have always precedes letters, it's commonly admitted that numeric comes first.
If we transpose this in code, we can have split each String in an array of "chunks" of two types: numbers and characters. Then comparing is just a matter of comparing each chunk one by one until we find one that precedes the other. It sounds like we have a plan... Time to do some coding!
StringChunk on the menu
As we are writing a library, let's start by creating a module in Xojo®'s IDE and name it
NaturalSortLib, insert a new class in the module and name it
StringChunk. This class has no super and will hold two kinds of data, so we create an enumeration named
Types and give it two elements:
Public Enum Types As Integer Number Characters End Enum
As a StringChunk instance will have to hold data that can be either numerical or textual, we use an
Auto to store them and a
Type property defined as
Public Property Type as StringChunk.Types Public Property Value as Auto
We can simplify code by storing the data while creating the instance of
StringChunk with this constructor:
Public Sub Constructor(inValue As Auto, inType As StringChunk.Types) Value = inValue Self.Type = inType End Sub
The heart of this class will resides in a single method named...
Compare() that compares a
StringChunk instance to another one. The principe is simple, it takes one parameter
inRightSide As StringChunk and return an
Integer which value tells us about the result of the comparison between the instance calling the method (i.e. the left side) and the instance passed as a parameter (i.e. the right side). If the value is negative the the left side precedes the right side, if its positive, the right side precedes the left side, and if it's 0, then both sides are equals.
Public Function Compare(inRightSide As StringChunk) as Integer If Self.Type = StringChunk.Types.Number And inRightSide.Type = StringChunk.Types.Number Then // The simplest case Return Ctype( Self.Value, Integer ) - Ctype( inRightSide.Value, Integer ) Elseif Self.Type = StringChunk.Types.Characters And inRightSide.Type = StringChunk.Types.Characters Then Dim theLeftSide As String = CType( Self.Value, String ) Dim theRightSide As String = CType( inRightSide.Value, String ) If theLeftSide > theRightSide Then Return 1 Elseif theLeftSide < theRightSide Then Return -1 Else Return 0 End If Else // In a natural comparison, numbers always come first Return If( Self.Type = StringChunk.Types.Characters, 1, -1 ) End If End Function
Finally, as a convenience we can add an operator overloading method for comparison to our class:
Public Function Operator_Compare(inRightSide As NaturalSortLib.StringChunk) as Integer Return Self.Compare( inRightSide ) End Function
StringChunk class is complete. Next week, in part 2, we'll see how to split a string in an array of
StringChunks and how to implement a simple cache mechanism to avoid reprocessing a string when it has been done before.