Preamble

This post is the first of a series of 3 posts about natural sorting and an open source library for Xojo® I wrote recently and named NaturalSortLib. The library is available as a Github repository and licensed under the very permissive MIT License.

What is natural sorting?

Sorting multiple strings means comparing them 2 by 2 and decide which precedes the other. Standard comparison algorithm compare the strings character by character based on their numerical value defined by the string encoding until they find a difference. Then the precedence is defined by the lowest numerical value coming first. Unfortunately, this has a side effect when it comes to sorting numeric values.

Let's consider this array of strings:

  • foo342
  • foo4
  • foo26
  • foo1
  • foo3
  • foo12

If you use a standard comparison to sort them, you'll get this

  • foo1
  • foo12
  • foo26
  • foo3
  • foo342
  • foo4

The problem is obvious. The simplest way to work around it is to add padding zeros before the digits, to enforce a correct sorting as each digit will be placed in its correct significance column: thousands, hundreds, decades, unities.

  • foo0001
  • foo0003
  • foo0004
  • foo0012
  • foo0026
  • foo0342

What if you can't modify the strings? One of the solution is to copy the string with the correct padding in another array and use the SortWith() method provided by Xojo® to sort the original array. Now, how to handle strings like "foo123bar45"? We need a more generic way to achieve this natural sorting. Unfortunately, natural sorting is not part of Xojo®'s API. So we'll have to devise our own custom solution.

Back to basics

If you're manually sorting the strings above, you consider two kinds of data in the string: the text part(s) and the numeric part(s) for each string and then you compare the different part in the following ways:

Numeric against numeric
This is the simplest case. You just compare the numeric value of each side and signal that the the lowest value precedes the other one.
Text against text
For this one, character by character comparison and alphabetical order is the reference. For computer strings, it's the order of characters as defined by the encoding standard that prevails.
Text against numeric
Since the dawn of computing, when ASCII, with its 95 printable characters, was the only text encoding, digits have always precedes letters, it's commonly admitted that numeric comes first.

If we transpose this in code, we can have split each String in an array of "chunks" of two types: numbers and characters. Then comparing is just a matter of comparing each chunk one by one until we find one that precedes the other. It sounds like we have a plan... Time to do some coding!

StringChunk on the menu

As we are writing a library, let's start by creating a module in Xojo®'s IDE and name it NaturalSortLib, insert a new class in the module and name it StringChunk. This class has no super and will hold two kinds of data, so we create an enumeration named Types and give it two elements: Number and Characters:

Public Enum Types As Integer
  Number
  Characters
End Enum

As a StringChunk instance will have to hold data that can be either numerical or textual, we use an Auto to store them and a Type property defined as StringChunk.Types:

Public Property Type as StringChunk.Types
Public Property Value as Auto

We can simplify code by storing the data while creating the instance of StringChunk with this constructor:

Public Sub Constructor(inValue As Auto, inType As StringChunk.Types)
  Value = inValue
  Self.Type = inType
End Sub

The heart of this class will resides in a single method named... Compare() that compares a StringChunk instance to another one. The principe is simple, it takes one parameter inRightSide As StringChunk and return an Integer which value tells us about the result of the comparison between the instance calling the method (i.e. the left side) and the instance passed as a parameter (i.e. the right side). If the value is negative the the left side precedes the right side, if its positive, the right side precedes the left side, and if it's 0, then both sides are equals.

Public Function Compare(inRightSide As StringChunk) as Integer
  If Self.Type = StringChunk.Types.Number And inRightSide.Type = StringChunk.Types.Number Then
    // The simplest case
    Return Ctype( Self.Value, Integer ) - Ctype( inRightSide.Value, Integer )
    
  Elseif Self.Type = StringChunk.Types.Characters And inRightSide.Type = StringChunk.Types.Characters Then
    Dim theLeftSide As String = CType( Self.Value, String )
    Dim theRightSide As String = CType( inRightSide.Value, String )
    
    If theLeftSide > theRightSide Then
      Return 1
      
    Elseif theLeftSide < theRightSide Then
      Return -1
      
    Else
      Return 0
      
    End If
    
  Else
    // In a natural comparison, numbers always come first
    Return If( Self.Type = StringChunk.Types.Characters, 1, -1 ) 
    
  End If
End Function

Finally, as a convenience we can add an operator overloading method for comparison to our class: Operator_Compare().

Public Function Operator_Compare(inRightSide As NaturalSortLib.StringChunk) as Integer
  Return Self.Compare( inRightSide )
End Function

The StringChunk class is complete. Next week, in part 2, we'll see how to split a string in an array of StringChunks and how to implement a simple cache mechanism to avoid reprocessing a string when it has been done before.

Leave a Reply

Your email address will not be published. Required fields are marked *

3 × three =