Previously in Natural Sort

In the first part, we talked about what is natural sorting and discussed one of the many ways to implement it. We also wrote a class named StringChunk that will help us to implement our natural comparison algorithm.

Slicing a String

Before comparing strings we need to identify the text and the numeric parts in the string. The algorithm is very simple: iterate each character of the string, if we find a character that is not a digit, start a Characters chunk, if it's a digit, create a new Number chunk. Process the next character. If it's of the same kind than the previous one, then store it in a buffer and continue. If it's not of the same type, then store the current buffer as a chunk part alongside it's type, then create a new buffer and set the correct chunk type for it. Repeat the process until the end of the string.

Let's use the following string as an example: "image547851-version2.jpg" and that's what we want to get from our algorithm:

Natural Chunks for "image547851-version2.jpg"
String Part Type
image Characters
547851 Numeric
-version Characters
2 Numeric
.jpg Characters

Put it in source code

Let's take a look at the the structure of our algorithm and the variables we need to implement it. The structure is mostly a loop where we handle each character one by one. For this purpose we need a buffer to store the characters of the same kind while we are collecting them until we found a different kind of character(s). We also need to keep track of the type of chunk being collected. The output condition for the loop is when there is no more character to process. So we need to know the total number of characters in the string and an index to keep a track of which character we are processing. Finally, We'll need a boolean flag to handle the first character. We can't just test if i = 1 because if there are control characters at the beginning of the string, they will be filtered out.

The algorithm is implemented as a private method of the NaturalSortLib module because it's for internal use only.

Private Function NaturalSplit(inString As String) as StringChunk()
  Dim theResult() As StringChunk

  // No need to handle an empty string
  If inString = "" Then Return theResult

  // Setup a few variables
  Dim theBuffer() As String
  Dim theBufferType As StringChunk.Types
  Dim theCharCount As Integer = inString.Len
 
  // Let's process each character one at a time
  // The output condition for the loop is when there is no more character to process
  Dim IsFirstCharacter As Boolean = True
  Dim i As Integer = 1
  Do
    // Retrieve the char and its code
    Dim theChar As String = inString.Mid( i, 1 )
    
    // Process the character here....
    [...]

    // Next character
    i = i + 1
    
  Loop Until i > theCharCount
End Function

That's how our code will look like. Once we have the character, let's use its integer code point to determine what to do with it. The first thing is to filter out control codes as they do not have a meaning in natural sorting. Next, we determine if the character is a digit or a non-digit character. Next step is to decide what to do with the character:

  • If it's the first character to be processed, we store the character in the buffer, set theBufferType according to the type and lower the first character flag.
  • If this character has the same type than the current buffer, just store it.
  • If it's not the same type, process the buffer to create a chunk, store it alongside its type then create a new buffer with the character and update the type.
The processing part of the loop should look like this:
Note the use of the ternary/trinary operator If() to determine the type of the character.

// Skip control codes ( ie. ASCII code 0-31 )
If theCharCode > 31 Then
  // Determine the character type
  Dim theCharType As StringChunk.Types = If( theCharCode > 47 And theCharCode < 58, _
      StringChunk.Types.Number, StringChunk.Types.Characters )

  // How to handle it?
  If isFirstCharacter Then
    // Handle the first real character ( ie. not a control code  )
    theBuffer.Append theChar
    theBufferType = theCharType
    isFirstCharacter = False

  Elseif theCharType = theBufferType Then
      // Add the character to the buffer
      theBuffer.Append theChar

  Else
    // the type has changed, store the chunk and its type
    If theBufferType = StringChunk.Types.Number Then
      theResult.Append New StringChunk( Val( Join( theBuffer, "" ) ), theBufferType )

    Else
      theResult.Append New StringChunk( Join( theBuffer, "" ), theBufferType )

    End If
        
    // Reset the buffer with the new character
    Redim theBuffer( 0 )
    theBuffer( 0 ) = theChar
    theBufferType = theCharType

  End If
      
End If

Right after the loop, the last chunk is processed if needed, and the result of split is returned as an array of StringChunk.

// Add the last buffer if needed
If theBuffer.Ubound > -1 Then
  // Store the chunk
  If theBufferType = StringChunk.Types.Number Then
    theResult.Append New StringChunk( Val( Join( theBuffer, "" ) ), theBufferType )

  Else
    theResult.Append New StringChunk( Join( theBuffer, "" ), theBufferType )

  End If
  
End If
  
// Return the array
Return theResult

Efficiency is intelligent laziness.

Sorting is comparing. And the more items you have to sort, the more comparisons you make. It also means that a string will be compared multiple times with other strings. In order to avoid redoing what has already be done, let's implement a simple cache to store the result of a natural split. For this purpose, we can use a dictionary, the one from the new framework: Xojo.Core.Dictionary. And to later retrieve the result of the splitting for a particular string, the key of choice is the string itself.

The modifications to our code are very simple. First, we have to add a property to NaturalSortLib module that will hold the said Dictionary.

Public Property pNaturalSplitCache as Xojo.Core.Dictionary

At the beginning of the NaturalSplit() method, we create an instance of the Xojo.Core.Dictionary if needed, otherwise we check if there already is an entry for the string to split.

  • If yes, we retrieve this entry and return it after a proper type casting
  • Otherwise, we process the entire string and store the result in the cache before returning it, so it will be available the next time we use this string.
In terms of optimization, we can test for empty string and return an empty array in this case. Now our NaturalSplit() method should look like this:

Private Function NaturalSplit(inString As String) as StringChunk()
  #pragma DisableBoundsChecking
  
  Dim theResult() As StringChunk
  
  // No need to handle an empty string
  If inString = "" Then Return theResult
  
  // Create the cache dictionary if needed
  If pNaturalSplitCache Is Nil Then pNaturalSplitCache = New Xojo.Core.Dictionary

  // Do we have this string in cache ?
  If pNaturalSplitCache.HasKey( inString ) Then 
    // We have it cached, retrieve it
    theResult = pNaturalSplitCache.Value( inString )
    
  Else
    // No cache entry, build the array from the string
    // Setup a few variables
    Dim theBuffer() As String
    Dim theBufferType As StringChunk.Types
    Dim theCharCount As Integer = inString.Len
    
    // Let's process each character one at a time
    // The output condition for the loop is when there is no more character to process
    Dim IsFirstCharacter As Boolean = True
    Dim i As Integer = 1
    Do
      // Retrieve the char and its code
      Dim theChar As String = inString.Mid( i,1 )
      Dim theCharCode As Integer = theChar.Asc
      
      // Skip control codes ( ie. ASCII code 0-31 )
      If theCharCode > 31 Then
        // Determine the character type
        Dim theCharType As StringChunk.Types = If( theCharCode > 47 And theCharCode < 58, _
            StringChunk.Types.Number, StringChunk.Types.Characters )
          
          // How to handle it?
          If isFirstCharacter Then
            // Handle the first real character ( ie. not a control code  )
            theBuffer.Append theChar
            theBufferType = theCharType
            isFirstCharacter = False
            
          Elseif theCharType = theBufferType Then
            // Add the character to the buffer
            theBuffer.Append theChar
            
          Else
            // the type has changed, store the chunk and its type
            If theBufferType = StringChunk.Types.Number Then
              theResult.Append New StringChunk( Val( Join( theBuffer, "" ) ), theBufferType )
              
            Else
              theResult.Append New StringChunk( Join( theBuffer, "" ), theBufferType )
              
            End If
            
            // Reset the buffer with the new character
            Redim theBuffer( 0 )
            theBuffer( 0 ) = theChar
            theBufferType = theCharType
            
          End If
          
        End If
        
        // Next character
        i = i + 1
        
    Loop Until i > theCharCount 
    
    // Add the last buffer if needed
    If theBuffer.Ubound > -1 Then
      // Store the chunk
      If theBufferType = StringChunk.Types.Number Then
        theResult.Append New StringChunk( Val( Join( theBuffer, "" ) ), theBufferType )
        
      Else
        theResult.Append New StringChunk( Join( theBuffer, "" ), theBufferType )
        
      End If
      
    End If
    
    // Add the entry to the cache
    // The cache can't be Nil here since it has been created
    // at the beginning of the method if it was missing
    pNaturalSplitCache.Value( inString ) = theResult
    
  End If
  
  // Return the array
  Return theResult
End Function

One More Thing...

Once the sort is done, the cache may no longer be needed. So we leave it to the developer to decide when to get rid of the cache content and to free memory by providing a protected method ClearNaturalSplitCache() that removes all the entries in the Dictionary.

Protected Sub ClearNaturalSplitCache()
  //-- Remove all entries in the split cache
  
  If Not ( pNaturalSplitCache Is Nil ) Then
    pNaturalSplitCache.RemoveAll
    
  End If
End Sub

Stay Tuned For Scenes From My Next Episode

In the third and last part of this series, we'll see how to compare two StringChunk instances and how it can help us to implement natural sort in a ListBox by using the ListBox.CompareRows() event.

Any comment or suggestion you may have is welcome. Feel free to use the form below to leave your comment.

Leave a Reply

Your email address will not be published. Required fields are marked *

six − three =