Perl LanguageSorting

Introduction

For sorting lists of things, Perl has only a single function, unsurprisingly called sort. It is flexible enough to sort all kinds of items: numbers, strings in any number of encodings, nested data structures or objects. However, due to its flexibility, there are quite a few tricks and idioms to be learned for its use.

Syntax

  • sort SUBNAME LIST
  • sort BLOCK LIST
  • sort LIST

Basic Lexical Sort

@sorted = sort @list;

@sorted = sort { $a cmp $b } @list;

sub compare { $a cmp $b }
@sorted = sort compare @list;

The three examples above do exactly the same thing. If you don't supply any comparator function or block, sort assumes you want the list on its right sorted lexically. This is usually the form you want if you just need your data in some predictable order and don't care about linguistic correctness.

sort passes pairs of items in @list to the comparator function, which tells sort which item is larger. The cmp operator does this for strings while <=> does the same thing for numbers. The comparator is called quite often, on average n * log(n) times with n being the number of elements to be sorted, so it's important it be fast. This is the reason sort uses predefined package global variables ($a and $b) to pass the elements to be compared to the block or function, instead of proper function parameters.

If you use locale, cmp takes locale specific collation order into account, e.g. it will sort Å like A under a Danish locale but after Z under an English or German one. However, it doesn't take the more complex Unicode sorting rules into account nor does it offer any control over the order—for example phone books are often sorted differently from dictionaries. For those cases, the Unicode::Collate and particularly Unicode::Collate::Locale modules are recommended.

Numeric Sort

@sorted = sort { $a <=> $b } @list;

Comparing $a and $b with the <=> operator ensures they are compared numerically and not textually as per default.

Reverse Sort

@sorted = sort { $b <=> $a } @list;
@sorted = reverse sort { $a <=> $b } @list;

Sorting items in descending order can simply be achieved by swapping $a and $b in the comparator block. However, some people prefer the clarity of a separate reverse even though it is slightly slower.

The Schwartzian Transform

This is probably the most famous example of a sort optimization making use of Perl's functional programming facilities, to be used where the sort order of items depend on an expensive function.

# What you would usually do
@sorted = sort { slow($a) <=> slow($b) } @list;

# What you do to make it faster
@sorted =
map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
map { [ $_, slow($_) ] }
@list;

The trouble with the first example is that the comparator is called very often and keeps recalculating values using a slow function over and over. A typical example would be sorting file names by their file size:

use File::stat;
@sorted = sort { stat($a)->size <=> stat($b)->size } glob "*";

This works, but at best it incurs the overhead of two system calls per comparison, at worst it has to go to the disk, twice, for every single comparison, and that disk may be in an overloaded file server on the other side of the planet.

Enter Randall Schwartz's trick.

The Schwartzian Transform basically shoves @list through three functions, bottom-to-top. The first map turns each entry into a two-element list of the original item and the result of the slow function as a sort key, so at the end of this we have called slow() exactly once for each element. The following sort can then simply access the sort key by looking in the list. As we don't care about the sort keys but only need the original elements in sorted order, the final map throws away the two-element lists from the already-sorted list it receives from @sort and returns a list of only their first members.

Case Insensitive Sort

The traditional technique to make sort ignore case is to pass strings to lc or uc for comparison:

@sorted = sort { lc($a) cmp lc($b) } @list;

This works on all versions of Perl 5 and is completely sufficient for English; it doesn't matter whether you use uc or lc. However, it presents a problem for languages like Greek or Turkish where there is no 1:1 correspondence between upper- and lowercase letters so you get different results depending on whether you use uc or lc. Therefore, Perl 5.16 and higher have a case folding function called fc that avoids this problem, so modern multi-lingual sorting should use this:

@sorted = sort { fc($a) cmp fc($b) } @list;