../fallback-as-ordering

Published on: 2024-09-19

Fallbacks as Implicit Ordering

Fallbacks as Implicit Ordering
- Introduction
- Naive Fallback Logic
- Canonical Fallback Logic
  - Getting Concrete
- Conclusion
  - Credits

> wc -w ~ 1852
Reading time estimate ~ 10 minutes

Introduction

A common thing in software engineering is to fallback through a series of outcomes depending on a condition. This blog post aims to discuss some different ways to express this in Scala, and maybe suggest some forms of expression not all Scala programmers have thought about.

Naive Fallback Logic

Consider the following sketch of code, ignoring for now any concrete situation that would lead us to write it:

if (thisThing) someOutcome
else if (that) anotherOutcome
else if (another) thenThisOtherOutcome
else someDefaultOutcome

We could call this kind of code naive fallback¹. Here, there are some finite number of conditions, and we want to test each in order, returning the appropriate outcome depending on which one in fact holds.

For example, it could be the case that we have a variety of data sources we're ingesting from, and depending on what the source of the data was we need to delegate to a different function, but some data sources take priority over others. Imagine, for the sake of argument, that different sources also imply different steps to be taken before they are finally stored in whatever canonical source of truth we are working with.

Consider a function like the following that given four pieces of data from four sources takes the most preferable, relative to the author's criteria (whatever those were).

// We might imagine we have two different external sources with different freshness
// and two different internal procedures with different levels of confidence, for example.
def getFreshest[A](
  externalFresh: Option[A],
  externalStale: Option[A],
  internalHigh: Option[A],
  internalLow: Option[A]
): Option[A] =
  if (externalFresh.isDefined) externalFresh
  else if (externalStale.isDefined) externalStale
  else if (internalHigh.isDefined) internalHigh
  else internalLow

This code is fine in the sense that it does what it intends to do, but I don't think it's very idiomatic.

Importantly, there are not a lot of types here to help us. The concept of the different sources is left entirely implicit in the names of the arguments, something that is not meaningful to the compiler.

As such, it is not very self documenting to future readers, and it is pretty flabby from a correctness standpoint. Toward that latter perspective, it would be quite easy to mix up the priority between different sources.

We could start cleaning up by reifying our sources.

Something like this maybe:

sealed trait DataSource
object DataSource {
  case object ExternalFreshest extends DataSource
  case object ExternalStale extends DataSource
  case object InternalHighConfidence extends DataSource
  case object InternalLowConfidence extends DataSource
}

def getFreshest[A](sourced: Map[DataSource, A]): Option[A]=
  sourced.get(DataSource.ExternalFreshest)
    .orElse(sourced.get(DataSource.ExternalStale))
    .orElse(sourced.get(DataSource.InternalHighConfidence))
    .orElse(sourced.get(DataSource.InternalLowConfidence))

This code has a lot of advantages. For example, it's readily apprehensible to most readers directly and requires few new concepts to follow. It also is a bit more declarative in style than the first version we looked at. Since we used a Map, we're guaranteed to have at most one value for each source, so nothing weird can happen. I think most would find it reasonably idiomatic.

It has disadvantages, too, however. Importantly, if we were to add a new data source it might be easy to forget to update, and the compiler would not warn us as we have no exhaustivity checking in this function.

You might think that's fine with just one of these functions. After all, how hard could it be to remember to update one function? Think though: what if we had not one of these functions but dozens, a whole system of such functions?

That might easily occur if we had to fetch values from structured nested data from each source, any field of which could be missing in one or another source, making it impracticable to build new maps for each field.

Clearly, reproducing this check in each one would inevitably lead to error somewhere. All it would take is mixing up one case in one place.

Ideally, we would have a way to make the fallthrough logic canonical across the whole system, and if possible make its exhaustivity compiler checked.

Canonical Fallback Logic

To recap, our goals are:

to define the relationship between our sources only once
to have that relationship be compiler checked for exhaustivity
to be able to pull arbitrary values from arbitrary depths in some nested data relative to the relationship between our sources

As is often the case in functional programming, the first step is to take something implicit and make it explicit. We'll do that by defining an order for our sources.

The library cats defines a typeclass for orders like so:

trait Order[@sp A] extends Any with PartialOrder[A] { self =>
  def compare(x: A, y: A): Int
}

This is meant to capture the concept of a total order which is a relationship amongst all elements such that all are comparable, the relationship is transitive, and the relationship is antisymmetric².

We can quickly define our order using the by helper provided by cats.

import cats.kernel.Order
import cats.instances.int._

sealed trait DataSource
object DataSource {
  case object ExternalFreshest extends DataSource
  case object ExternalStale extends DataSource
  case object InternalHighConfidence extends DataSource
  case object InternalLowConfidence extends DataSource

  // This is the new bit
  implicit val order: Order[DataSource] = Order.by({
    case ExternalFreshest => 4
    case ExternalStale => 3
    case InternalHighConfidence => 2
    case InternalLowConfidence => 1
  })
}

This new instance defines the order relation on our data sources in terms of their assigned priority, represented as an integer. A higher priority means the data is to be preferred if it comes from that source.

By doing this, the mapping between data sources and their priority is centralized and made canonical. We have done it once for all time.

Since we defined the mapping into the integers with a pattern match, this mapping is also exhaustivity checked by the compiler. If we were to add a new datasource, the compiler would complain that we haven't explained how it's priority is to be assigned.

This all seems pretty good! We've got two of our goals already, right here. But from another perspective, it might appear to a reasonable person that we've taken a step backwards. Afterall, it might not be obvious how we would actually use this to decide what data we want.

Let's start with the simplest case, and just rewrite our previous function. We might start by trying this:

def getFreshest[A](sourced: Map[DataSource, A]): Option[A] =
  sourced.toVector.maximumByOption(_._1).map(_._2)

What happened here is straightforward enough - we first convert to a Vector because Vector implements Foldable. The Foldable API includes the method maximumByOption that we used, which leverages our Order instance to assign the priority and find the highest priority element³. We then peeled the actual value we wanted off of the tuple of a data source and value.

Getting Concrete

This is good progress, but let's pose ourselves a concrete problem so that we can think through how to extend this to a more complicated situation. Let's imagine we have a service that integrates with a couple music platforms, perhaps Spotify and Tidal⁴, and for a couple others, like Apple Music and YouTube Music, we have internal estimates. It's a bit silly, but imagine for whatever reason we want to take one report as canonical, and treat the rest as less representative⁵.

We can define some sources:

sealed trait ListeningService
object ListeningService {
  sealed trait External extends ListeningService
  object External {
    case object Spotify extends External
    case object Tidal extends External
  }
  sealed trait Internal extends ListeningService
  object Internal {
    case object AppleMusic extends Internal
    case object YoutubeMusic extends Internal
  }
  // Convenience collection of values
  val values: Vector[ListingService] =
    Vector(
      External.Spotify,
      External.Tidal,
      Internal.AppleMusic,
      Internal.YoutubeMusic
    )

And then break out their priority:

  // Continuing the snippet above:
  // Assign our priority order to listening sources
  // Imagine that maybe we prefer Spotify data to Tidal data due to better user coverage
  // and we prefer AppleMusic to YoutubeMusic internal estimates b/c our model is better
  // if it strikes you as implausible, try to treat it abstractly
  implicit val order: Order[DataSource] = Order.by({
    case ListeningService.External.Spotify => 4
    case ListeningService.External.Tidal => 3
    case ListeningService.Internal.AppleMusic => 2
    case ListeningService.Internal.YoutubeMusic => 1
  })
}

And finally start sketching out some data reported by those sources:

// In this schema, `None` means "cannot report" while an empty `Map` would
// for example _positively report_ that no listens occurred.
// I won't pretend this representation is maximially felicitous, it's just meant to draw
// out the relevant idea. For example, the `Int` counts should really be `NonNegative`,
// and artists names probably shouldn't be `String`.
// (This is also more homogeneous in structure that you would probably see in practice).
sealed trait Genre
object Genre {
  case object Rap extends Genre
  case object Indie extends Genre
  case object Country extends Genre
  case object Metal extends Genre
}
case class GenreDatum(
  genre: Genre,
  timeListened: Option[Int],
  songs: Option[Map[String, Int]],
  artists: Option[Map[String, Int]]
)
case class ListeningReport(
  timeListened: Option[Int],
  genreData: Map[Genre, GenreDatum]
)

Now we want a way to package up a piece of information with it's source:

case class SourcedData[A](source: ListeningService, value: Option[A])
object SourcedData {
  implicit def order[A] = {
    implicit val preferPresence: Order[Option[A]] = Order.by({
      case Some(_) => 1
      case None => 0
    })
    Order.by({ case SourcedData(s, v) => (v, s) })
  }
}

Basically, we'll say to rely on the prioritizing presence over absence in the option first⁶, and then the priority of the source, in lexicographic order. Why we say that, should become clear in a second.

Here's how we can get the freshest country listening time for our user:

def countryListeningTime(reports: Map[ListeningService, ListeningReport]]): Option[Int] =
  ListeningService.values.map(source =>
    SourcedData(
      source,
      for {
        report <- reports.get(source)
        country <- report.genreData.get(Genre.Country)
        timeListened <- country.timeListened
      } yield timeListened
  ).maximumOption
   .flatMap(_.value)

There you have it. For each service, we try to get the relevant data and associate it with that source. If data is present, then it will have priority over missing data, because we defined Some to have priority over None. Then, for those that are present, the priority of the source will determine which takes priority.

So, the maximum will be the one that is present with highest source priority, and if none are present we get no answer, as desired. Actually, we can make this a simple general purpose utility:

type PriorityFetcher[A] = Map[ListeningService, ListeningReport] => Option[A]
def priorityStatisticFetcher[A](fetcher: ListeningReport => Option[A]): PriorityFetcher[A] =
  reports =>
    ListeningService.values.map(source =>
      SourcedData(
        source,
        reports.get(source).flatMap(fetcher)
      )
    ).maximumOption
     .flatMap(_.value)

// Turning the last example into:
val countryListeningTime: PriorityFetcher[Int] =
  priorityStatisticFetcher(
    _.genreData.get(Genre.Country)
      .flatMap(_.timeListened)
  )

So, with a little bit of centralization, we can write some machinery allowing for every one of our priority fallback data fetching functions to be one liners. Not so bad! Expecially if, as I threated above, we had a few dozen such!

Conclusion

In this blog post I tried to motivate a way of modeling fallback logic that centralizes the priority structure, and introduces greater explicitness into the text of the code.

Note that I didn't suggest you should use this in every single situation, on the contrary, I think if you introduced an Order instance for every single fallback expression you would justly be accused of over-engineering! Sometimes you just need an if or a match.

On the other hand, I hope I suggested the power of this approach when there are repeated implicit reliances on some kind of ordering.

I think this points to a general lesson of typed functional programming - taking something implicit and reifying it into the text of the code as something explicit can often produce benefits in clarity and expressive power.

Thanks for reading!

Credits

Thanks to Jenifer Carter for comments on an earlier draft.

Though, for the record, as I hope the rest of the discussion makes clear, naive is not here meant as a pejorative.

If these terms aren't familiar to you, they amount to the following essentially:

if x ≤ y and y ≤ x, then x = y (antisymmetry)
if x ≤ y and y ≤ z, then x ≤ z (transitivity)
for all x, y either x ≤ y or y ≤ x (totality)

Which, if you think about natural numbers, should feel fairly familiar as rules.

If some of the mechanics of the typeclass machinery and implicits confuse you, don't worry you're in good company. There are some good blog posts out there on the subject, and I'm planning to write one shortly. Until then, maybe you can try some of these.

⁴

Does that still exist or did Jay-Z give up? I won't be looking it up.

⁵

Probably a more obviously natural domain is financial data, like transaction data, but I'm avoiding that kind of example on purpose. You can close that loop in your own mind if you like.

⁶

You might justifably have worries about the lawfulness or naturalness of this instance - indeed it is not lawful. The reason it is not is straightforward, we lose the information contained in the Some cases. If you want, you can run something like this test that validates that it is lawful except for antisymmetry! That means it is actually a total preorder.

While the API requires an Order and there is no Preorder typeclass in cats, you can define a maximal element in a total preorder easily enough. The tradeoff there being that there might be more than one maximal element in the set. In this concrete case, that is not possible because our data is keyed off the source, and those are exhaustive and represented at most once in any given call to maximumOption. Whether or not you are cool with this bit of legerdemain is a matter of taste I think. For what it is worth, I aim to be lawful whenever possible - the presentation here is meant to focus on the meaningful elements of the instance.

If you are very concerned with lawfulness, you can use the default Order for options, since in this case it cannot alter the outcome (again due to the unique keys of the source):

// This is equivalent to `cats.instances.option.catsKernelStdOrderForOption`
implicit def orderForOption[A: Order] = Order.from({
  case (Some(a), Some(b) => Order[A].compare(a, b)
  case (Some(_), None) => 1
  case (None, Some(_)) => -1
  case (None, None) => 0
})

You'd probably want to write a comment about that uniqueness invariant though, so people know how to use the machinery.

/scala/ /functional-programming/ /order/ /design/