Case Study Of Differential Data Querying In .NET Core Web API

30 September 2020

As you may know we were asked to develop system for attendees of Economic Forum 2020. One of the client’s main requests was to allow mobile applications, communicating with main server, to read only the data that has been changed since last reading. We were given the read-only access to full dataset held on client’s legacy server (stored in XML file). Later that day we came up with a potential solution, shown below.

Presented diagram points out two main parts of the entire process.

Efficient (due to frequency of readouts) transforming entire data from legacy server into the structure that allows part 2 do its job.
Allowing users to fetch only the data that has changed after provided timespan.

For the API’s technology stack we chose C#, ASP.NET Core, Entity Framework Core and Hangfire as background jobs runner.

Part 1 – cyclic legacy data transformation

I will ignore the real XML’s content as it’s completely irrelevant to explain the main idea behind the solution. Let’s assume we import list of events from XML below:

<?xml version="1.0" encoding="UTF-8"?>
<events>
  <event id="1" startDate="2020-01-01 12:00" endDate="2020-01-01 13:45" topic="Topic 1">
    <participants>
      <participant fullName="Jon Doe" />
      <participant fullName="George Wahlin" />
    </participants>
  </event>
  <!-- more data -->
  <event id="10000" startDate="2020-01-11 12:00" endDate="2020-01-11 13:45" topic="Topic 10000">
    <participants/>
  </event>
</events>

That list contains information for the conference’s agenda, such as the topic, time and participants.

Our entity (which is also the entity in DbContext) representing single entry from incoming XML looks like this.

public class Event
{
    public int Id { get; set; }
    public int ExternalId { get; set; }

    public DateTime StartDate { get; set; }
    public DateTime EndDate { get; set; }
    public string Topic { get; set; }
    public List<string> Participants { get; set; } = new List<string>();

    public int Hash { get; set; }

    public DateTime ModifiedDate { get; set; }
    public DateTime? DeletedDate { get; set; }
}

Key factor in our scenario is ability to uniquely identify which entry from XML is which in our database. For that purpose we hold incoming id attribute under ExternalId property and use it as the key for this particular process. We could argue whether or not it should be Id of our entity but all this comes down to individual cases we design our system for.

Once we can successfully match incoming entry with our stored one, we need to tell what has changed. Is the entity new? Is it removed? Is it modified? All that can be easily done with set operators, and utilizing the ExternalId and (not yet explained) Hash property.

The idea behind introducing custom hash values is for efficient determining which data kept in our database differs, without producing bulky and time-consuming SQL queries to check all properties or, even worse, potential tree structures of related sub-objects. We need to remember that this process is being triggered every few minutes, so if it takes longer than that, it won’t be able keep up (as it can’t be ran in parallel). Core concept was to work just against two indexed columns to minimize not only the mentioned risk but also memory usage. But why did we introduce new artificial hash values and didn’t override existing ones? We needed to have complete control over marking given entity as different than its sibling in database, but that control was limited only to that process. Overriding provided .NET equality mechanism would make it very easy to introduce bugs very difficult to find.

Next question is: What does “different” mean?. Obviously, the two sibling events differ if, for example, their topics do, but what of participants? Order doesn’t matter in our case, thus our hash calculations must be independent of it.

All things considered, we can write function for hash calculation:

public int CalculateHash(Event e)
{          
    var hash = 17;
    var m = 23;

    unchecked
    {
        hash = hash * m + e.StartDate.GetHashCode();
        hash = hash * m + e.EndDate.GetHashCode();
        hash = hash * m + (e.Topic?.GetHashCode()).GetValueOrDefault(0);

        foreach (var participant in e.Participants
            .Where(p=>!String.IsNullOrEmpty(p))
            .OrderBy(p => p))
        {
            hash = hash * m + participant.GetHashCode();
        }
    }

    return hash;
}

We apply this function to all incoming entries from XML (initial values set to hash and m variables are arbitrary prime numbers). Once we have all data mapped and their hash values calculated, we can do three queries to the database to tell easily what’s new, modified and removed. Since we depend on ExternalId as the key in this process (that’s the assumption we have made prior to tackling this topic), we can write three simple queries to categorize incoming data. New data constitutes all entries with no siblings in our database. We don’t even need to check for their hashes at this point. Modified entries are those entries which do have siblings, but their hashes differ. Entries that have been removed are the ones that do not exist in the incoming data, yet they do in our database. Since all three checks are based on the data we have already stored, we can cache it with a single query, and continue from there.

var dbDataIds = await DbContext.Events.Select(e => new { e.ExternalId, e.Hash }).ToListAsync();

var inputDataIds = inputEvents.Select(e => new { e.ExternalId, e.Hash }).ToList();

Now that we have these two lists, we can calculate all previously mentioned sets. Note that we don’t fetch all columns from the database, just the bare minimum to do all calculations in memory to make it as fast as possible without unnecessary queries. Since we read only two integers per row, and we know (thanks to business analysis) the possible data volume in this table, we can safely do that and utilize server memory for quick processing.

var dataToAddIds = inputDataIds
    .Select(d => d.ExternalId)
    .Except(dbDataIds.Select(d => d.ExternalId));

var dataToRemoveIds = dbDataIds
    .Select(d => d.ExternalId)
    .Except(inputDataIds.Select(d => d.ExternalId));

var dataToModifyIds = dbDataIds
    .Where(e => inputDataIds.Any(d => d.ExternalId == e.ExternalId && d.Hash != e.Hash))
    .Select(e => e.ExternalId);

Having these lists of identifiers in place, we can easily find what to add, modify and delete. Except that we don’t actually just delete nor just update… Remember what the entity looked like? Apart from Hash property, it contains also two DateTime properties. These are the holy grail of point 2 working as intended and we mustn’t forget about setting them with correct values. You can check the entire function comparing incoming data and updating database below.

public async Task SyncEvents(IEnumerable<Event> inputData)
{
    var now = DateTime.Now;

    var dbDataIds = await DbContext.Events.Select(e => new { e.ExternalId, e.Hash }).ToListAsync();
    var inputDataIds = inputData.Select(e => new { e.ExternalId, e.Hash }).ToList();

    var dataToRemoveIds = dbDataIds
        .Select(d => d.ExternalId)
        .Except(inputDataIds.Select(d => d.ExternalId));

    var dataToAddIds = inputDataIds
        .Select(d => d.ExternalId)
        .Except(dbDataIds.Select(d => d.ExternalId));

    var dataToModifyIds = dbDataIds
        .Where(e => inputDataIds.Any(d => d.ExternalId == e.ExternalId && d.Hash != e.Hash))
        .Select(e => e.ExternalId);

    var dataToRemove = await DbContext.Events.Where(e => dataToRemoveIds.Contains(e.ExternalId)).ToListAsync();

    var dataToModify = await DbContext.Events.Where(e => dataToModifyIds.Contains(e.ExternalId)).ToListAsync();

    var dataToAdd = inputData.Where(e => dataToAddIds.Contains(e.ExternalId)).ToList();

    foreach (var dbE in dataToRemove)
    {
        dbE.DeletedDate = now;
        dbE.ModifiedDate = now;
    }

    foreach (var e in dataToAdd)
    {
        e.ModifiedDate = now;
        DbContext.Panels.Add(e);
    }

    var inputDataLookup = inputData.ToDictionary(e => e.ExternalId, e => e);
    foreach (var dbE in dataToModify)
    {
        var e = inputDataLookup[dbE.ExternalId];

        dbE.Topic = e.Topic;
        dbE.StartDate = e.StartDate;
        dbE.EndDate = e.EndDate;
        dbE.Participants = e.Participants;

        dbE.Hash = e.Hash;

        dbE.ModifiedDate = now;
        dbE.DeletedDate = null;
    }

    await DbContext.SaveChangesAsync();
}

Now that we have synchronized the data between XML and our DB, we can move to next point.

Part 2 – utilizing time stamps

Mobile applications use API method presented below to fetch data

public async Task<IActionResult> GetConferenceAgenda(DateTime? modifiedAfter)
{
    modifiedAfter = modifiedAfter.GetValueOrDefault();
    var now = DateTime.Now;

    var panels = await _conferencesService.GetEvents(e =>                
        e.ModifiedDate >= modifiedAfter && e.ModifiedDate < now, EventDTO.Selector);

    return Ok(new ConferenceAgendaData()
    {
        NextModifiedAfter = now,
        Panels = panels,               
    });
}

The idea behind ModifiedAfter parameter is to fetch only the data has that changed since last invocation. This method also has to work for initial request, thus we support null value. Passing in null defaults to 0001-01-01. We execute a simple query to get necessary data (notice we only work against ModifiedAfter property as it’s being set upon every add, update and soft deletion), transform this into DTO (Data Transfer Object), and return the result alongside timespan value the mobile app should use for its next request. Why is that NextModifiedAfter value important? It’s due to possible time differences between mobile and main server on which data resides. Without this synchronization we could hit an issue, when it’s the caller determining next value and its system clock is ahead of server’s one. In this case, if any data had been modified within the time fitting right in between server’s and mobile’s clocks, this data would not be brought back to the caller.

Returned DTO object looks like this:

public class EventDTO
{
    public int Id { get; set; }
    public bool IsDeleted { get; set; }

    /* more data */

    public static Expression<Func<Event, EventDTO>> Selector => e => new EventDTO()
    {
        Id = e.Id,
        IsDeleted = e.DeletedDate != null,
        
        /* more data */
    };
}

It’s just a simple class, but I would like to make two points.

The first point (irrelevant to this discussion but still worth mentioning) is the Selector property, which encapsulates SELECT done by the database (we return Expression<Func<T, TOut>> instead of just Func<T, TOut>).

The second point is the existence of IsDeleted property. Since we need to tell the caller which data was removed, we are enforced to use soft deletion mechanism on the server, use timestamps for differentiation, and bring necessary information back.