Pages

Tuesday 30 March 2010

Comparing Result Sets with Linq

Linq is really handy. Once you’re familiar with the syntax and the way it works, it becomes very useful.

Recently I had to write a front-end to compare lists of financial instruments coming from two different sources. One source was a SQL Server database, the other source was an in-house market data system.

The result set from the database was extracted by a stored procedure into a collection. The collection’s type is IEnumerable<Result1> where Result1 is a type automatically generated by the Linq designer using the stored procedure definition.

The other result set comes from the in-house market data system. It is extracted with an API that constructs Result2 objects and accumulates them into a collection List<Result2>.

The purpose of the little app is to show the instruments present in the first result set and not the other one. The types Result1 and Result2 are defined as follows:

 

/// <summary>
/// In my real project this class was generated by the Linq designer
/// </summary>
class Result1
{
    public string Isin { get; set; }
    public string Sedol { set; get; }
}

/// <summary>
/// Manually created class to store objects from the in-house market data system
/// </summary>
class Result2
{
    public string Isin { get; set; }
    public string Sedol { set; get; }
    public string Location { set; get; }
}

Fill both collections with some dummy data:

List<Result1> coll1 = new List<Result1>();
List<Result2> coll2 = new List<Result2>();

coll1.Add(new Result1 { Isin = "ABCD", Sedol = "123" });
coll1.Add(new Result1 { Isin = "EDFG", Sedol = "234" });

coll2.Add(new Result2 { Isin = "EDFG", Sedol = "234", Location = "Set2" });

I want to compute result1 MINUS result2. In Linq this is done with Except. So I’d like to write something like this:

var diff =
    (from c in result1
     select c).
    Except
    (from c in result2
     select c);

…but that does not compile because result1 and result2 have two different types. Therefore I need to do a projection, like this:

var diff =
    (from c in result1
     select c).
    Except
    (from c in result2
     select new Result1 { Isin = c.Isin, Sedol = c.Sedol });

The above compiles fine, but there is a problem: the result of the diff is the original result1 collection. It doesn’t remove the elements present also present in result2! This is because at this stage Except does not know how to test Result1 objects for equality.

Except takes an IEqualityComparer<T> as second argument. Let’s define one:

class Result1Comparer : EqualityComparer<Result1>
{

    /// <summary>
    /// Two elements are considered equal if they match on at least one identifier
    /// </summary>
    /// <param name="x"></param>
    /// <param name="y"></param>
    /// <returns></returns>
    public override bool Equals(Result1 x, Result1 y)
    {
        return ((x.Isin == y.Isin) || (x.Sedol == y.Sedol) );
    }

    public override int GetHashCode(Result1 obj)
    {
        // Just re-use Object's default GetHashCode
        return obj.GetHashCode();
    }
}

The diff now looks like this:

var diff =
    (from c in result1
     select c).
    Except
    (from c in result2
     select new Result1 { Isin = c.Isin, Sedol = c.Sedol },
     new Type1Comparer());

It’s still not working! Why? When Except compares the result sets, it uses the comparer’s GetHashCode first. If two objects have two different hashcodes then it doesn’t bother calling Equals().

In the code above the GetHashCode() override calls the default Object.GetHashCode(). Object.GetHashCode() is useless: it generates an index as a function of the object reference. Even if two objects have the same value Object.GetHashCode()returns two different hashes. And Except thinks the objects are different!

No need to burn too much brainpower on writing the hash function as long as:

  • it’s fast
  • two objects with the same value return the same hashcode.

The following should do:

class Result1Comparer : EqualityComparer<Result1>
{

    /// <summary>
    /// Two elements are considered equal if they match on at least one identifier
    /// </summary>
    /// <param name="x"></param>
    /// <param name="y"></param>
    /// <returns></returns>
    public override bool Equals(Result1 x, Result1 y)
    {
        return ((x.Isin == y.Isin) || (x.Sedol == y.Sedol) );
    }

    public override int GetHashCode(Result1 obj)
    {
        // Objects with different Isins will be considered as different
        // If 2 objects have same Isin then Equals is used
        return obj.Isin.GetHashCode();
    }
}

No the MINUS operation with Except works.

Limits of Linq:

The SQL-like way of comparing collections is elegant but it’s nothing more than a nice way of writing for loops. Don’t expect miracles on the performance side…

The best place to perform queries is still in the database.

In the case of the app described above performing queries on the client is just fine. However there are situations where importing into the database and having all the queries take place in SQL Server would be faster.

1 comment:

Walter Almeida said...

Yes Linq is a great tool, especially for a real uniform way to access (almost) any kind data. And yes you always have the compromise between performance and ease of coding, question is to choose the right level depending on your constraints. My way of thinking: always try to make the most of new technologies giving you higher level of abstraction and find the way to architect your applications using these technologies to still achieve your performance goals (of course some new technologies will happen to be bad and to be dropped..). If not doing so: we would still be coding in assembly languages.