Which is faster: whereHas vs join

Sep 28, 2022

I have heard multiple times that whereHas queries are slower than join queries, so I decided to test it to figure out if I should stop writing whereHas queries once and for all.

So let me show you what I learned from my "experiment" and when you should consider changing your whereHas queries to join.

The setup

First I needed a relationship between two models to test the speed of the queries. I decided to go with products and variants (think color, size, etc). So where are the models:

class Product extends Model
{
    public function variants()
    {
        return $this->hasMany(Variant::class);
    }
}

class Variant extends Model
{
    public function product()
    {
        return $this->belongsTo(Product::class);
    }
}

And here are the migrations:

return new class extends Migration
{
    public function up()
    {
        Schema::create('products', function (Blueprint $table) {
            $table->id();
            $table->string('name');
            $table->text('description');
            $table->integer('price');
            $table->timestamps();
        });
    }
}

return new class extends Migration
{
    public function up()
    {
        Schema::create('variants', function (Blueprint $table) {
            $table->id();
            $table->foreignId('product_id');
            $table->string('name');
            $table->timestamps();
        });
    }
}

As you can see, we have Product model and products have many Variants. Simple.

Seeding the data

Since I had a small suspicion that the way the data is distributed in the database would affect the results (more on that later) I decided to add 5000 products with one variant and 5000 without variants. So 10,000 products in total, half of them have variants and half of them don't.

class DatabaseSeeder extends Seeder
{
    public function run()
    {
        // with variant
        Product::factory(5000)->has(Variant::factory())->create();
        // without variant
        Product::factory(5000)->create();
    }
}

Comparing speeds

The first query I tested was the simplest one, a whereHas without condition and the equivalent join. After each query you can see the time they took to run in milliseconds:

// whereHas
Product::whereHas('variants')->get(); // 12.75ms
// join
Product::select('products.*')
    ->join('variants', 'variants.product_id', '=', 'products.id')
    ->get(); // 18.88ms

For this simple use case, the whereHas was a little faster than the join, which actually surprised me, as I said before I have heard that joins are faster than whereHas, and even though it wasn't by much, the whereHas was faster here, so that means that whereHas is not the slowest option in all cases.

For the second test, I decided to test filtering the products that have a specific variant, using the variant id. Here are the results:

// whereHas
Product::whereHas('variants', function ($query) {
            $query->where('id', 1);
        })->get();  // 0.48ms
// join
Product::select('products.*')
        ->join('variants', function ($query) {
            $query->on('variants.product_id', '=', 'products.id')
                ->where('variants.id', 1);
        })->get(); // 0.23ms

Both were pretty fast in this case, but the join was faster. This made me realize something, if you know how whereHas works under the hood, it creates a exists query to fetch the results:

select * from `products` where exists (select * from `variants` where `products`.`id` = `variants`.`product_id` and `id` = 1)

Inside this exists query there is another select query. After seeing the two first tests I realized that the whereHas performance is heavily affected by how fast that second select can find a row that matches the conditions.

With this theory in mind, I created another test, similar to the previous one, but instead of using the id I decided to use the name of the variant, since look ups by id are faster than look ups using a non-indexed column. Here are the queries with the results:

$variant = Variant::first(); // Use this variant to search by name

// whereHas
Product::whereHas('variants', function ($query) use ($variant) {
            $query->where('name', $variant->name);
        })->get(); // 11.46ms
// join
Product::select('products.*')
    ->join('variants', function ($query) use ($variant){
        $query->on('variants.product_id', '=', 'products.id')
              ->where('variants.name', $variant->name);
    })->get(); // 1.89ms

The whereHas took 11.46ms, compared to the 1.89ms from the join query, which is a ton faster. Using a non-indexed column made the whereHas a lot slower but the join was only slightly affected in comparison.

What would happen if we add an index to the name column in the variants table?

    public function up()
    {
        Schema::table('variants', function (Blueprint $table) {
            $table->index('name');
        });
    }

Here are the results after adding the index:


$variant = Variant::first(); // Use this variant to search by name

// whereHas
Product::whereHas('variants', function ($query) use ($variant) {
            $query->where('name', $variant->name);
        })->get(); // 1.02ms 
// join
Product::select('products.*')
    ->join('variants', function ($query) use ($variant){
        $query->on('variants.product_id', '=', 'products.id')
              ->where('variants.name', $variant->name);
    })->get(); // 0.76ms

The whereHas got a lot faster, while the join remained almost at the same speed. Which is interesting.

Changing the data distribution

So, we figured out that the whereHas performance is heavenly reliant on the select inside the exists clause.

So I wanted to test, what would be the speeds if it's very uncommon for a Product to have variants, so I cleaned my database and ran this seeder:

class DatabaseSeeder extends Seeder
{
    public function run()
    {
        // with variant
        Product::factory(1)->has(Variant::factory())->create();
        // without variant
        Product::factory(9999)->create();
    }
}

Only 1 out of 10,000 products has a variant. Here are the speeds for the join and the whereHas queries using this dataset:

// whereHas
Product::whereHas('variants')->get(); // 0.30ms
// join
Product::select('products.*')
    ->join('variants', 'variants.product_id', '=', 'products.id')
    ->get(); // 0.14ms

This time, the join is slightly faster than the whereHas. So the other way around than when half of the products had variants. Interesting.

So what about the other way around, what if out of 10,000 products only 1 didn't have a variant?

Again I cleaned my database and ran this seeder:

class DatabaseSeeder extends Seeder
{
    public function run()
    {
        // with variant
        Product::factory(9999)->has(Variant::factory())->create();
        // without variant
        Product::factory(1)->create();
    }
}

And here are the results for the queries using that dataset:

// whereHas
Product::whereHas('variants')->get(); // 26.79ms
// join
Product::select('products.*')
    ->join('variants', 'variants.product_id', '=', 'products.id')
    ->get(); // 36.57ms

We are back to where we started, the whereHas is faster than the join. This leads me to believe that the whereHas is faster if the condition is more common than not.

That also explains why in the other queries using the 'id' and 'name' the whereHas was slower, because most products didn't have that specific variant.

So which is better?

As our tests indicate, it depends, if the conditions you are using in your whereHas are more common than not, then it will probably perform better than the equivalent join.

If that is not the case (ie. You have conditions that make the result set very small) then the join will probably perform better, in some cases a lot better as we saw with our non-indexed name test.

But performance is probably not the only thing we need to take into account when writing queries. In my opinion join queries are a lot less readable than using whereHas because their intended use is to select columns from another table, not to filter results, even if they can be used for that end.

So I would keep using whereHas for the readability if I'm working with a dataset where the results will fetch most of the rows on my target model or if the amount of records in the database is small enough that the impact of changing that query to a join would be minimal.

On the other hand, I would use a join if I'm working with a big database and the conditions would retrieve only a small subset of the records in my database.

Conclusion

So should you change all your whereHas queries for join queries? The answer is no, both have advantages and disadvantages and should be used depending on the specifics of the situation.

Let me know if you liked this article or have any comments or questions by reaching out on twitter, I'm pretty active there :).