ideas@sullice.com

Tagging tens of thousands of Drupal entities in under half a second

1 August, 2023

To be precise, I just added an entity reference item to 20,263 webform submissions in 449 milliseconds to verify that statistic 😀.

This post is a technical but hopefully readable description of how you can add entity field items, en masse, without the need for a batch process, and how I arrived at that solution. But first, why did I need to figure that out in the first place?

Background

Since I left my day job and started building a Drupal-based SaaS company, I’ve been keeping the lights on by working with different clients to build and maintain Drupal applications. Recently, I had the opportunity to write a new module for IFundWomen (IFW), a mission-driven, for-profit organization that serves women entrepreneurs.

A primary line of their business is creating, receiving, and curating large numbers of grant applications. A process which ultimately leads to the award of much-needed capital to women-owned businesses. They’ve received tens of thousands of applications for their various grant opportunities over the years. Understandably, whittling down such large volumes of applications to a top-tier list of finalists is no small feat—curating these applications by hand takes many hours of effort. I was asked to implement a tool to accelerate that process.

Design considerations

One of the challenges of writing a tool for this task was that some applications are easy to disqualify programmatically: for example, was a necessary field left blank? Yet other qualifying factors are more subjective: for example, was the elevator pitch compelling? In other words, it’s not entirely automatable. In fact, it’s important to have a human-in-the-loop to ensure applications don’t slip through the cracks of a rigid, programmatic solution. Therefore, the goal was to design a tool to accelerate, not replace, the process of discovering the most qualified candidates.

We call factors like the ones mentioned above qualifications and while each qualification is assessed, potentially thousands of submissions need to be effortlessly disqualified. Here’s an example: IFW’s Universal grant application permits IFW to collect applications for multiple grants at once. A hypothetical toy manufacturing company would like to award a grant to parent entrepreneurs. The reviewers should be able to disqualify all submissions that didn’t answer Yes to the Do you have children? question with no more than a few clicks.

The module I built (alongside Peter Weber, Kay Thayer, and Thanh Uong) provides a custom interface for curating these submissions. The workflow proceeds like so:

Create an Application by choosing a webform from which to draw submissions
Create a Qualification which includes choosing pertinent form elements and a method of qualification, which is either manual or programmatic:
- If programmatic, a condition is configured such as [element] value is [at least] [number]
A paginated table of submissions and the selected values appears
- If the qualification is programmatic, it is pre-filtered to show submissions which meet the qualification criteria. This can be toggled to show only those that do not meet the criteria
Select one or more submissions
- If all are selected, a checkbox appears offering to apply the next step to all submissions, across all pages
Click the Disqualify button
Repeats steps 2–5 until only fully-qualified submissions remain
Manually review and select finalists from this list of submissions

The seemingly minor checkbox that appears on step 4a created the problem that motivated this post. That checkbox is a necessary convenience, but it creates a big problem: If selected, pressing the Disqualify button triggers a process that may need to update tens of thousands of entities, yet, it needs to feel fast to the user. Therefore, that process needs to be executed in the most efficient manner possible.

Technical details

A submission is disqualified when one of its entity reference fields (named disqualifications) references one of the Qualification entities created in step 2¹. Therefore, to disqualify an arbitrary number of submissions, a new entity reference item must be inserted into the entity reference field’s table for each and every one of the selected submissions. This is essentially how Drupal tagging works. The only difference is that instead of referencing taxonomy term entities, submissions reference qualification entities.

Since Drupal’s entity and field APIs do not afford methods for updating multiple entities at once, a custom solution was required. Conveniently, the item’s value is the same².

Drupal’s Batch API is a great tool for processing large volumes of work that might exceed PHP’s max execution time limit. My first iteration took this conventional approach. It worked, but slowly. Updating hundreds of entities took more than a minute to complete, which did not provide for a pleasant user experience and that experience quickly degraded as the number of entities to be updated grew arbitrarily large. Not only was a custom solution required, we needed an optimized one too.

Restating the problem: updating thousands of entities is too slow. Moreover, there are no APIs for doing that and using the Batch API still isn’t quick enough. What limits the Batch API? It’s slow because every entity update requires two round-trips to the database—one to fetch the entity data and one to save a new value. By loading entities in batches, the Batch API can address half the problem since it still must update them one-by-one. In other words, even in the best-case scenario, the Batch API can only reduce the work by 50%.

That left direct database interaction as the last, best option. By doing so, tens of thousands of rows could be updated in a single route-trip. To understand the specifics of that solution, you’ll need to understand how Drupal stores field data (no worries, it’s summarized below).

The most practical drawback of interacting directly with the database is that it works entirely outside of the Drupal entity and field APIs. That means things like cache invalidation are left as an exercise for the reader. It also introduces a tight coupling to the field’s database schema—and if the schema changes upstream—your code may abruptly stop working or cause data integrity issues³.

Diving down the stack

In the typical case, Drupal’s field values are stored in unique database tables. One table per field, per entity type. For example, if you add a term reference field to a content type with the machine name field_tags, Drupal will create a database table named node__field_tags⁴. These field tables always have the following columns and comments:

bundle: The field instance bundle to which this row belongs, used when deleting a field instance
entity_id: The entity id this data is attached to
delta: The sequence number for this data item, used for multi-value fields
…omitted…: A few other columns not relevant to this post, such as deleted and langcode

These tables also have one or more columns specific to the field type they store. For example, entity reference fields have a {field_name}_target_id column which is used to store a reference to another entity. This is how taxonomy term references work and it is how our module’s disqualifications field works: if an application references a Has website qualification, it has been dis-qualified for not having a website.

Here is a representation of the disqualifications table with some example data:

`bundle`	`entity_id`	`delta`	`disqualifications_target_id`	`…`
`grant`	`10`	`0`	`is_incorporated`	`…`
`grant`	`10`	`1`	`has_website`	`…`
`grant`	`42`	`0`	`is_incorporated`	`…`
`…`	`…`	`…`	`…`	`…`

The table above shows two disqualified submissions to the grant webform with IDs 10 and 42. Submission 10 was disqualified for two reasons: the applicant hadn’t incorporated a business yet and the application did not link to a website. Submission 42 was only disqualified because the applicant hadn’t incorporated yet, as before.

Hey! If you’d like us to help solve your technical problems too, don’t hesitate to reach out 😀 Just shoot us an email

Implementing a solution

Now, assume that both submissions must also be disqualified for a missing elevator pitch. Naïvely, one could execute the following query:

$connection = \Drupal::database();
$query = $connection->insert(‘field_table_name’)
  ->fields([
    ‘bundle’,
    ‘entity_id’,
    ‘disqualifications_target_id’,
  ]);
foreach ([10, 42] as $submission_id) {
  $query->values([
    ‘bundle’ => ‘grant’,
    ‘entity_id’ => $submission_id,
    ‘disqualifications_target_id’ => ‘has_elevator_pitch’,
  ]);
}
$query->execute()

// Which compiles to:
// INSERT INTO field_table_name (bundle, entity_id, disqualifications_target_id)
// VALUES
//   (‘grant’, 10, ‘has_elevator_pitch’),
//   (‘grant’, 42, ‘has_elevator_pitch’);

Unfortunately, this would not work since the delta column value is required.

But what delta should the query insert? 0 will fail because it will conflict with the existing values for both submissions, 1 will conflict with submission 10’s has_website reference. In this specific case, 2 would work, but it wouldn’t work if another submission already references a third qualification. The delta value needs to be at least one more than the highest delta value for each of the entities.

Does this mean that we must query for each entity’s highest delta and execute an insert for each one? While that would work, it would be very inefficient. The system would need to execute a database write operation for each entity and doing so would take increasingly more time for every additional entity to be updated. This would not be much faster than loading, updating, and saving each entity one-by-one in a batch process.

My solution was to execute a single SELECT query using the GROUP_CONCAT aggregate function in order to construct a single INSERT query. It proceeds like so:

SELECT all rows from the field table. Each row represents a single field item
GROUP the items by the delta value using the GROUP_CONCAT function
Find all submission IDs to be disqualified that do not have a field item in the 0 bucket
Record those IDs in an array using 0 as the array index
Find all submission IDs to be disqualified that did have a field item in the 0 bucket, but do not have a field item in the 1 bucket
Record those IDs in an array using 1 as the array index
Repeat steps 4–7, incrementing the delta and index each iteration, until there are no submission IDs that have not been recorded in the array

Using the example data from above, the resulting array would be serialized like so:

[
  0 => [], // No submissions need a 0 delta (IOW, all entities have at least one item already)
  1 => [42], // This means: submission 42’s next delta should be 1.
  2 => [10], // And submission 10’s next delta should be 2.
]

The solution then iterates over the constructed array to build an INSERT query. First, it adds values for all the submissions that do not have any pre-existing references using the index, 0, as the delta. Next, it adds values for all submissions with a single reference using the index 1 as the delta. So on and so forth. Programmatically, the solution reads:

// Note: This code is simplified for clarity.

$connection = \Drupal::database();
$query = $connection->select('webform_submission_disqualifications', 'd')
  ->fields('d', ['delta'])
  ->condition('d.entity_id', $submission_ids, 'IN')
  ->groupBy('d.delta');
$query->addExpression('GROUP_CONCAT(d.entity_id)');
$result = array_map(
  fn (string $ids) => array_map('intval', explode(',', $ids)),
  $query->execute()->fetchAllKeyed()
);
$delta = 0;
while (!empty($submission_ids)) {
  $ids_with_delta = $result[$delta] ?? [];
  $delta_ids[$delta] = array_diff($submission_ids, $ids_with_delta);
  $submission_ids = $ids_with_delta;
  $delta++;
}
$query = $connection->insert('webform_submission_disqualifications')
  ->fields([
	'bundle',
	'entity_id',
	'revision_id',
	'langcode',
	'delta',
	$column_names['target_id'],
  ]);
$submission_ids[$delta] = $previous;
foreach ($submission_ids as $delta => $ids) {
  foreach ($ids as $id) {
    $query->values([
      'bundle' => ‘grant’,
      'entity_id' => $id,
      'delta' => $delta,
      ‘disqualifications_target_id’ => ‘has_elevator_pitch’,
    ]);
  }
}
$query->execute();

// Using the example data, this compiles to:
// INSERT INTO field_table_name (bundle, entity_id, delta, disqualifications_target_id)
// VALUES
//   (‘grant’, 2, 1, ‘has_elevator_pitch’),
//   (‘grant’, 1, 2, ‘has_elevator_pitch’);

As you may have already noticed, this technique does have its limitations, like lack of translation and revision support or automatic cache invalidation⁵, but they can definitely be worth the tradeoff as long as they’re carefully considered.

This technique helped provide an impressively quick user experience. But, perhaps best of all, working on this solution gave me a chance to learn some things about the Drupal field system along the way.

I hope you did too!

Actually, the field is named submission_review_disqualifications since the custom module that provides the field is named submission_review. ☝️ And that’s why custom fields are always named field_{something}—they’re provided by the field module! ^[return]
The technique described in this post will not work as-is if the field values must be different between entities, nor will it work as-is for revisioned or translated entities. ^[return]
If you do not own the field type, you should consider adding a unit test to ensure that the upstream schema doesn’t change without you being alerted to that fact as soon as possible. ^[return]
This is not always true. Single-cardinality base fields are stored in columns on the entity type’s field data table. For example, node_field_data stores the node’s type, status, and uid field values, among others. ^[return]
These might be big limitations depending on your needs, but they aren’t inherent to the problem—they’re mostly due to the lack of an API in Drupal core itself. I think a new interface could be afforded and a more robust implementation could be added in a handler class like SQLContentEntityStorage. The blocker to that is finding the time, need, and expertise. ^[return]