DRUPAL TOOLBOX

LET’S FIX STUFF AND BUILD THINGS. MISTAKES ARE OK.

Consolidating Content Fields in Drupal

At DrupalCon Austin, Kristen Pol presented a practical and inspiring session called  "Drupal Site Tuneup - Vroom! Vroom!"  She encouraged us to go through our sites and perform regular cleanup, even if that involved tackling unwieldy content structures.


Redundant/Duplicate Fields: Inefficiency and Sorrow


One of our early Drupal sites was created with two date fields for different content types, and we carried the same hierarchy of types and fields into a later project with similar content.  We had 90,000 nodes by the time we realized that our various date fields, separated by content type, served the same purpose and were in fact redundant.  This structure was not just awkward; it was beginning to affect content listings.  I wrote a couple of Views query alterations to coalesce two fields so that content types would sort together peaceably.
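The effect of that coalescing can be sketched outside Drupal.  What follows is a minimal Python model, not Views or Drupal code: nodes are plain dictionaries, and the field names "date_1" and "date_2" are just the example names used later in this post.  Each node sorts by whichever date field it actually has.

```python
from datetime import date

# Hypothetical stand-ins for nodes of two content types,
# each content type using its own date field.
nodes = [
    {"nid": 1, "type": "a", "date_1": date(2013, 11, 30)},
    {"nid": 2, "type": "b", "date_2": date(2014, 1, 15)},
    {"nid": 3, "type": "a", "date_1": date(2014, 3, 1)},
]


def coalesced_date(node):
    """Return the first date field the node has, like SQL COALESCE(date_1, date_2)."""
    value = node.get("date_1")
    if value is None:
        value = node.get("date_2")
    return value


# Sort newest first across both content types.
# (Assumes every node has at least one of the two dates.)
sorted_nids = [n["nid"] for n in sorted(nodes, key=coalesced_date, reverse=True)]
```

Having to do this in every listing is the cost that a single canonical field removes.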


We needed to combine these fields into a single canonical date field.  It was an intimidating prospect: we'd have to get this conversion done on a live site seeing a fair amount of traffic, so we had to avoid significant slowdowns and protect existing data from loss.


VBO and Rules to the Rescue


We settled on Views Bulk Operations (VBO) + Rules as the safest approach.  VBO can process nodes on demand or as a background batch operation.  Running on demand lets you process one or two nodes immediately to make sure everything's working correctly; once you're ready to go, the full operation can be enqueued and worked through over successive cron runs (at least for smaller operations).


Since Rules can perform all kinds of actions on nodes (optionally under specific conditions), and since VBO can execute Rules components directly on the nodes returned by a Views query, the combination is extremely useful for safely testing and performing field changes and updates.


Of course, you can simply write your own batch operation in a custom module, but the Rules + VBO method provides easy GUI-based configuration and testing.  More importantly, since the actual data handling is managed via pre-existing modules and code (Rules actions and Views queries), you're much less exposed to the faulty queries and data loss that can come from requesting or saving data incorrectly when calling API functions directly.


If you're comfortable with writing your own batch operations, go for it.  This post will discuss the kinder, gentler GUI option for cases where a safety net is more important than speed (as in our example case), or where working via the admin interface is more convenient.


Basic Caution


We encountered no difficulties using this method - but do make backups during this process in case of any data loss.


Procedure


Here are the general steps for combining several existing fields.  In this case we're assuming redundant, essentially identical fields across separate content types.  (Note: we combined different types of date fields - e.g. ISO, timestamp - with no problems.)


It may be easiest to select the field on your largest node set as the canonical field.  If field "date_1" belongs to 10,000 nodes on content type A, and content type B and C have only a few hundred nodes each, "date_1" is probably the best candidate for your consolidated data.  You could also create a new field for all three types, if you wanted to change the original field settings.
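As a rough illustration of that choice, here is a small Python sketch.  The node counts and the third field name, "date_3", are made up for the example; in practice you would read the counts off your own site.  It picks the field attached to the most nodes and reports how many nodes would then need their data copied.

```python
# Hypothetical node counts per field and content type; the numbers are illustrative only.
field_usage = {
    "date_1": {"type_a": 10000},
    "date_2": {"type_b": 300},
    "date_3": {"type_c": 450},
}


def pick_canonical(usage):
    """Pick the field attached to the most nodes, so the fewest nodes need copying."""
    totals = {field: sum(counts.values()) for field, counts in usage.items()}
    canonical = max(totals, key=totals.get)
    nodes_to_migrate = sum(totals.values()) - totals[canonical]
    return canonical, nodes_to_migrate


canonical, nodes_to_migrate = pick_canonical(field_usage)
```

With these numbers, "date_1" is chosen and 750 nodes need migrating.  Creating a brand-new field instead would mean migrating all 10,750.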


For each content type that needs its data moved to the new field:


    Add the canonical field to the content type

    If editors will be working on content during this time, you may want to relabel or "hide" the new field on node edit pages (e.g. in its own fieldgroup).

    Create a new "copy field" rules component taking a node as a parameter.

        Add condition: node has [canonical field] (e.g. "date_1")

        Add condition: node has [redundant field] (e.g. "date_2")

        Add action: set data value: [redundant field] => [canonical field]

    If more than a few nodes of this type will be created during the consolidation, you may want to create a rule which runs this "copy field" component as an action whenever a node of this type is saved.  This saves you from having to go back and update any new nodes after the initial conversion is complete.

    Set up a view which loads all nodes of this content type.  Display results as a table, adding both the new and legacy date fields (so that you can observe which rows have been completed - those will have entries in both date field columns).

    Add a VBO operation to this view that executes the "copy field" component.  If you created the on-save rule described above, you can simply (re)save the node with the "save" operation instead.  Select "enqueue" to run the operation in the background, or leave it unchecked to run the operation on demand.

    Run the consolidation for this content type from the view page.

    If you're enqueuing the operation, you can use the "select all XXXX rows" button and allow the operation to run in small batches on each cron run until complete.  Otherwise, you can set the views pager to the number of items you're willing to process at once (large batch operations can really slow down a site), and select and process all rows on each page, one page (set) at a time.
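The logic of the "copy field" component from the steps above can be modeled in a few lines of Python.  This is a sketch of the two conditions and one action, not Rules or Drupal code: a node is a plain dictionary, a field that is attached but empty is represented as None, and the extra guard against empty or already-copied values is my own assumption rather than part of the procedure.

```python
def copy_field(node, canonical="date_1", redundant="date_2"):
    """Model of the "copy field" component: two conditions, then one action.

    Returns True if the node was changed.
    """
    # Condition: node has the canonical field (it is attached to the content type).
    if canonical not in node:
        return False
    # Condition: node has the redundant field.
    if redundant not in node:
        return False
    # Extra guard (an assumption, not in the steps above): skip empty legacy
    # values and nodes where the copy has already been made.
    if node[redundant] is None or node[canonical] == node[redundant]:
        return False
    # Action: set data value, redundant field => canonical field.
    node[canonical] = node[redundant]
    return True


node = {"nid": 7, "date_1": None, "date_2": "2014-06-03"}
changed_first = copy_field(node)  # copies the legacy value across
changed_again = copy_field(node)  # nothing left to do on a second run
```

Because a second run changes nothing, it is safe in this model to reprocess a page of rows if you lose track of where you stopped.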


Once all your content types have the new field filled with legacy field data, you can give the new field prominence on edit pages (if you relabeled or moved it before), and you are free to change node display fields and views fields/sorts/filters to reflect the new, consolidated field.
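Before switching displays and views over, it is worth confirming that nothing was missed.  The check amounts to "find nodes that have a legacy value but no canonical value", sketched here in Python with nodes as plain dictionaries.  The field names and sample data are hypothetical; on a real site the same check could be a filter on the table view described above.

```python
def unconverted(nodes, canonical="date_1", redundant="date_2"):
    """Return the nids that have a legacy value but no canonical value yet."""
    return [
        n["nid"]
        for n in nodes
        if n.get(redundant) is not None and n.get(canonical) is None
    ]


nodes = [
    {"nid": 1, "date_1": "2014-06-03", "date_2": "2014-06-03"},  # converted
    {"nid": 2, "date_1": None, "date_2": "2014-02-11"},          # still pending
    {"nid": 3, "date_1": None, "date_2": None},                  # nothing to copy
]
pending = unconverted(nodes)
```

An empty result is the signal that the content type is done.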


Notes and Considerations


Because our case involved a very large number of nodes, we did not delete the original "legacy" date fields.  You may be able to do this safely, but do make backups first.  We wanted to observe the data for some time afterwards before doing anything irreversible.


If you're running a VBO operation as a background batch process, you can use the Queue UI module to observe progress - and to halt it if needed.


We noticed increasingly sluggish server response times a few days after we began a background batch operation on several tens of thousands of nodes. We weren't able to prove it was the operation, but halted it via Queue UI and switched to processing batches "live", 500 at a time.
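Processing "500 at a time" is simply fixed-size chunking.  A minimal Python sketch, using a made-up list of 1,200 node IDs, shows how a large set splits into runs of at most 500:

```python
def chunks(items, size=500):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


nids = list(range(1, 1201))  # 1,200 hypothetical node IDs
batch_sizes = [len(batch) for batch in chunks(nids)]
```

Here the batches come out as 500, 500 and 200.  In the VBO setup described above, the views pager size plays the role of the chunk size.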


Occasionally, I have found that a VBO view doesn't complete execution of a rules component when the operation is run, but does correctly run components triggered by a "Save" operation.  I'm not sure why this is - but as noted in the VBO step of the procedure above, you can try either method and see what works.


When fields change, you may need to adjust which fields are indexed or displayed in search results; you may also need to rebuild the search index.