Among the most dramatic results of last Monday’s hearing on President of the United States Donald J Trump’s Twitter habits and related matters was the appearance in the virtual pages of Lawfareblog – among the majorest of major minor blogs of this post-blog epoch – of the Phantom Non-Breaking Space Bug.
Chrome Inspection reveals a major minor infestation in Lawfareland:
As of this writing, Chrome and Brave will insert the unwanted, layout-endangering character code via interaction with “TinyMCE,” a key component in text editing applications around the web, at a virtual level beneath or deeper than the HTML that we see in the lower section of the image above. In short, when the user, operating in a certain common way, replaces selected text with new text, an adjacent white space gets converted into a non-breaking space or facsimile, which can disrupt word-wrapping, since the non-breaking space is, perhaps unsurprisingly, a space that does not break: The affected browser treats the &nbsp; as a single letter within a single, not to be broken word. On occasion, the presence of an unwanted non-breaking space may also interrupt other processes or applications (some embedders, footnoters, and the like) that look for an empty space at the particular spot, not a pseudo-empty one.
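To make the difference concrete, here is a minimal JavaScript sketch that can be pasted into a browser console or run in Node. The sample sentence and the helper name splitOnSpaces are made up for illustration. It only shows that code looking for an ordinary space does not see the non-breaking one.

```javascript
// Illustration only: the sample strings and the helper name are hypothetical.
var NBSP = '\u00a0'; // the non-breaking space character, U+00A0

// Splits on ordinary spaces only, the way a naive word-finder might.
function splitOnSpaces(text) {
    return text.split(' ');
}

var clean = 'legal and Trump';
var buggy = 'legal' + NBSP + 'and Trump'; // looks identical on screen

console.log(splitOnSpaces(clean).length); // 3
console.log(splitOnSpaces(buggy).length); // 2
console.log(clean === buggy);             // false
```

The second string has only one place where a line may wrap, which is the layout problem in miniature.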
Note that the bug does not recur during many common types of revision, but mainly during a peculiar type of text highlighting and replacement: The typical writer behavior that produces the problem is, to be more precise, a kind of confined over-writing (in the operational, not the stylistic sense), not usually simple deletion and replacement. Note also that it occurs only while working in the “Visual” editor (the preference of, I believe, the vast majority of WordPress users). ((If you wanted to reproduce the bug for your own edification, you could open a WordPress post for editing, target some text for inspection, highlight just the text you want to replace, and then hit the backspace key once, and you’ll see the &nbsp; appear. A lot of the time in such situations, a user would proceed by hitting backspace a second time, and in the process clear the &nbsp;, but if, instead, you carefully highlight only the text you want to replace, then over-write it with your intended improvement, then skim over to some other location or just hit “update,” the &nbsp; will remain behind.))
Unfortunately, the problematic approach to revising text is quite common and in some circumstances and for some writers will be second nature: see a word, want a different word, over-write it, move on: I do it, and I would be willing to bet that Mr. Wittes at Lawfare was doing it when finishing off his latest excellent piece on matters legal and Trump. On a heavily revised post, the bugs may accumulate, and, to make things even worse, these potential layout-mutilators, blogger-humiliators will remain invisible to you when you are looking at your apparently quite beautiful handiwork in un-augmented WordPress Visual or Text editing panes. If you’re in a bit of a hurry and neglect to check your preview carefully, it may be only the published post that suddenly reveals the typographical horror that you have unleashed upon your readership.
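Because the characters are invisible in the editing panes, it can help to make them show up on demand. The following is a rough JavaScript sketch, not part of WordPress or TinyMCE: the function names markNbsps and countNbsps and the [NBSP] marker are my own inventions. It flags both the HTML entity and the literal character in a string of post HTML.

```javascript
// Illustration only: markNbsps, countNbsps and the marker text are hypothetical.

// Replaces both the HTML entity and the literal U+00A0 character with a
// visible marker so that stray non-breaking spaces can be spotted.
function markNbsps(html, marker) {
    marker = marker || '[NBSP]';
    return html.replace(/&nbsp;|\u00a0/g, marker);
}

// Counts how many non-breaking spaces (in either form) a string contains.
function countNbsps(html) {
    var matches = html.match(/&nbsp;|\u00a0/g);
    return matches ? matches.length : 0;
}

var sample = 'see a word,&nbsp;want a different\u00a0word';
console.log(markNbsps(sample));  // see a word,[NBSP]want a different[NBSP]word
console.log(countNbsps(sample)); // 2
```

Running something like this against copied post HTML before publishing is one way to catch the bug without picking through the markup by eye.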
As of this writing, the bug still affects Chrome and Brave, which are both “WebKit”-derived browsers, but not (at least on Windows 10) Safari, even though Safari happens to be the WebKit grandparent. Otherwise, comparisons across browsers and machines may turn up other incidences or ways to non-breakingly break your layouts, similar to others reported over the years, but, in any event, a particular problem on Chrome – est. over 50% of all web usage as of December 2016 – affecting WordPressers (ca. 25% of the web last I checked) is already a widespread problem, even if in itself the problem may seem to some to be a small problem. ((It very likely affects other writers on other platforms, but they’ll just have to fend for themselves, I guess.))
OK, so how about fixing it?
Because the bug occurs “upstream” from WordPress itself, we cannot fully eradicate it from within WordPress, but we can prevent it from being saved to the database and actually affecting display when posts are loaded.
First, I’ll present a way to stop the dang thing from happening, then I’ll go over some choices for handling archived posts already littered with the unwanted characters.
Using PHP and WordPress filters to get rid of the bad characters
The most straightforward PHP solutions will utilize either the content_save_pre or wp_insert_post_data WordPress filter. All post content passes through each of these filters before getting saved. wp_insert_post_data also handles a lot of things other than post content, so qualifies as both more powerful and more complicated than content_save_pre.
First, for a simple, global fix, here’s the content_save_pre version.
    /**
     * REMOVE NON-BREAKING SPACES FROM POST CONTENT
     * Addresses bug affecting TinyMCE in Chrome, Brave, and other browsers
     * handles $content already treated by Next Gen Gallery et al
     * also works with wp_insert_post_data, as per answer by Rarst at
     * http://wordpress.stackexchange.com/questions/168356/how-to-stop-wordpress-from-saving-utf8-non-breaking-space-characters
     * set 99 execution priority so after default filters have "fired"
     **/
    add_filter( 'content_save_pre', 'remove_buggy_nbsps', 99 );

    function remove_buggy_nbsps( $content ) {
        // strip both types of non-breaking space in case NGG or similar is installed
        // first the HTML entity form
        $content = str_replace( '&nbsp;', ' ', $content );
        // then the raw UTF-8 bytes: double quotes are required here, since PHP
        // does not interpret \x escapes inside single-quoted strings
        $content = str_replace( "\xc2\xa0", ' ', $content );
        return $content;
    }
The main peculiarity to note in the above concerns the characters that WordPress, or, to be more precise, the interaction of browser and database encoding, interprets as &nbsp;, but which lurk in the “UTF-8”-encoded WordPress MySQL database as \xc2\xa0. It can be confusing and frustrating if you were not already aware of this factor – and that’s all I really wanted to say here, except that I also have to note the other peculiarity of the code, which is that it addresses both the HTML &nbsp; as well as the UTF-8 \xc2\xa0.
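For anyone who wants to see the relationship between the two forms, here is a small JavaScript sketch for the browser console or Node. It is an illustration under my own naming (stripNbsps is not a WordPress or TinyMCE function), and it relies on the built-in TextEncoder. It shows that the single character U+00A0 is the two bytes C2 A0 in UTF-8, and then mirrors the logic of the PHP function above by handling both the entity and the character.

```javascript
// Illustration only: stripNbsps is a hypothetical name.
// TextEncoder is built into modern browsers and Node.

// The single character U+00A0 becomes two bytes when encoded as UTF-8.
var bytes = Array.from(new TextEncoder().encode('\u00a0'));
var hex = bytes.map(function (b) { return b.toString(16); });
console.log(hex.join(' ')); // c2 a0

// Mirror of the PHP logic: handle the entity and the literal character.
function stripNbsps(content) {
    return content
        .replace(/&nbsp;/g, ' ')   // HTML entity form
        .replace(/\u00a0/g, ' ');  // literal character (bytes C2 A0 in UTF-8)
}

console.log(stripNbsps('a&nbsp;b\u00a0c') === 'a b c'); // true
```

Which of the two forms a given piece of code actually sees depends on what has already processed the content, which is why covering both is the cautious choice.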
The UTF-8 version was supplied by WordPress maven “Rarst” at StackExchange as per the link in the code, but Rarst’s solution would not work at, for instance, this site, or any other of the 1 million WordPress sites that use Next Generation Gallery (NGG), or at any site where a plug-in was in use that happens to treat “the content” with script similar to the one that, as I discovered, NGG uses. ((I had persuaded myself – and had misinformed some users – in a WordPress Support Forum and at Make WordPress Core – that the same code except with &nbsp;, only, not with the UTF-8 encoding, would solve their problems. However, when I happened to test my work with NGG de-activated, the fix failed: It just so happens that, deeply buried in NGG (one of the more complex of widely used WP plug-ins), an Ajax process is initiated that makes the character-code accessible to the function as HTML. With NGG active, the UTF-8 version doesn’t work. So, I’ll recommend using code with both versions.))
For the vast majority of users, the above will be as good as a true fix, though there are times and places where, in the course of human blog-editing events, someone might like to keep an &nbsp;, or even a bunch of them. For those who need to allow for exceptions, we can add one of WordPress’s many built-in conditional tags to isolate certain types of post or post content from the function.
For the example, we’ll shift to wp_insert_post_data:
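    /**
     * REMOVE NON-BREAKING SPACES FROM POST CONTENT
     * EXCEPT IN SILLY APP CATEGORY
     * Addresses bug affecting TinyMCE in Chrome and other browsers
     * Add to theme functions.php or plug-in
     **/
    add_filter( 'wp_insert_post_data', 'remove_buggy_nbsps_with_exceptions', 99, 2 );

    function remove_buggy_nbsps_with_exceptions( $data, $postarr ) {
        // $data is an array of post fields, so target post_content only;
        // pass the post ID explicitly rather than relying on the global $post
        $post_id = isset( $postarr['ID'] ) ? (int) $postarr['ID'] : 0;
        // don't do anything if is in "Silly App" category
        if ( ! $post_id || ! in_category( 'Silly App', $post_id ) ) {
            // strip both types of non-breaking space in case NGG or similar is installed
            $data['post_content'] = str_replace( '&nbsp;', ' ', $data['post_content'] );
            // double quotes so that PHP reads \xc2\xa0 as the two UTF-8 bytes
            $data['post_content'] = str_replace( "\xc2\xa0", ' ', $data['post_content'] );
        }
        return $data;
    }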
You could write something that gave users the ability to exclude other types of posts, or even to quarantine or reverse-quarantine particular passages, though I think that the vast majority of users will be satisfied with never using &nbsp;’s in post content at all, as in the first example.
There is a third PHP version, or actually something of a hybrid, that also works, but not quite as well, in my opinion. It was offered by a user at Make WordPress Core, on one of several somewhat redundant threads on this subject. He or she neglected to include a full working version of it, so I’ll do that here. Note that you may have to save twice to get it to work after installation.
    /**
     * Hybrid NBSP Killer
     * based on script at https://core.trac.wordpress.org/ticket/31157#comment:5
     **/
    add_action( 'after_wp_tiny_mce', 'kill_nbsps_with_php_and_js' );

    function kill_nbsps_with_php_and_js() {
        ?>
        <script type="text/javascript">
        tinymce.on('AddEditor', function(event) {
            var editor = event.editor;
            editor.on('getContent', function(e) {
                // read the raw content without re-firing this event
                var content = editor.getContent({ format: "raw", no_events: 1 });
                e.content = content.replace(/&nbsp;/ig, ' ');
            });
        });
        </script>
        <?php
    }
The advantage of this hybrid might be that it will leave hard-coded &nbsp;’s intact if entered and saved under the “Text” pane. However, solutions that depend on users remembering to avoid the Visual pane are prone to failure. Having a category or other taxonomy, or perhaps a format or Custom Post Type, set aside for leaving desired non-breaking spaces alone would be more user-proof. Though with some further work we could introduce warnings or other workarounds, I’ll stick with the other alternatives.
But what about all the posts already saved with all those extra &nbsp;’s?
There are two ways to handle the problem of already-infected legacy posts. Which one you choose will depend on how much you care about curing your database permanently, and how comfortable you are dealing with the database directly. If you actually like or need, or liked or needed, to use non-breaking spaces for certain purposes, meaning your database includes a sprinkling or more of non-breaking spaces that you’d like to preserve, that might also figure into your calculations.
jQuery-style
Without touching your database, and independently of whether or not you employ a version of the PHP functions, you can apply a jQuery solution that will at least preserve good layout. Note that all of these examples assume your theme is using standard WordPress post classes: If you’re using a specialized theme with its own unique approach, the code might have to be adjusted, perhaps as substantially as your theme differs from vanilla WordPress.
For the front end, the following script will do – but it needs to be enqueued carefully. A safe general-purpose alternative will be supplied further below:
    /* Substitutes Empty Space for &nbsp; in Post Content */
    jQuery( function($) {
        var oldhtml = $('.post-entry').html();
        var newhtml = oldhtml.replace(/&nbsp;/g, ' ');
        $('.post-entry').html(newhtml);
    });
Enqueue it only for single-post pages – as below – or you will likely end up with destroyed front pages or archive pages:
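    /**
     * GETS RID OF NBSPs on Single Post Pages
     * Add the script itself by normal means, in the add_anti_nbsp_script callback
     * Conditional tags such as is_singular() only work once the main query has run,
     * so the check is made on the 'wp' hook rather than when functions.php loads
     **/
    add_action( 'wp', 'maybe_hook_anti_nbsp_script' );

    function maybe_hook_anti_nbsp_script() {
        if ( is_singular() ) {
            add_action( 'wp_enqueue_scripts', 'add_anti_nbsp_script' );
        }
    }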
The same script will work on the Admin side, but with body substituted for .post-entry.
How exactly you’ll want to add the script to your set-up may vary with features of your theme, as well as with your preferences for script consolidation. For details on how to add jQuery scripts to WordPress themes, refer to the WordPress Codex or any of countless tutorials, but here’s a basic format for adding the script, just for “singular” posts/pages, excluding the ‘Silly App’ category, and post ID# 64513. We’re locating the script in the anti_nbsp.js file in our theme’s js folder:
    /**
     * ENQUEUE SCRIPTS FOR "SAMPLE CHILD" THEME
     * Don't if post is post ID# 64513
     * or in category "Silly App"
     **/
    function add_anti_nbsp_script() {
        wp_register_script(
            'anti-nbsp',
            get_stylesheet_directory_uri() . '/js/anti_nbsp.js',
            array( 'jquery' ),
            SAMPLE_CHILD_VERS, //for Defined Version of Theme*
            FALSE
        );
        global $post ;
        if ( is_singular() && ! in_category( 'Silly App' ) && 64513 !== $post->ID ) {
            wp_enqueue_script( 'anti-nbsp' ) ;
        }
    }
    add_action( 'wp_enqueue_scripts', 'add_anti_nbsp_script' ) ;
*See notes. ((I like to add a version number to scripts during testing, so that when I update them I don’t have to re-set the browser cache to see the results. For production sites, you may wish to leave these off for sake of the (slight) benefit to page speed scores.))
The above code sets the script to work for single post pages only. If your theme uses “the_excerpt” and is set to include posts as saved up to a “read_more” link (fairly standard among themes, but by no means universal), you may still end up with some pesky NBSPs on archive pages ( “archive pages” in this usage includes typical blog front pages as well as typical author pages, search pages, category pages, and so on).
To cover that alternative, you might want to try the following script, enqueued without the exclusions. Or you can use both scripts, with the following one enqueued separately for ! is_singular() – i.e., for archive pages rather than single-post pages:
    //GENERAL PURPOSE ANTI-NBSP FOR BOTH "POSTS" AND "PAGES" ON FRONT END
    jQuery( function($) {
        //switching to .hentry to include pages as well as posts
        $('.hentry').each( function() {
            var oldhtml = $(this).html();
            var newhtml = oldhtml.replace(/&nbsp;/g, ' ');
            $(this).html(newhtml);
        }) ;
    });
The above script enqueued without exclusions would work for posts whether on single post pages or in archives – so, if you’re not really worried about keeping some &nbsp;’s, it might be preferable.
Working out more precisely targeted exclusions for particular posts or post-types with hard-coded &nbsp;’s that serve some purpose for you or your plug-ins, and that you want to be allowed in archive pages, too, while not enqueueing the script separately, is certainly possible. Almost anything is possible text-manipulation-wise, but the precise method will probably need to be tailored to your installation: For instance, if we knew that your app always produced a certain CSS class, we could target the exclusion – maybe $('.hentry').not('.category-silly-app').each(//etc. – or some such. ((By the way, .hentry has been around forever. What does or did it stand for, I wonder. “Headline entry” maybe?))
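Spelled out a little further, the class-based exclusion might look like the sketch below. It assumes a theme that uses WordPress’s post_class() output, which adds classes such as hentry and category-{slug} to each post wrapper. The silly-app slug and the helper name replaceNbspEntities are examples of mine, not anything standard. The replacement is kept in a plain function so it can be checked on its own, and the jQuery part only runs when jQuery is present.

```javascript
// Sketch only: assumes post_class() markup; the category slug and the
// helper name are made-up examples.

// Pure helper, so the replacement can be checked without a page.
function replaceNbspEntities(html) {
    return html.replace(/&nbsp;/g, ' ');
}

// Only wire up the DOM part when jQuery is actually available.
if (typeof jQuery !== 'undefined') {
    jQuery(function ($) {
        // every post or page wrapper except those in the excluded category
        $('.hentry').not('.category-silly-app').each(function () {
            $(this).html(replaceNbspEntities($(this).html()));
        });
    });
}

console.log(replaceNbspEntities('one&nbsp;two')); // one two
```

One caution: rewriting html() replaces the child nodes, so event handlers that other scripts attached directly to elements inside the post would be lost. Treat this as a starting point to adapt, not a drop-in.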
Curing the db
Finally, if you want to cure the database – on general principle or because that’s what your client wants – you can do it with WordPress commands: WordPress is already a database manager, although it’s user-friendlier for some purposes than for others.
Your code might look something like the following:
    /**
     * GET RID OF NBSPS IN DATABASE
     * use $args (see https://codex.wordpress.org/Class_Reference/WP_Query)
     * to adjust selection parameters, as operation is resource-intensive
     **/
    $args = array(
        'numberposts' => -1, //(get 'em all, maybe)
        // you'll want to narrow it on any sizeable database
    );
    $posts = get_posts($args);
    foreach ( $posts as $post ) {
        $content = $post->post_content;
        // the regex engine itself reads \xC2\xA0 as the two UTF-8 bytes
        $content = preg_replace('/\xC2\xA0/', ' ', $content);
        $newpost = array(
            'ID' => $post->ID,
            // wp_update_post() expects slashed data, so wp_slash() keeps
            // any backslashes in the content intact
            'post_content' => wp_slash( $content ),
        );
        wp_update_post( $newpost );
    }
The catch with the above is that the operation turns out to be resource-intensive: Even on an only moderately large database of posts – ca. 1,000 or so – you may get timeouts, and either need to adjust your PHP configuration, or slice up the selection into pieces using WP_Query parameters, which get_posts() does accept, and which you can also use to exclude certain posts or types of post.
For a large database, if you’re not comfortable with MySQL database operations, you may be better off 1) using phpMyAdmin or some other tool to export the posts table, then 2) using a text editor to run a search and replace on the table (for “\xc2\xa0”), then 3) re-importing it in place of the original. All usual warnings about making backups and backups of backups apply here, especially if you’re not used to working with the db. More complex operations will require good MySQL skills.
If you do need to slice and dice the table – specifically excluding or including different categories or posts or post-types, including some time periods but not others, etc. – you may be better off going back to WP_Query arguments, which, not to put a fine point on it, were designed by people whose MySQL skills are likely better than yours and certainly superior to mine.
Or…
You could also try never editing posts on affected browsers. Or you could be very careful when editing posts on those browsers, including by picking through the text using an inspection tool, and manually deleting &nbsp;’s. …Or by never using the Visual Editor. Or you could live with occasionally messed-up layouts, while waiting for an upstream fix that may never come…
…or that may turn up unexpectedly any day, rendering this entire discussion obsolete…
What I do:
In my case, since I used Firefox for almost everything for years, and tried to correct bad layouts when they cropped up, I’m not worried about the database being heavily infected. Since I now use Chrome as a baseline and Brave as a backup, however, I’ll be sticking with the first function for new posts, and the jQuery solution to keep the Front End tidy.
I just want to thank you, both for the code and thorough explanation. I had been driven nuts for months by extra (ampersand)nbsp; being inserted into WordPress posts when pasting content from MS Word. Now when saving WP posts as draft, they are scrubbed out. I am grateful for the time saved, now and in the future.