Hello!
Recently we experienced a severe slowdown bottleneck with a Drupal site. After some debugging we decided to turn on mysql slow query logging. After watching the logs we noticed that while the content page was loading in Drupal’s administration back-end there was extremely long scrolling of encoded binary data :
INSERT INTO cache_form (cid, serialized, created, expire, data) VALUES ('form_state_form-xHslzQjLR4W0RX16BL5bG-7K9ewYMiSpj7n-tt7-MBA', '1', '1468987323', '1469008923', '
The scrolling was endless! Even though the content list view in Drupal paginates after displaying about 40-50 nodes, loading these data blobs for every node caused the page load time to take 2-3 minutes! Sometimes it even just timed out and failed with an error. This behavior would be dependent on the php.ini execution time settings and perhaps php-fpm configuration directives for memory allocation and whatnot.
Whats worse, when this was happening, all other actions on the server were impossible to complete. This is because the PHP process spiked to 100% CPU from having to process the large amount of data from the database and convert it to an image.
This was extremely bad and inefficient. Thankfully the server that is used to make changes is segmented from the production site so there were no site outages.
After doing some research, we discovered that images being dropped into the TinyMCE WYSIWYG editor converted them into base64 encoded data, which is then saved into the database. Essentially images are being saved into the database instead of as files, when they are dragged and dropped into the WYSIWYG editor.
The risk of this happening in Drupal increases if you have many users interacting with Drupal content. Resolving this problem is two-fold. The first would be to prevent dragging/dropping images moving forward and force users to use the file upload feature built into the WYSIWYG editor. I wont go into how to do that as its already well documented here.
Retroactively clean base64 encoded images in Drupal
The best way to approach cleaning up nodes in drupal would be to create a module / drush command. In our case we had thousands of nodes where this potentially happened across a long period of time. It got really bad when people dragged and dropped uncompressed images one too many times. Some of the images were 2-4mb in size. Imagine the strain that puts on your database!
I’ll break down the drupal module below, but if you want to dive right into it, you can simply get the Drupal module from our GitHub project.
Create your drush command
We want a simple drush command that allows for the following arguments : content type, language and field name. This will make the module very flexible and work across any content type. Any field that you have a WYSIWYG editor can be addressed by this module.
function star_imgclean_drush_command() {
        $currentdate = date('Y_m_d_H_i_s');
        $items['star-clean'] = array(
                'description' => 'Clean the field of a content type',
                'aliases' => ['star-cln'],
                'callback' => 'star_imgclean',
                'arguments' => array(
                        'content_type' => 'Content type machine name',
                        'lang' => 'Language (en, und or fr)',
                        'field_name' => 'Field to scan',
                        ),
        );
return $items;
}Search for fields that contain base64 encoded images
What we want to do is query the specified content type and field and use PHP’s DOMDocument to load the field’s value in order to parse any “img” tags. Then we check those image tags to see if they have base64 encoded binary data. If they have it, we correct it.
This part of the process took a bunch of trial and error. Mainly because we used regular expressions to match and validate if the img tag has base64 data and if that data matched the mime type of a valid image.
// Load the HTML from the field and parse the img src tags
                        $doc = new DOMDocument();
                        @$doc->loadHTML($content);
                        $tags = $doc->getElementsByTagName('img');
                        $count = 0;
                        // If we found img tags
                        if (count($tags) > 0) {
                                foreach ($tags as $tag) {
                                        $count++;
                                        // Search for base64 encoded data within the img tag
                                        $verify = preg_match('/data:([^;]*);base64,(.*)/', $tag->getAttribute('src'), $match);
                                        if ($verify) {
                                                $data_raw = explode(',', $tag->getAttribute('src'));
                                                $data = base64_decode($data_raw[1]);
                                                $f = finfo_open();
                                                $mime_type = finfo_buffer($f, $data, FILEINFO_MIME_TYPE);
                                                // We need to make sure the encoding of the base64 data is actually an image
                                                $verify_mime = preg_match('/image\/(png|jpg|jpeg|gif)/', $mime_type, $mime_match);If the mime type matched a png, jpg or gif then we want to convert it!
Convert image base64 data to a file
This is where the real magic happens. We want to take that raw base64 data and save it as an image file. Then we want to update the content to refer to the file location instead of the base64 data. This is essentially where the database gets cleaned up because we will be re-saving the field without those huge sets of encoded base64 data into the database.
echo "doing image ". $count ." for node ". $node->nid . "..\r\n";
$new_file = 'sites/default/files/'.$node->nid.'-imgclean-'.$count.'.jpg';
$image = imagecreatefromstring($data);
imagejpeg($image, $new_file, 50);
// Update the img src with DOMDocument
$tag->setAttribute('src', '/'.$new_file);
$content = $doc->saveHTML();What we’re doing above (that you cant see) is iterating across all the img tags in the field. We then establish a counter and create a unique filename for each of the converted files. This means that there can be 10 base64 encoded images in the field and we’ll convert them all. We establish a unique filename for each iteration and use the PHP function imagecreatefromstring and then imagejpeg to convert it to a file.
Once the file is saved , we save the new img src with DOMDocument and update the content variable. When the iteration across the found img tags is complete, we save the node before moving to the next node. There’s two nested for loops within this function , but its pretty straightforward.
There might be ways to make this function execute faster. For about 3,000 nodes it took about 5 minutes to complete.
I hope this helps someone out there! The performance improvements from doing this cleanup are drastic and definitely noticeable. It should be noted that the behavior of converting dropped images in the WYSIWYG editor is expected and part of the TinyMCE editor.