Tuesday, August 25, 2015

Curlify - Make all your quote characters curly

Blogger’s web-based composer is a simple-to-use interface for creating blog entries. It’s got a basic word processor interface and includes formatting options like bold, italic, numbered lists, bulleted lists, etc. Unlike most modern word processors, the composer only inserts straight quotes rather than the fancier curly quotes. What’s the difference, you may ask?

Straight quotes are the two generic vertical quotation marks located near the return key: the straight single quote ( ' ) and the straight double quote ( " ).

Curly quotes are the quotation marks used in good typography. There are four curly quote characters: the opening single quote ( ‘ ), the closing single quote ( ’ ), the opening double quote ( “ ), and the closing double quote ( ” ).

Search and replace algorithm


Doing a search and replace seems like an easy task at first. However, we need to replace two kinds of straight quotes with four kinds of curly quotes. How do we know which to use? Opening quotes go at the beginning of a word, and closing quotes are used at the end or in the middle of a word. In the middle?! Yes. A contraction like “It’s” is an example of this. The single curly quote used to indicate that the word is a contraction is a closing single quote. Therefore, our algorithm can be simplified:
  1. Replace all straight quotes with closing quotes of the same type.
  2. Replace closing quotes at the beginning of a word with an opening quote of the same type.
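On plain text with no markup, those two steps might be sketched in JavaScript roughly like this. The function name curlifyPlain is made up for illustration, and the sketch assumes “the beginning of a word” means the start of a line or right after whitespace:

```javascript
// Hypothetical helper: the two-step idea on plain text only (no HTML handling).
function curlifyPlain(text) {
  return text
    // Step 1: every straight quote becomes a closing quote of the same type.
    .replace(/'/g, "\u2019")
    .replace(/"/g, "\u201D")
    // Step 2: a closing quote at the start of a line or after whitespace
    // becomes an opening quote of the same type.
    .replace(/(^|\s)\u2019/gm, "$1\u2018")
    .replace(/(^|\s)\u201D/gm, "$1\u201C");
}

console.log(curlifyPlain("\"It's a beautiful day in the neighborhood.\""));
// prints “It’s a beautiful day in the neighborhood.”
```

The apostrophe in the contraction stays a closing quote because it is never preceded by whitespace.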
Seems simple enough. However, the composer offers no intelligent search and replace feature. In fact, there’s no search and replace feature at all! We will need to use an editor outside of the composer in Blogger. Text copied and pasted from the composer into a word processor or text editor will lose or corrupt the formatting in the process. We can preserve all of the formatting by switching to the HTML view. I often go into the HTML view to make tweaks that aren’t available through the composer interface. For example, inserting non-breaking spaces ( &nbsp; ) to ensure numbers with units don’t get split across lines.

Ah, but now we have a new problem. The HTML formatting, which was hidden in the composer view, is now visible and will affect the search and replace. For example:

"It's a beautiful day in the neighborhood."

in HTML view is

<div style="text-align: center;">"It's a <b>beautiful</b> day in the neighborhood."</div>

Our previous algorithm will not work. The opening quote is no longer guaranteed to be preceded by a space or newline character. Even worse, there are quotes inside of the HTML tags that cannot be modified. Our algorithm is sound, but we need to account for the HTML tags. Our search and replace function must skip over them.
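To see the failure concretely, here is a small sketch that runs the plain-text steps over the HTML line above. It assumes the same “start of line or after whitespace” rule for opening quotes, and the variable names are only illustrative:

```javascript
// The HTML-view line from the example above.
const html = '<div style="text-align: center;">"It\'s a <b>beautiful</b> day in the neighborhood."</div>';

// Naive approach: apply the plain-text steps without looking at tags.
const naive = html
  .replace(/'/g, "\u2019")
  .replace(/"/g, "\u201D")
  .replace(/(^|\s)\u2019/gm, "$1\u2018")
  .replace(/(^|\s)\u201D/gm, "$1\u201C");

console.log(naive);
// prints <div style=”text-align: center;”>”It’s a <b>beautiful</b> day in the neighborhood.”</div>
```

Two things went wrong: the quotes around the style attribute were curled, which breaks the tag, and the quote before “It’s” stayed a closing quote because it follows a tag rather than a space.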

Regular expressions to the rescue!


There’s a powerful text processing engine we can use called regular expressions or regex for short. Here’s our algorithm using regex:
  1. s/'(?!([^<]+)?>)/’/mg and s/"(?!([^<]+)?>)/”/mg
  2. s/((^|\s+)(<[^>]*>)*)’/\1‘/mg and s/((^|\s+)(<[^>]*>)*)”/\1“/mg
Got all that? Heh, well, if you’re not familiar with regular expressions, that looks like a bunch of gobbledygook. The first step says, “replace all quote characters with a curly closing quote character of the same type as long as there’s no greater-than character after it without a less-than character preceding it.” In other words, make sure you’re not replacing any straight quotes inside of an HTML tag, which is any text in between a less-than and a greater-than character. The second step[1] replaces every closing quote character that is at the beginning of a line or preceded by a space with an opening quote character of the same type. It also allows any number of HTML tags between the space and the quote character since the HTML tags should be ignored. We don’t need to worry about replacing anything within an HTML tag since the first search and replace took care to only replace straight quotes outside of tags.[2]
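As a quick check, the two regex steps can be tried on the earlier HTML example. This is only a sketch: curlifyTagged is a made-up name, and it assumes that the less-than and greater-than characters appear only as tag delimiters in the input:

```javascript
// Hypothetical helper: steps 1 and 2 with the tag-skipping regexes.
function curlifyTagged(html) {
  return html
    // Step 1: straight quotes outside of tags become closing quotes.
    // The negative lookahead rejects a quote that is followed by ">"
    // with no "<" in between, meaning the quote sits inside a tag.
    .replace(/'(?!([^<]+)?>)/g, "\u2019")
    .replace(/"(?!([^<]+)?>)/g, "\u201D")
    // Step 2: closing quotes at the start of a line or after whitespace,
    // optionally separated from it by tags, become opening quotes.
    .replace(/((^|\s)(<[^>]*>)*)\u2019/gm, "$1\u2018")
    .replace(/((^|\s)(<[^>]*>)*)\u201D/gm, "$1\u201C");
}

const line = '<div style="text-align: center;">"It\'s a <b>beautiful</b> day in the neighborhood."</div>';
console.log(curlifyTagged(line));
// prints <div style="text-align: center;">“It’s a <b>beautiful</b> day in the neighborhood.”</div>
```

The attribute quotes are left alone, and the quote right after the div tag becomes an opening quote because the tag sits at the beginning of the line.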

Our algorithm doesn’t take into account nested quote characters. This will probably never come up in a properly formatted document, but something like:

"'It's not too late,' exclaimed Matilda, 'to catch them!'"

Should look like:

“‘It’s not too late,’ exclaimed Matilda, ‘to catch them!’”

However, the first single quote will not be replaced with an opening curly single quote because the single quote is not preceded by a space. We can fix this by adding a third step to the algorithm that changes every closing quote immediately preceded by an opening quote into an opening quote:
  3. s/((‘|“)(<[^>]*>)*)’/\1‘/mg and s/((‘|“)(<[^>]*>)*)”/\1“/mg
The second step of the algorithm guarantees the first opening quote will always be correct, so all quote characters after it can be changed assuming there’s nothing in between other than HTML tags.
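Here is a small sketch of that nested-quote step on its own. It starts from the Matilda sentence as it would look after the first two steps, and fixNestedOnce is a name invented for this example:

```javascript
// Hypothetical helper: one pass of the nested-quote step.
// A closing quote that directly follows an opening quote (with only
// HTML tags in between) is turned into an opening quote.
function fixNestedOnce(text) {
  return text
    .replace(/((\u2018|\u201C)(<[^>]*>)*)\u2019/g, "$1\u2018")
    .replace(/((\u2018|\u201C)(<[^>]*>)*)\u201D/g, "$1\u201C");
}

// The Matilda sentence after steps 1 and 2: the second character is
// still a closing single quote (\u2019).
const afterStepTwo =
  "\u201C\u2019It\u2019s not too late,\u2019 exclaimed Matilda, \u2018to catch them!\u2019\u201D";

console.log(fixNestedOnce(afterStepTwo));
// prints “‘It’s not too late,’ exclaimed Matilda, ‘to catch them!’”
```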

There’s one more gotcha... Look at this ridiculous example:

"'"'"'"'"'Torture Test'"'"'"'"'"
becomes

“‘”’”’”’”’Torture Test’”’”’”’”’”

Why didn’t the other closing quotes before the word “Torture” get changed to opening quotes? After the first single closing quote gets replaced with an opening quote, the search and replace does not use the modified character to evaluate the next quote character. The solution is to keep checking the string after the search and replace commands for more erroneous closing quote characters and re-run the search and replace until the pattern can no longer be found in the string.[3]
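The re-run-until-clean idea might look something like this sketch. It starts from the torture test as it stands after the first two steps; fixNested and the pass counter are illustrative additions, not part of the original program:

```javascript
// An opening quote directly followed by a closing quote (tags allowed between).
const OPEN_THEN_CLOSE = /(\u2018|\u201C)(<[^>]*>)*(\u2019|\u201D)/;

// Hypothetical helper: repeat the nested-quote step until nothing is left to fix.
function fixNested(text) {
  let passes = 0;
  while (OPEN_THEN_CLOSE.test(text)) {
    text = text
      .replace(/((\u2018|\u201C)(<[^>]*>)*)\u2019/g, "$1\u2018")
      .replace(/((\u2018|\u201C)(<[^>]*>)*)\u201D/g, "$1\u201C");
    passes += 1;
  }
  return { text, passes };
}

// Torture test after steps 1 and 2: only the very first quote is an opening quote.
const tortureAfterStepTwo =
  "\u201C" + "\u2019\u201D".repeat(4) + "\u2019" + "Torture Test" + "\u2019\u201D".repeat(5);

const result = fixNested(tortureAfterStepTwo);
console.log(result.text);   // prints “‘“‘“‘“‘“‘Torture Test’”’”’”’”’”
console.log(result.passes); // prints 5
```

Each pass turns at least one closing quote into an opening quote, so the loop always ends.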

Creating a program


There are lots of ways to implement this algorithm. Some text editors have a regex engine available in their search and replace function. Of course, the fact that the algorithm requires multiple steps makes that solution unwieldy. I wrote a program[4] using AutoIt, which is a freeware BASIC-like scripting language designed for automating the Windows GUI and general scripting. It can be downloaded and installed for free. However, I have a compiled executable available for download at the end of this article that should work on any 32-bit or 64-bit version of Windows. Anyone with a little bit of programming experience can probably understand how this works and implement it in a different language. If you need something that works on an operating system besides Windows, let me know.

Source code:

#include <ButtonConstants.au3>
#include <EditConstants.au3>
#include <GUIConstantsEx.au3>
#include <WindowsConstants.au3>
#include <FontConstants.au3>


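; Convert the straight quotes in $sInput to curly quotes, skipping quotes inside HTML tags.
Func Curlify($sInput)
    ; Step 1 (single quotes): straight quotes outside of tags become closing quotes.
    Local $sOutput = StringRegExpReplace($sInput, "'(?!([^<]+)?>)", "’")
    ; Step 2 (single quotes): closing quotes at the start or after whitespace become opening quotes.
    $sOutput = StringRegExpReplace($sOutput, "((^|\s+)(<[^>]*>)*)’", "${1}‘")
    ; Steps 1 and 2 again for double quotes.
    $sOutput = StringRegExpReplace($sOutput, '"(?!([^<]+)?>)', "”")
    $sOutput = StringRegExpReplace($sOutput, "((^|\s+)(<[^>]*>)*)”", "${1}“")
    ; Fix nested quotes: repeat until no opening quote is directly followed by a closing quote.
    While StringRegExp($sOutput, "(‘|“)(<[^>]*>)*(’|”)")
        $sOutput = StringRegExpReplace($sOutput, "((‘|“)(<[^>]*>)*)’", "${1}‘")
        $sOutput = StringRegExpReplace($sOutput, "((‘|“)(<[^>]*>)*)”", "${1}“")
    WEnd

    Return $sOutput
EndFunc   ;==>Curlify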

$FormCurly = GUICreate("Curlify", 610, 458, 192, 124)
$Original = GUICtrlCreateEdit("", 0, 0, 609, 209)
GUICtrlSetFont(-1, 9, $FW_NORMAL, 0, "Courier New")
GUICtrlSetLimit(-1, 1500000)
$Modified = GUICtrlCreateEdit("", 0, 248, 609, 209)
GUICtrlSetFont(-1, 9, $FW_NORMAL, 0, "Courier New")
GUICtrlSetLimit(-1, 1500000)
$Curly = GUICtrlCreateButton("Curly", 216, 216, 75, 25)
$Reset = GUICtrlCreateButton("Reset", 320, 216, 75, 25)
GUISetState(@SW_SHOW)

While 1
    $nMsg = GUIGetMsg()
    Switch $nMsg
        Case $Curly
            GUICtrlSetData($Modified, Curlify(GUICtrlRead($Original)))

        Case $Reset
            GUICtrlSetData($Original, "")
            GUICtrlSetData($Modified, "")

        Case $GUI_EVENT_CLOSE
            Exit

    EndSwitch
WEnd

Download source and executable: Curlify.zip

Here is the program ported to JavaScript. Just save as an HTML file and open with your web browser:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
  <title>Curlify</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <script type="text/javascript">
    //<![CDATA[
    function Curlify() {
      var sOutput = document.getElementById('Original').value;
      sOutput = sOutput.replace(/'(?!([^<]+)?>)/mg, "\u2019");
      sOutput = sOutput.replace(/((^|\s)(<[^>]*>)*)\u2019/mg, "$1\u2018");
      sOutput = sOutput.replace(/"(?!([^<]+)?>)/mg, "\u201D");
      sOutput = sOutput.replace(/((^|\s)(<[^>]*>)*)\u201D/mg, "$1\u201C");
      // Fix nested quotes
      while (sOutput.search(/(\u2018|\u201C)(<[^>]*>)*(\u2019|\u201D)/m) >= 0) {
        sOutput = sOutput.replace(/((\u2018|\u201C)(<[^>]*>)*)\u2019/mg, "$1\u2018");
        sOutput = sOutput.replace(/((\u2018|\u201C)(<[^>]*>)*)\u201D/mg, "$1\u201C");
      }
      document.getElementById('Result').value = sOutput;
    }

    function resetTextAreas() {
      document.getElementById('Original').value = '';
      document.getElementById('Result').value = '';
    }
    //]]>
  </script>
  <style type="text/css">
    /*<![CDATA[*/
    
    p {
      text-align: center;
    }
    /*]]>*/
  </style>
</head>

<body>
  <p>
    <textarea cols="80" id="Original" rows="10"></textarea>
  </p>
  <p>
    <input id="Curlify" type="button" value="Curlify" onclick="Curlify();" />
    <input id="Reset" type="button" value="Reset" onclick="resetTextAreas();" />
  </p>
  <p>
    <textarea cols="80" id="Result" rows="10"></textarea>
  </p>
</body>

</html>

Notes:

  1. The regex for the second step uses \s+ instead of \s because of AutoIt. I don’t understand why it needs to capture all the spaces to work under certain circumstances, but it does.
  2. I didn’t attempt to change escaped quotes ( &quot; ) inside of tags. It probably wouldn’t be too difficult to do by adding modified rules, but it wasn’t worth the effort to me.
  3. Completely empty quotes like "" don’t get converted to “” and I don’t care to find a solution for that. Besides, word processors don’t seem to convert those either.
  4. I didn’t include any logic in this program to avoid style sheets or other items found in an HTML file where quote characters outside of tags shouldn’t be converted. If you’re working with raw HTML, just copy and paste whatever is in between the <body> tags into the program rather than the whole file contents.
