Straight quotes are the two generic vertical quotation marks located near the return key: the straight single quote ( ' ) and the straight double quote ( " ).
Curly quotes are the quotation marks used in good typography. There are four curly quote characters: the opening single quote ( ‘ ), the closing single quote ( ’ ), the opening double quote ( “ ), and the closing double quote ( ” ).
Search and replace algorithm
Doing a search and replace seems like an easy task at first. However, we need to replace two kinds of straight quotes with four kinds of curly quotes. How do we know which to use? Opening quotes go at the beginning of a word, and closing quotes go at the end or in the middle of a word. In the middle?! Yes. A contraction like “It’s” is an example of this. The single curly quote used to indicate that the word is a contraction is a closing single quote. Therefore, our algorithm can be simplified:
- Replace all straight quotes with closing quotes of the same type.
- Replace closing quotes at the beginning of a word with an opening quote of the same type.
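The two steps above can be sketched in JavaScript for plain text that contains no HTML. This is only a minimal illustration: the function name curlifyPlain is made up for this example, and "beginning of a word" is assumed to mean the start of the text or a position right after whitespace.

```javascript
// Sketch of the two-step idea for plain text only (no HTML handling).
// curlifyPlain is a hypothetical name, not part of the article's program.
function curlifyPlain(text) {
  // Step 1: turn every straight quote into the closing quote of the same type.
  let out = text.replace(/'/g, "\u2019").replace(/"/g, "\u201D");
  // Step 2: a closing quote at the start of the text or right after
  // whitespace begins a word, so swap it for the matching opening quote.
  out = out.replace(/(^|\s)\u2019/g, "$1\u2018");
  out = out.replace(/(^|\s)\u201D/g, "$1\u201C");
  return out;
}

console.log(curlifyPlain(`"It's a beautiful day in the neighborhood."`));
// “It’s a beautiful day in the neighborhood.”
```

The apostrophe in “It’s” is never touched by step 2 because it follows a letter, which is exactly why converting everything to closing quotes first keeps the rules simple.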
) to ensure numbers with units don’t get split across lines.
Ah, but now we have a new problem. The HTML formatting, which was hidden in the composer view, is now visible and will affect the search and replace. For example:
"It's a beautiful day in the neighborhood."
in HTML view is
<div style="text-align: center;">"It's a <b>beautiful</b> day in the neighborhood."</div>
Our previous algorithm will not work. The opening quote is no longer guaranteed to be preceded by a space or newline character. Even worse, there are quotes inside of the HTML tags that cannot be modified. Our algorithm is sound, but we need to account for the HTML tags. Our search and replace function must skip over them.
Regular expressions to the rescue!
There’s a powerful text processing engine we can use called regular expressions, or regex for short. Here’s our algorithm using regex:
s/'(?!([^<]+)?>)/’/mg
and
s/"(?!([^<]+)?>)/”/mg
s/((^|\s+)(<[^>]*>)*)’/\1‘/mg
and
s/((^|\s+)(<[^>]*>)*)”/\1“/mg
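To see the tag skipping at work, here is a rough JavaScript sketch that applies the four substitutions to the HTML example from earlier. The function name curlifyHtml is made up for this illustration, and the patterns use \s rather than \s+, as in the JavaScript port further down.

```javascript
// Sketch: the two steps with HTML tags skipped.
// curlifyHtml is a hypothetical name used only for this example.
function curlifyHtml(html) {
  // Step 1: the negative lookahead (?!([^<]+)?>) rejects any quote that is
  // followed by a ">" with no "<" in between, i.e. a quote inside a tag.
  let out = html.replace(/'(?!([^<]+)?>)/mg, "\u2019");
  // Step 2: a closing quote after start-of-line or whitespace, with any
  // number of tags in between, becomes an opening quote.
  out = out.replace(/((^|\s)(<[^>]*>)*)\u2019/mg, "$1\u2018");
  // Same two steps for double quotes.
  out = out.replace(/"(?!([^<]+)?>)/mg, "\u201D");
  out = out.replace(/((^|\s)(<[^>]*>)*)\u201D/mg, "$1\u201C");
  return out;
}

const sample =
  `<div style="text-align: center;">"It's a <b>beautiful</b> day in the neighborhood."</div>`;
console.log(curlifyHtml(sample));
// <div style="text-align: center;">“It’s a <b>beautiful</b> day in the neighborhood.”</div>
```

The straight quotes around the style attribute survive untouched, while the opening quote is found even though a whole div tag sits between the start of the line and the quote.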
Our algorithm doesn’t take into account nested quote characters. This will probably never come up in a properly formatted document, but something like:
"'It's not too late,' exclaimed Matilda, 'to catch them!'"
should look like:
“‘It’s not too late,’ exclaimed Matilda, ‘to catch them!’”
However, the first single quote will not be replaced with an opening curly single quote because the single quote is not preceded by a space. We can fix this by adding a third step to the algorithm: change every closing quote that immediately follows an opening quote into an opening quote of the same type:
s/((‘|“)(<[^>]*>)*)’/\1‘/mg
and
s/((‘|“)(<[^>]*>)*)”/\1“/mg
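Here is a rough JavaScript sketch of the three steps applied once each to the Matilda sentence. The name curlifyNested is made up for this illustration, and the third step runs only a single time here, which is enough for one level of nesting.

```javascript
// Sketch: steps 1 and 2, followed by ONE pass of the nested-quote step.
// curlifyNested is a hypothetical name used only for this example.
function curlifyNested(text) {
  // Steps 1 and 2 for single quotes, then for double quotes.
  let out = text.replace(/'(?!([^<]+)?>)/mg, "\u2019");
  out = out.replace(/((^|\s)(<[^>]*>)*)\u2019/mg, "$1\u2018");
  out = out.replace(/"(?!([^<]+)?>)/mg, "\u201D");
  out = out.replace(/((^|\s)(<[^>]*>)*)\u201D/mg, "$1\u201C");
  // Step 3: a closing quote that directly follows an opening quote
  // (ignoring tags in between) must itself be an opening quote.
  out = out.replace(/((\u2018|\u201C)(<[^>]*>)*)\u2019/mg, "$1\u2018");
  out = out.replace(/((\u2018|\u201C)(<[^>]*>)*)\u201D/mg, "$1\u201C");
  return out;
}

console.log(curlifyNested(`"'It's not too late,' exclaimed Matilda, 'to catch them!'"`));
// “‘It’s not too late,’ exclaimed Matilda, ‘to catch them!’”
```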
There’s one more gotcha... Look at this ridiculous example:
"'"'"'"'"'Torture Test'"'"'"'"'"
“‘”’”’”’”’Torture Test’”’”’”’”’”
Why didn’t the other closing quotes before the word “Torture” get changed to opening quotes? After the first single closing quote gets replaced with an opening quote, the search and replace does not use the modified character to evaluate the next quote character. The solution is to keep checking the string after the search and replace commands for more erroneous closing quote characters and re-run the search and replace until the pattern can no longer be found in the string.[3]
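The re-run idea can be sketched in JavaScript like this. The name curlifyAll is made up for this illustration; the loop simply repeats the third step for as long as a closing quote still sits directly after an opening quote.

```javascript
// Sketch: the full algorithm with the nested-quote step repeated until stable.
// curlifyAll is a hypothetical name used only for this example.
function curlifyAll(text) {
  let out = text.replace(/'(?!([^<]+)?>)/mg, "\u2019");
  out = out.replace(/((^|\s)(<[^>]*>)*)\u2019/mg, "$1\u2018");
  out = out.replace(/"(?!([^<]+)?>)/mg, "\u201D");
  out = out.replace(/((^|\s)(<[^>]*>)*)\u201D/mg, "$1\u201C");
  // Each pass can only see the quotes as they were when the pass started,
  // so keep going until no opening quote is directly followed by a closing one.
  const misplaced = /(\u2018|\u201C)(<[^>]*>)*(\u2019|\u201D)/m;
  while (misplaced.test(out)) {
    out = out.replace(/((\u2018|\u201C)(<[^>]*>)*)\u2019/mg, "$1\u2018");
    out = out.replace(/((\u2018|\u201C)(<[^>]*>)*)\u201D/mg, "$1\u201C");
  }
  return out;
}

console.log(curlifyAll(`"'"'"'"'"'Torture Test'"'"'"'"'"`));
// “‘“‘“‘“‘“‘Torture Test’”’”’”’”’”
```

The loop is guaranteed to stop because every pass that enters the loop body converts at least one closing quote into an opening quote, and there are only so many quotes in the string.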
Creating a program.
There are lots of ways to implement this algorithm. Some text editors have a regex engine available in their search and replace function. Of course, the fact that the algorithm requires multiple steps makes that solution unwieldy. I wrote a program[4] using AutoIt, which is a freeware BASIC-like scripting language designed for automating the Windows GUI and general scripting. It can be downloaded and installed for free. However, I have a compiled executable available for download at the end of this article that should work on any 32-bit or 64-bit version of Windows. Anyone with a little bit of programming experience can probably understand how this works and implement it in a different language. If you need something that works on an operating system besides Windows, let me know.
Source code:
#include <ButtonConstants.au3>
#include <EditConstants.au3>
#include <GUIConstantsEx.au3>
#include <WindowsConstants.au3>
#include <FontConstants.au3>
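Func Curlify($sInput)
    ; Step 1 (single quotes): replace every straight quote outside of an HTML tag with a closing quote
    Local $sOutput = StringRegExpReplace($sInput, "'(?!([^<]+)?>)", "’")
    ; Step 2 (single quotes): a closing quote at the beginning of a word becomes an opening quote
    $sOutput = StringRegExpReplace($sOutput, "((^|\s+)(<[^>]*>)*)’", "${1}‘")
    ; Steps 1 and 2 again for double quotes
    $sOutput = StringRegExpReplace($sOutput, '"(?!([^<]+)?>)', "”")
    $sOutput = StringRegExpReplace($sOutput, "((^|\s+)(<[^>]*>)*)”", "${1}“")
    ; Fix nested quotes: repeat until no closing quote directly follows an opening quote
    While StringRegExp($sOutput, "(‘|“)(<[^>]*>)*(’|”)")
        $sOutput = StringRegExpReplace($sOutput, "((‘|“)(<[^>]*>)*)’", "${1}‘")
        $sOutput = StringRegExpReplace($sOutput, "((‘|“)(<[^>]*>)*)”", "${1}“")
    WEnd
    Return $sOutput
EndFunc   ;==>Curlify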
$FormCurly = GUICreate("Curlify", 610, 458, 192, 124)
$Original = GUICtrlCreateEdit("", 0, 0, 609, 209)
GUICtrlSetFont(-1, 9, $FW_NORMAL, 0, "Courier New")
GUICtrlSetLimit(-1, 1500000)
$Modified = GUICtrlCreateEdit("", 0, 248, 609, 209)
GUICtrlSetFont(-1, 9, $FW_NORMAL, 0, "Courier New")
GUICtrlSetLimit(-1, 1500000)
$Curly = GUICtrlCreateButton("Curly", 216, 216, 75, 25)
$Reset = GUICtrlCreateButton("Reset", 320, 216, 75, 25)
GUISetState(@SW_SHOW)
While 1
    $nMsg = GUIGetMsg()
    Switch $nMsg
        Case $Curly
            GUICtrlSetData($Modified, Curlify(GUICtrlRead($Original)))
        Case $Reset
            GUICtrlSetData($Original, "")
            GUICtrlSetData($Modified, "")
        Case $GUI_EVENT_CLOSE
            Exit
    EndSwitch
WEnd
Download source and executable: Curlify.zip
UPDATE 2017-10-17: I added two new buttons to the AutoIt version: “Copy from Clipboard” and “Paste to Clipboard.” Download the updated zip file above.
Here is the program ported to JavaScript. Just save as an HTML file and open with your web browser:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Curlify</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<script type="text/javascript">
//<![CDATA[
function Curlify() {
    var sOutput = document.getElementById('Original').value;
    sOutput = sOutput.replace(/'(?!([^<]+)?>)/mg, "\u2019");
    sOutput = sOutput.replace(/((^|\s)(<[^>]*>)*)\u2019/mg, "$1\u2018");
    sOutput = sOutput.replace(/"(?!([^<]+)?>)/mg, "\u201D");
    sOutput = sOutput.replace(/((^|\s)(<[^>]*>)*)\u201D/mg, "$1\u201C");
    // Fix nested quotes
    while (sOutput.search(/(\u2018|\u201C)(<[^>]*>)*(\u2019|\u201D)/m) >= 0) {
        sOutput = sOutput.replace(/((\u2018|\u201C)(<[^>]*>)*)\u2019/mg, "$1\u2018");
        sOutput = sOutput.replace(/((\u2018|\u201C)(<[^>]*>)*)\u201D/mg, "$1\u201C");
    }
    document.getElementById('Result').value = sOutput;
}

function resetTextAreas() {
    document.getElementById('Original').value = '';
    document.getElementById('Result').value = '';
}
//]]>
</script>
<style type="text/css">
/*<![CDATA[*/
p { text-align: center; }
/*]]>*/
</style>
</head>
<body>
<p>
<textarea cols="80" id="Original" rows="10"></textarea>
</p>
<p>
<input id="Curlify" type="button" value="Curlify" onclick="Curlify();" />
<input id="Reset" type="button" value="Reset" onclick="resetTextAreas();" />
</p>
<p>
<textarea cols="80" id="Result" rows="10"></textarea>
</p>
</body>
</html>
Notes:
- The regex for the second step uses \s+ instead of \s because of AutoIt. I don’t understand why it needs to capture all the spaces to work under certain circumstances, but it does.
- I didn’t attempt to change escaped quotes ( " ) inside of tags. It probably wouldn’t be too difficult to do by adding modified rules, but it wasn’t worth the effort to me.
- Completely empty quotes like "" don’t get converted to “” and I don’t care to find a solution for that. Besides, word processors don’t seem to convert those either.
- I didn’t include any logic in this program to avoid style sheets or other items found in an HTML file where quote characters outside of tags shouldn’t be converted. If you’re working with raw HTML, just copy and paste whatever is in between the <body> tags into the program rather than the whole file contents.
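If copying the body by hand gets tedious, the same idea could be sketched in JavaScript as below. This is not part of the program above: the function name bodyContents is made up for this illustration, and it assumes the file has at most one body element with a plain opening tag. A regex like this is a shortcut and not a real HTML parser.

```javascript
// Sketch: pull out whatever sits between <body ...> and </body>.
// bodyContents is a hypothetical helper, not part of the article's program.
function bodyContents(html) {
  // [\s\S]* matches across line breaks; the "i" flag tolerates <BODY>.
  const match = html.match(/<body[^>]*>([\s\S]*)<\/body>/i);
  // If there are no body tags, hand back the input unchanged.
  return match ? match[1] : html;
}

const page =
  `<html><head><style>p::before { content: "x"; }</style></head>` +
  `<body><p>"Hi"</p></body></html>`;
console.log(bodyContents(page));
// <p>"Hi"</p>
```

The quotes inside the style sheet never reach the converter this way, which is the whole point of the advice in the last note.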