At work I was tasked with figuring out a problem. Two weeks, a dozen programs, and a lot of frustration later, I can finally pin it on Microsoft.
First, the problem: We're processing subtitle files for various movies. These need to be in a very specific format (so they can be rendered properly) and therefore a little glitch (such as an extra character) can throw the entire process off and grind everything to a halt for weeks. Part of the process involved taking the files in Word[1], which had the individual subtitles on multiple lines separated by a blank line, and putting each subtitle on an individual line. This was pretty simple: replace a double newline (^p^p) with some symbol (we used @@@@), replace the remaining single newlines with tabs (^p -> ^t) and then replace the symbol with a single newline. That meant this:
Became this:
After that, we'd paste the new format into Excel, where someone else would work on the timestamps. Excel used the tabs to split what used to be individual lines into separate columns, which meant everything lined up neatly.
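For anyone who wants to see the three-step replacement outside of Word, here is a minimal Python sketch of the same idea. The function name and the sample text are made up for illustration (the real subtitle files looked different), and this only mirrors the find-and-replace logic; it does not reproduce anything Word does internally.

```python
PLACEHOLDER = "@@@@"  # the marker we used; any string that never occurs in the text works


def blocks_to_rows(text):
    """Turn blank-line-separated blocks into one tab-separated row per block."""
    # Treat CR/LF as a single newline, the way Word does when it loads the file.
    text = text.replace("\r\n", "\n")
    # Step 1: mark block boundaries (double newline, ^p^p) with the placeholder.
    text = text.replace("\n\n", PLACEHOLDER)
    # Step 2: every remaining single newline becomes a tab (^p -> ^t).
    text = text.replace("\n", "\t")
    # Step 3: the placeholder becomes a single newline again.
    return text.replace(PLACEHOLDER, "\n")


if __name__ == "__main__":
    # Hypothetical sample, not taken from a real subtitle file.
    sample = "1481\nline a\nline b\n\n1482\nline c"
    print(blocks_to_rows(sample))
```

The order matters: if the single newlines were replaced first, the block boundaries would turn into double tabs and could no longer be told apart from line breaks inside a block.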
Okay, everything's fine, right? Nope. We were just using massive find-and-replaces to convert the files, but sometimes it wouldn't come out right. The previous example was coming out as:
Hmm.[2] Note the double-tab before the number 1482.
Just doing a simple search for ^p^p skipped over the gap before 1482 (and 10 other places, scattered seemingly randomly throughout the file). But the individual newlines were being converted to tabs, so they were still being seen individually.
I'm going to cut out two weeks of troubleshooting and give the results now, since I don't feel like reliving those two weeks. Eventually we opened up the files in a hex editor: both the originals and several resaved versions, as raw text and as Word documents. I discovered several things:
- Word parses text files in 256-byte chunks.
- Word converts the CR/LF[3] combination (0D0Ah) to a single newline character. When resaving to text, it saves these once again as CR/LF. When saving as a Word document, it only saves a single CR (0Dh) character.
- When loading a pair of CR/LFs that straddles a 256-byte boundary (specifically, when the last LF is the first byte of a new chunk), it loads them as follows:
  - Sees CR/LF, converts to newline.
  - Sees CR, end of chunk. Puts CR in memory but not visibly in the document.
  - Sees LF, for some reason converts to a full newline.
- When saving as a text file, it discards the "half newline" and saves it fine.
- When saving as a Word document, it converts the extra character in memory to a regular newline, which it then saves in the file as three CRs in a row (0D0D0Dh).
This only happened when there were two CR/LFs in a row. Single newlines across chunk boundaries never ended up being problematic.
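Given those findings, the trouble spots in a file can be predicted from the byte offsets alone. Below is a rough Python sketch that lists every CR/LF pair whose final LF lands on the first byte of a 256-byte chunk. The function name is made up, and the 256-byte chunk size is simply what I observed in the hex editor, not anything documented, so treat the result as a hint rather than a guarantee.

```python
CHUNK = 256              # chunk size as observed; an assumption, not a documented value
PATTERN = b"\r\n\r\n"    # two CR/LFs in a row (0D0A0D0Ah)


def risky_offsets(data, chunk=CHUNK):
    """Return the start offset of each CR LF CR LF whose last LF begins a new chunk."""
    hits = []
    start = data.find(PATTERN)
    while start != -1:
        last_lf = start + 3  # offset of the final LF in the four-byte pattern
        if last_lf % chunk == 0:
            hits.append(start)
        # Step forward one byte so overlapping runs (three CR/LFs in a row) are checked too.
        start = data.find(PATTERN, start + 1)
    return hits
```

To use it, read the file in binary mode (so Python does not translate the line endings) and pass the bytes in; an empty list means no pair falls on a boundary under the 256-byte assumption.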
Our solution ended up being to bring the files into Excel first (as a ton of individual lines), replace every blank row with a single space, and paste the result into Word. When converting to single-line format, instead of searching for ^p^p we'd search for ^p[space]^p. Finally, before sending it off to be rendered, we'd remove the spaces and then immediately save as a text document. This seems to work perfectly.[4]
I'm not sure if this is still a problem in any newer version of Office. I'd submit a bug report, but that looks like another problem in itself, and I'm perfectly content with having found a way around it for now.
Notes
[1] You may be asking, "Why Word?" A large confluence of existing processes left me with little choice here, including a major problem with OpenOffice not supporting a specific Chinese character set (and the subtitle program doesn't like Unicode, which unfortunately is outside of my scope). To be more specific, OpenOffice would save into the proper character set but only read the file back as regular ASCII. We could copy and paste Chinese text into OO and it would work perfectly fine, right up until we tried to re-open the file. It couldn't even read the files it had saved itself, so I spent my energy on figuring out why Word was giving problems.
[2] It didn't truncate the line, but it did make this page really wide, so I truncated it myself. You can still see what's happening.
[3] Carriage Return/Line Feed. Old line printers were entirely controlled by regular characters embedded in the text stream, so printing to the next line meant a carriage return (which moved the printhead back to the left of the printer) and a line feed (which moved the paper). The line feed was almost always faster to do, so the CR was sent first to give the printhead time to return to the left side of the paper.
[4] This isn't the most efficient solution either. Right now I'm working on a program that'll just insert the spaces in a single shot (replacing 0D0A0D0Ah with 0D0A200D0Ah). Also, for reasons I was never able to figure out, if you replaced all ^p with any other character (say, %) and then searched for two immediate copies of that character (%%), the search would always find every pair: the "phantom" line break disappeared. But it would come back if you just replaced the % with ^p immediately afterward. So the order is: replace ^p with %, then %% with ^p^p, then % with ^p. This ends up being fairly quick, although working off a network drive still makes it really slow at times for some reason.
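The byte replacement mentioned in the last note (0D0A0D0Ah to 0D0A200D0Ah) could look roughly like this in Python. This is only a sketch of the idea, not the program I'm actually writing; the function name is made up. The loop is there because a single pass leaves one pair untouched when there are three or more CR/LFs in a row.

```python
DOUBLE = b"\r\n\r\n"   # 0D0A0D0Ah
PADDED = b"\r\n \r\n"  # 0D0A200D0Ah


def pad_blank_lines(data):
    """Put a single space on every blank line so no two CR/LFs are adjacent."""
    # bytes.replace does not rescan what it just wrote, so a run of three
    # CR/LFs needs a second pass; repeat until no adjacent pair is left.
    while DOUBLE in data:
        data = data.replace(DOUBLE, PADDED)
    return data
```

The file has to be read and written in binary mode (`"rb"` and `"wb"`), otherwise Python's own newline translation would change the CR/LF bytes before the replacement sees them.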