hDot Image and File Hosting

Semi-major bug in Office XP
Posted on 23:51:42 10/20/08 by Xerol
Edited 14 times, Last edited 00:13:59 10/21/08

At work I was tasked with figuring out a problem. Two weeks, a dozen programs, and a lot of frustration later, I can finally pin it on Microsoft.


First, the problem: We're processing subtitle files for various movies. These need to be in a very specific format (so they can be rendered properly) and therefore a little glitch (such as an extra character) can throw the entire process off and grind everything to a halt for weeks. Part of the process involved taking the files in Word[1], which had the individual subtitles on multiple lines separated by a blank line, and putting each subtitle on an individual line. This was pretty simple: replace a double newline (^p^p) with some symbol (we used @@@@), replace the remaining single newlines with tabs (^p -> ^t) and then replace the symbol with a single newline. That meant this:

1481
01:22:25,40 --> 01:22:27,8
- Make me look good out there.
- OK.

1482
01:22:27,9 --> 01:22:31,178
Marymount, you sons of bitches [insult].
You no-good sons of bitches.

1483
01:22:31,179 --> 01:22:32,947
- You nervous?
- Yes.

Became this:

1481	01:22:25,40 --> 01:22:27,8	- Make me look good out there.	- OK.
1482	01:22:27,9 --> 01:22:31,178	Marymount, you sons of bitches [insult].	You no-good sons of bitches.
1483	01:22:31,179 --> 01:22:32,947	- You nervous?	- Yes.

After that we'd paste the new format into Excel, where someone else would do something with the timestamps. Excel used the tabs to split what used to be individual lines into separate columns, which meant everything lined up neatly.
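For illustration, the three replacement passes can be sketched in Python. This is not how the work was actually done (that was all Word's find-and-replace); the function name `flatten_subtitles` is made up for the example, and it assumes the text has already been decoded into a string.

```python
def flatten_subtitles(text: str) -> str:
    """Put each subtitle block on one line, with its original lines tab-separated.

    Mirrors the three Word passes: ^p^p -> @@@@, ^p -> ^t, @@@@ -> ^p.
    Assumes "@@@@" never occurs in the subtitle text itself.
    """
    placeholder = "@@@@"
    text = text.replace("\r\n", "\n")         # treat CR/LF as one newline
    text = text.replace("\n\n", placeholder)  # blank-line separators -> placeholder
    text = text.replace("\n", "\t")           # remaining newlines -> tabs
    return text.replace(placeholder, "\n")    # placeholder -> single newline


sample = (
    "1481\r\n01:22:25,40 --> 01:22:27,8\r\n- Make me look good out there.\r\n- OK."
    "\r\n\r\n"
    "1483\r\n01:22:31,179 --> 01:22:32,947\r\n- You nervous?\r\n- Yes."
)
print(flatten_subtitles(sample))
```

Each printed line is one subtitle, ready to paste into a spreadsheet with one field per column.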

Okay, everything's fine, right? Nope. We were just using massive find-and-replaces to convert the files, but sometimes it wouldn't come out right. The previous example was coming out as:

1481	01:22:25,40 --> 01:22:27,8	- Make me look good out there.	- OK.		1482	01:22:27,9 --> 01:22:31,178	Marymount, [...]
1483	01:22:31,179 --> 01:22:32,947	- You nervous?	- Yes.

Hmm.[2] Note the double-tab before the number 1482.

Just doing a simple search for ^p^p skipped over the gap before 1482 (and 10 other places, scattered seemingly randomly throughout the file). But the individual newlines were being converted to tabs, so they were still being seen individually.
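A missed separator leaves a telltale double tab in the converted text, so the bad rows are easy to flag after the fact. A minimal sketch, assuming the converted text is available as a string and that no subtitle legitimately contains an empty field (`find_missed_breaks` is a hypothetical helper name):

```python
def find_missed_breaks(flattened: str) -> list:
    """Return 1-based line numbers that contain two tabs in a row.

    A double tab means two consecutive newlines were each turned into a tab
    instead of being treated as one blank-line separator.
    """
    lines = flattened.split("\n")
    return [number for number, line in enumerate(lines, start=1) if "\t\t" in line]


converted = "1481\t- OK.\t\t1482\tMarymount\n1483\t- Yes."
print(find_missed_breaks(converted))
```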

I'm going to cut out two weeks of troubleshooting and give the results now, since I don't feel like reliving those two weeks. Eventually we opened up the files in a hex editor, both the originals and several resaved versions, as both raw text and Word documents. I discovered several things:

  • Word parses text files in 256-byte chunks.
  • Word converts the CR/LF[3] combination (0D0Ah) to a single newline character. When resaving to text, it saves these once again as CR/LF. When saving as a Word document, it only saves a single CR (0Dh) character.
  • When loading a pair of CR/LFs across a 256-byte boundary (specifically, when the last LF is the first byte of a new chunk), it loads it as follows:
    1. Sees CR/LF, converts to newline.
    2. Sees CR, end of chunk. Puts CR in memory but not visibly in document.
    3. Sees LF, for some reason converts to a full newline.
    End result: Two newlines with an invisible character between.
  • When saving as a text file, it discards the "half newline" and saves it fine.
  • When saving as a Word document, it converts the extra character in memory to a regular newline, which it then saves in the file as three CRs in a row (0D0D0Dh).

This only happened when there were two CR/LFs in a row. Single newlines across chunk boundaries never ended up being problematic.
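Going by those observations, the trouble spots in a file can be predicted before Word ever sees it: look for a CR/LF CR/LF sequence whose final LF lands on the first byte of a 256-byte chunk. The sketch below assumes the 256-byte chunk size and the boundary condition exactly as described above; it is a guess at the trigger based on the hex dumps, not a description of Word's actual code, and `risky_offsets` is a made-up name.

```python
CHUNK = 256  # chunk size inferred from the hex-editor comparison, not documented


def risky_offsets(data: bytes, chunk: int = CHUNK) -> list:
    """Return offsets of CR/LF CR/LF pairs whose last LF starts a new chunk."""
    pattern = b"\r\n\r\n"
    hits = []
    i = data.find(pattern)
    while i != -1:
        # The final LF is at offset i + 3; it starts a chunk when that offset
        # is a multiple of the chunk size.
        if (i + 3) % chunk == 0:
            hits.append(i)
        i = data.find(pattern, i + 1)  # step by one so overlapping runs are seen
    return hits


data = b"a" * 253 + b"\r\n\r\n" + b"next subtitle"
print(risky_offsets(data))
```

Run over the raw text file (opened in binary mode), the returned offsets should correspond to the places where the ^p^p search skips a gap, if the chunk theory holds.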

Our solution ended up being to bring the files into Excel first (as a ton of individual lines), replace every blank row with a single space, and paste the result into Word. When converting to single-line format, instead of searching for ^p^p we'd search for ^p[space]^p. Finally, before sending it off to be rendered, we'd remove the spaces and then immediately save as a text document. This seems to work perfectly.[4]

I'm not sure if this is still a problem in any newer version of Office. I'd submit a bug report but that just looks like another problem and I'm perfectly content with just finding out a way around it for now.

Notes

  1. You may be asking, "Why Word?" A large confluence of existing processes left me with little choice here, including a major problem with OpenOffice not supporting a specific Chinese character set (and the subtitle program doesn't like Unicode, which unfortunately is outside of my scope). To be more specific, it would save into the proper character set but only read the file as regular ASCII. Moreover, we could copy & paste Chinese text into OO and it would work perfectly fine up until we tried to re-open the file. It couldn't even read the files it saved, so I spent my energy on figuring out why Word was giving problems.
  2. It didn't truncate the line, but it did make this page really wide, so I truncated it myself. You can still see what's happening.
  3. Carriage Return/Line Feed. Old line printers were entirely controlled by regular characters embedded in the text stream, so printing to the next line meant a carriage return (which moved the printhead back to the left of the printer) and a line feed (which moved the paper). The linefeed was almost always faster to do, so the CR was sent first to give the printhead time to return to the left side of the paper.
  4. This isn't entirely the most efficient solution either. Right now I'm working on a program that'll just insert the spaces in a single shot (replacing 0D0A0D0Ah with 0D0A200D0Ah). Also, for reasons I never was able to figure out, if you simply replaced all ^p with any other character (say, %) and then searched for two immediate copies of that character (%%) it would always come up - the "phantom" line break disappeared. But it would come back if you just replaced it with a ^p immediately afterward. This ends up being fairly quick, although working off a network drive still makes this go really slow at times for some reason. (Basically, replace ^p with %, %% with ^p^p, and % with ^p.)
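The single-shot replacement mentioned in note 4 (0D0A0D0Ah to 0D0A200D0Ah) could look something like the sketch below. It is not the program referred to in the note, just one way to do it. It works on raw bytes, so it assumes an encoding in which the bytes 0Dh, 0Ah and 20h only ever mean CR, LF and space (true for ASCII and the common double-byte Chinese character sets, not for UTF-16). The function names are invented for the example.

```python
import re


def pad_blank_lines(data: bytes) -> bytes:
    """Insert a space into every blank line: CR/LF CR/LF -> CR/LF space CR/LF.

    The lookahead leaves the second CR/LF unconsumed, so runs of several
    blank lines are all padded in one pass.
    """
    return re.sub(rb"\r\n(?=\r\n)", b"\r\n ", data)


def unpad_blank_lines(data: bytes) -> bytes:
    """Remove the padding again before the file goes off to be rendered.

    Note: this also strips any line that originally held only one space.
    """
    return re.sub(rb"(?<=\r\n) (?=\r\n)", b"", data)


original = b"1481\r\n- OK.\r\n\r\n1482\r\nMarymount"
padded = pad_blank_lines(original)
print(padded)
print(unpad_blank_lines(padded) == original)
```

A plain `bytes.replace` of the four-byte sequence would miss every second separator in a run of blank lines, which is why the sketch uses a lookahead instead.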

Comments

Posted on 01:56:46 11/28/07 by Xerol
By the way, this isn't exclusive to Office XP. I've confirmed it in Word 2000 and 2003. If anyone wants to try it out in 2007, let me know and I'll send you some files to test with.
Posted on 01:57:26 11/28/07 by Guest
This is one reason that made me flee the computing scene...made me move more towards math, lol
Posted on 10:15:10 12/03/07 by Xerol
Well, I've also managed to confirm that it's still an issue in Word 2007. So, here come the emails to Microsoft.

Code (c) 2006-9 Xerol

Contact: xerol@xerol.org - Put 'hdot' in subject line
