Thursday, August 19, 2010

A Financial Translator Dabbles in Machine Translation: Can Google Translator Toolkit Replace a Licensed CAT tool?

After a spell in which I reinstalled Windows several times in the space of a few months because of an array of hardware/software problems and the purchase of a new PC, I basically decided not to trust computer hard drives with long-term storage. I now treat my main PC’s hard drive as an empty shell for temporary storage of non-essential items and some of the most recent files I use. The heavy lifting of my real storage is done by my Carbonite account (which backs up my data in the cloud, for a fee after 2GB) and external hard drives. Redundant? Yes, but I no longer worry about my backup failing.
Now, every time I encounter intractable PC problems (intractable, at least, for a user with middling computer literacy), I simply reformat my HD and reinstall my OS, which, of course, erases any data on it. This, however, creates the inconvenience of also erasing all my programs and settings, forcing me to spend an entire day re-teaching my PC to think like me. This is particularly awkward with my CAT tools, which feature relatively roundabout ways of reinstalling licenses in order to avoid piracy. Wordfast forces you to reinstall, obtain an install number, visit their website, punch in an account name and password, and “re-license” your product by obtaining a new license number for your install number. Trados’s softkey procedure similarly involves logging in to their support site using an account password, downloading a new .txt document, saving it on your hard drive and then navigating to it from Workbench. The program then identifies the file as a new license and unlocks the program for use with large TMs.
Needless to say, these procedures, while simple, are never hassle-free. There are always hiccups along the way and several restarts are necessary before the software starts to work.
This led me to wonder about CAT tools in the cloud-computing sky. After all, cloud-based software is the wave of the future. Eventually, our hard drives will contain hardly any software. Our PCs will simply be a hub to connect with all the programs we need, lodged in the servers of software providers. Many of us use e-mail that way, never downloading messages to our PCs but rather reading and storing messages via our browsers.
To avoid the mess of reinstalling software on the latest reinstallation of my Windows environment, I decided to try out Google Translator Toolkit (GTT). I had heard of Lingotek a few years back, which featured sharing of large TMs among thousands of translators, but they have apparently since adopted a B2B business model and only provide maintenance of the tool to people who originally signed up for it.
I was familiar with Google’s machine translation through the plug-in in Wordfast, and I knew the expanded Toolkit featured some limited CAT capabilities. I decided to give it a spin as a CAT tool to see whether it could supplant paid programs that require fussy re-licensing procedures. As many know, GTT combines certain aspects of machine translation with CAT capabilities. You can upload a small translation memory (TM) of less than 50 MB. The downside is that this material automatically becomes the property of Google Inc., with the attendant confidentiality problems. The work of the users uploading TMs and processing sentences in the Google environment is then used by the company to enrich the quality of its own machine translations (MT).
The conclusion, sadly, is that GTT is not ready for prime time as a replacement for CAT tools. The main drawback: the way it scans tables and visual elements of Word docs. Instead of leaving visual elements in the same state as in the original Word document, it scans them partially or wholly and then translates elements that do not need translation (such as letterheads), forcing the user to backtrack over the document at the end of the project and manually insert translations, reinsert originals, make little tweaks here and there, copy-paste, paste-copy, etc.
The key word here is “manual”. Everyone knows that anything that has to be done manually on a computer multiplies the number of mistakes. So that is a strike against the use of GTT as a replacement for heftier CATs (or TEnTs, as they are also known).
Another major drawback: the tool machine-translates the entire text up front, whereas CAT tools usually work segment by segment, one sentence at a time. This can create a lot of headaches. GTT also lacks an “Insert original” option (a shortcut along the lines of, say, CTRL+C or CTRL+O), which forces the professional translator to cut and paste (a lot, in some cases).
Another (rather bizarre) aspect of Google Translator Toolkit: the Help info in other languages is… well… (how shall I put this?)… er… translated by Google Translate. Which means two out of three sentences are complete gibberish. Of course, this makes sense in some sort of twisted Silicon-Valley way only a Sheldonian computer engineer would understand. “If we’re designing a machine translation tool, wouldn’t it be hypocritical to have a human editor clean up the Help files spit out by that very MT tool?” Well, not if you want your users to actually embrace the tool. This kind of laudable intellectual honesty mixed with utter idiocy is probably a major reason why more translators don’t embrace projects such as Google’s GTT.
Finally, I hesitate to mention this… After all, it is a free tool. You get what you pay for, right? I (rather naively) sent a message with suggestions for improving the tool to the contact address of the team devoted to maintaining GTT and promptly got a “Delivery Status Notification (Failure)” reply from the server. The developers’ contact addresses no longer exist on the Google servers (!), which may mean that any further improvement or expansion of GTT has been either called off or postponed indefinitely. Hardly encouraging.
In any case, TEnT developers do not need to lose any sleep… at least for now. I guess the option for me would be to try out open source CAT tools, since by definition they do not require messy re-licensing. The problem is I am not familiar with any, so self-training from ground zero would be a requisite. Furthermore, I do not even know if they require the use of Linux, with which I am totally unfamiliar. We shall see…


Luke said...

Thanks for the overview. Very interesting about the dev email not working any more. I have also tried out GTT, with similar results. I do genuinely think that Google have the resources to change CAT usage. But I suppose that change is yet to come.

OmegaT is heartily recommended. It works on all platforms, although you do need to convert all files to the Open Office "open format" before working on them. This only very rarely causes formatting issues and normally very simple ones.

Apart from that you get a great product that creates TMX files, keeps your projects in order and is blazing fast to use.

As a final note, I also believe I saw in a recent blog post that Corinne McKay uses OmegaT. Since she chairs the ATA PR Committee, I'd say that's a solid green light to anyone interested in trying it!

Iwan Davies said...

The OmegaT CAT tool is open source and as it's written in Java, it runs on Windows, Linux and Mac. Well supported too.

If you're reasonably adept at learning new software, you can be up and running with OmegaT within a couple of weeks, and it is compatible with TMX so you should be able to access existing translation memories in that format.
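
For readers who want to see what a TMX file actually holds before loading it into a new tool, here is a minimal sketch in Python. It is not part of OmegaT or any CAT tool; the function name read_tmx_pairs and the sample memory are made up for illustration, and it assumes a simple TMX 1.4 file in which each translation unit carries one variant per language, marked with xml:lang.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-entry translation memory, for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0"
          segtype="sentence" o-tmf="example" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Net income</seg></tuv>
      <tuv xml:lang="es"><seg>Beneficio neto</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Balance sheet</seg></tuv>
      <tuv xml:lang="es"><seg>Balance general</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs found in a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments[lang] = "".join(seg.itertext())
        # Keep only units that have both the source and target language.
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


if __name__ == "__main__":
    for source, target in read_tmx_pairs(SAMPLE_TMX, "en", "es"):
        print(source, "=>", target)
```

Real exports from commercial tools may use regional codes such as en-US or carry inline formatting tags, so treat this as a starting point for inspecting a file rather than a converter.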