-----Original Message-----
From: Weiss, Paul
Sent: Friday, April 25, 2003 11:28 AM
To: Bickford, Fred; Faria, Mike; chat_cc; tech-cc-all
Subject: RE: SOLUTION:182359002(DRAFT) - WINDOWS: Can ClearCase be configured not to use the XML and HTML diffmerge ?

Folks, there are trade-offs to changing the default magic file entries. I asked our expert in this area for input and have included it below. Please make sure our customers understand the limitations before doing this. I am concerned about customers doing this and then generating calls when they do not follow the rules/limits.

*********

As to tradeoffs and benefits:

There are several benefits to the XML diff merge tool:

* Diffs are independent of XML input formatting
  - XML Diff Merge has no line length restrictions, and the diffs will appear the same whether the file is nicely indented, all on a single line, or in any state in between. Most XML is produced by machine. It is often not formatted to be understood by humans (i.e., not indented in a semantically meaningful way). We have learned from experience that many XML files have either VERY long lines (thousands of characters long), or no line termination at all (e.g., XML generated by MSXML).
  - Text diff merge has a limitation of around 3000 characters as the max line length. Text_file containers have a limitation of 8000 characters as the max line length.
  - XML Diff Merge can therefore process some files that text diff merge cannot.
  - XML Diff Merge parses the XML into a tree, so the result is always "pretty-printed", while maintaining all whitespace exactly as in the original files (e.g., during a merge).

* XML Diff Merge understands different XML encodings, especially UTF-16
  - If your XML files use UTF-16, you can't use text_file containers, and you can't use the text diff merge tool. UTF-16 looks like "binary" to these tools.
  - XML Diff Merge can therefore process some files that text diff merge cannot.
  - XML Diff Merge understands and normalizes several XML encodings. You can even compare a version encoded as UTF-8 vs. one in ASCII vs. UTF-16, and get only "real" differences, not differences caused by the same information being encoded in multiple ways.
  - When merging, XML Diff Merge can even convert from one encoding to another. For example, if your inputs are UTF-16, you can write the output in UTF-8, or in any other supported encoding.

* XML Diff Merge "looks thru the eyes of the XML parser".
  - XML Diff Merge parses the XML and breaks it down into its syntactic components (e.g., elements, attributes, attribute values). It can, for example, auto-merge differences in attributes, even if those attributes are formatted all on one line. Text diff merge would simply report a conflict.
  - XML Diff Merge also resolves such XML-isms as character references. For example, a "copyright" character may be placed in a file directly (©) or by character reference (&#169; or &#xA9;). When placed directly, the actual sequence of bytes used may be different, according to the encoding used. In XML Diff Merge, you'll always see the © (assuming your font has that glyph, of course), and you don't have to worry about things like encodings and references.
  - XML Diff Merge is very Unicode-aware. It is capable of displaying the full 16-bit character space directly, on any system. Most MBCS/I18N apps can display, say, Japanese text ONLY on a machine running *Japanese* Windows. XML Diff Merge can display the Japanese text on an English system (using, say, the Arial Unicode MS font, which has the Japanese glyphs).
  - Text diff merge MAY be able to display the Japanese characters if the file contains a BOM. XML Diff Merge will work for any supported XML encoding.
  - XML Diff Merge can therefore give a correct display for some files that text diff merge cannot.

* The "tree view" of XML Diff Merge has the potential for tree-editing
  - XML is structured as a tree. If you wanted to edit that tree, adding elements or moving them around, for example, you would need to be careful to get both the starting and ending tags of each element, or you would wind up with an invalid XML file. With a tree view, the user could edit that tree without worrying about making such errors, no matter how badly the source file was formatted.
  - I say "potential", however, since these tree-edit operations are mostly not yet implemented. But this was part of the idea behind going with a tree view.
  - The tree view also allows expand/collapse of different parts of the tree, and this is often useful for understanding the structure and the diffs. This part *is* implemented today.

*********

I suggest we be careful how we word a technote about the current set of limitations/tradeoffs in a solution.

-Paul

**********

Unfortunately, there are also several serious drawbacks:

* The XML Diff Merge algorithm has several serious, fundamental limitations and shortcomings. The current algorithm is often fooled by whitespace differences, for example, and this often makes merging difficult. There are, in fact, a number of scenarios that cause poor diff and merge results.

* The algorithm can take a long time to run. There may also be one or more bugs that cause the compute time to be *very* much longer than necessary.

* The algorithm can consume a lot of memory. This fact, coupled with the runtime problem above, imposes a practical upper limit of about 1 MB on the size of XML files that can be diff/merged.

* Algorithm problems can be corrected over time, given resources, etc.

* However, since it parses the XML file, XML Diff Merge cannot operate if the XML has a syntax error, or is in an unsupported encoding, or contains some XML structure that requires what is called a "validating" parse (i.e., macro expansion). In such cases, the file cannot be interpreted as XML, so the user must resort to a "lower level" of diff, such as line-oriented text diff. This problem can't really be corrected, except by making it easier to "fall back" to a text diff. The current tool tries hard to make this case work, and to make it easy, but of course it always tries the XML way first. If a user knows *a priori* that the XML way won't work, it would be better to allow them to go straight to the text method.

Hope the above information helps

Thanks
Paul
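
For anyone who wants to see the "try XML first, fall back to text" idea in concrete terms, here is a minimal sketch in Python. It is only an illustration and is not how XML Diff Merge is implemented: the function names `flatten` and `diff_with_fallback` and the `force_text` switch are made up for this example, and the comparison is a simple flattened-tree diff that ignores text following a child element.

```python
import difflib
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Parse XML and return one string per element, in document order.

    Attributes are sorted and surrounding whitespace in element text is
    dropped, so the result does not depend on source line breaks or indents.
    Text that follows a child element (the "tail") is ignored in this sketch.
    """
    root = ET.fromstring(xml_text)  # raises ET.ParseError on a syntax error
    rows = []

    def walk(elem, depth):
        parts = [elem.tag]
        parts += [f'{k}="{v}"' for k, v in sorted(elem.attrib.items())]
        text = (elem.text or "").strip()
        line = "  " * depth + "<" + " ".join(parts) + ">"
        rows.append(line + (" " + text if text else ""))
        for child in elem:
            walk(child, depth + 1)

    walk(root, 0)
    return rows


def diff_with_fallback(old, new, force_text=False):
    """Return (mode, diff_lines), where mode is "xml" or "text".

    The XML comparison is tried first unless force_text is set (hypothetical
    switch for a user who already knows the XML way will not work).
    """
    if not force_text:
        try:
            a, b = flatten(old), flatten(new)
            return "xml", list(difflib.unified_diff(a, b, lineterm=""))
        except ET.ParseError:
            pass  # not well-formed: fall through to the line-oriented diff
    a, b = old.splitlines(), new.splitlines()
    return "text", list(difflib.unified_diff(a, b, lineterm=""))
```

In this sketch, two inputs that differ only in indentation, attribute order, or the form of a character reference produce an empty "xml" diff, while an input with a syntax error drops to the "text" mode automatically.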