1. CP/M Oddities
The DOS files were written in standard Intel assembly language syntax, but some CP/M files used a variation I call DRI assembler that was created at DRI, while other files were written in the PL/M programming language developed at DRI. In DRI assembler, I found that an exclamation point could be used to separate multiple instructions on a single line. I eventually found an assembler user’s guide from DRI that confirmed this syntax.
1.1. Cleaning the Code
For CP/M version 1.3, the code consisted of low-resolution PDF scans of dot matrix printouts of source code. I performed a number of processes to recover the source code from the scans as best as could be done. These steps are described below.
1.2. Remove Things That Are Not Source Code
There are stamps on each page indicating that the code was copyrighted by Digital Research in 1976. Each stamp needed to be cut out from the document. Where a stamp was on top of code, and cutting out the stamp removed source code text, the underlying text was rebuilt using characters copied from other sections of code to exactly replace what could be seen under the stamp. There were also memory locations and machine code hex on the left margins; these scans were obviously printouts of assembler listings showing the generated machine code and where the code had been located in memory after assembly. I manually cut out line numbers on the left margins and memory maps that were not source code.
Also, the scans had dots and smudges that were due to scans of multi-generation photocopies, ink spraying from the printer, or dirt from handling the pages over the years. I went through each page and digitally erased all dots and smudges to improve the OCR reliability.
Some of the code ran off the printed page. Usually these were comments, which did not affect the functionality of the code but might have contained potential clues to copying. Unfortunately, without other printouts or the original code, this missing code could not be replaced.
1.3. Optical Character Recognition (OCR)
I used the ABBYY FineReader program to perform OCR scanning on each page of each PDF of source code. Several passes of manual corrections were needed where the OCR did not produce good results, usually because the printouts were not clear.
1.4. Fix Printer Glitches
There were a number of errors that were introduced by problems with the printer that was used to print the pages. These took a while to figure out because, while some of the glitches were obvious, others were masquerading as strange code syntax. One easy glitch to figure out was in file BDOS.plm, where I found the following gibberish at lines 193-194:
Examining the code before and after the gibberish, I could discern a simple pattern and determined the correct code and substituted it for the gibberish:
END SELSEC;
READ$DISK: PROCEDURE BYTE;
Another problem with the printer caused some words to occasionally print with a duplicate letter at the end, like SCANN, OPENN, and MOVV. I discovered this when I noticed that these variables could not be found elsewhere in the code or these instructions were not valid DRI assembly instructions, but were correct without the extra letter on the end. When I found these variables and instructions, I deleted the extra letter.
In the PL/M files, there were extra letters “N” and “D” at the beginning of some lines like NDECLARE and DDECLARE. These are not valid PL/M statements, though DECLARE is a valid PL/M statement. I figured out this printer anomaly when I saw a procedure called NDISKMON that ended with the statement END DISKMON. So if I found a PL/M instruction or identifier that would only be valid without that initial letter, I removed the initial letter.
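The two printer-glitch rules described above lend themselves to a mechanical first pass before checking each case by hand. The following Python sketch is illustrative only: the function name `suggest_fix` and the small `known` vocabulary are hypothetical, and the corrections described in this paper were made manually against the scans.

```python
def suggest_fix(token, known):
    """Return a corrected token if a printer glitch is the likely cause, else None.

    `known` is a set of identifiers and keywords already confirmed as valid
    (a hypothetical input; in practice it would come from the cleaned code).
    """
    if len(token) < 2 or token in known:
        return None  # too short to judge, or already valid
    # Glitch 1: last letter printed twice (SCANN -> SCAN, MOVV -> MOV).
    if len(token) >= 3 and token[-1] == token[-2] and token[:-1] in known:
        return token[:-1]
    # Glitch 2: stray leading "N" or "D" (NDECLARE -> DECLARE).
    if token[0] in "ND" and token[1:] in known:
        return token[1:]
    return None


known = {"SCAN", "OPEN", "MOV", "DECLARE", "DISKMON"}
for word in ["SCANN", "MOVV", "NDECLARE", "DDECLARE", "NDISKMON", "OPEN"]:
    print(word, "->", suggest_fix(word, known))
```

A suggestion from such a pass is only a candidate; as with the manual process, each one would still need to be confirmed against the surrounding code.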
1.5. Run CodeMatch of Each File against Itself
I found that by running the CodeMatch function of CodeSuite to compare the files of a particular language (assembler or PL/M) against themselves, I could find additional problems with the OCR scans. Each time I found a problem this way, I would correct it and rerun CodeMatch. I continued this process until I could find no more errors. The types of problems I found are described below.
1.5.1. Comments as Instructions
CodeMatch listed some comments as instructions. This meant that there was a missing comment delimiter that needed to be added back in.
1.5.2. Instructions as Comments
CodeMatch listed some instructions as comments. This also meant that there was a missing comment delimiter that needed to be added back in, though there were cases where an instruction was commented out, so each case needed to be examined individually to determine whether it was correct or whether it was an OCR problem to be corrected.
1.5.3. Strange Identifiers
Some identifiers seemed wrong because, for example, they looked like common words that were not spelled correctly. I examined these identifiers in the original scans, determined the correct identifier, and fixed it in the code.
1.5.4. Incorrect OCR
I searched through the files for the letter “O” within numbers and changed it to the numeral “0”. I checked the original scan before making the correction.
I also searched for the numeral “0” within identifier names. If it was at the end of the identifier, it was probably correct. If it was part of a word then it should probably be the letter “O”. I checked the original scan before making the correction.
I also searched for the letter “W” and changed it to letter “U” if necessary. This could be seen in words where the word was nonsensical with a “W” but made sense with a “U”. I checked the original scan before making the correction.
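The searches for "O" inside numbers and "0" inside identifiers can be approximated with regular expressions. The sketch below is an assumption about how such a search might be automated, not the procedure actually used; the patterns and the name `flag_suspects` are hypothetical, and every flagged token would still need to be checked against the original scan.

```python
import re

# A token that starts with a digit and contains the letter "O": probably a
# number in which a zero was misread (for example 1O or 1O24H).
O_IN_NUMBER = re.compile(r"\b\d[0-9A-F]*O[0-9A-FO]*H?\b")
# A zero with letters on both sides: probably a misread letter "O".
# A zero at the end of an identifier (BUF0) is deliberately not flagged.
ZERO_IN_WORD = re.compile(r"\b[A-Za-z]+0[A-Za-z]\w*\b")


def flag_suspects(lines):
    """Yield (line_number, kind, token) for tokens worth checking against the scan."""
    for number, text in enumerate(lines, start=1):
        for match in O_IN_NUMBER.finditer(text):
            yield number, "O-in-number", match.group()
        for match in ZERO_IN_WORD.finditer(text):
            yield number, "0-in-identifier", match.group()


sample = ["MVI A,1O", "L0OP: JMP L0OP", "BUF0: DS 128"]
for hit in flag_suspects(sample):
    print(hit)
```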
1.5.5. Reformatted Code
To make the assembly code more readable, I used the program asmbc.exe from the website 8051 assembly formatter to beautify it. Even though this program is intended for use on Intel 8051 assembly code, it works well on Intel x86 assembly code as well, which I manually checked by using a diff between the original code and the beautified code. This formatter program simply lined up labels, instructions, and comments by adding or subtracting whitespace. I also made edits by hand, but other than whitespace and the changes listed above, I did not make further changes to the code.
To make the PL/M code more readable, I created an AWK script to format the code. The AWK script, and a batch file to run it on a Windows machine, are given in the tools folder that can be downloaded from the link at the end of this paper.
2. Code Comparisons
I used the CodeSuite® tool from my software company Software Analysis and Forensic Engineering and followed the procedures that I have written about in my textbook on software forensics and that have been used at my company Zeidman Consulting in over 80 software copyright litigation cases. The purpose of this procedure is to find all of the correlation between the two sets of code and then eliminate the correlation that can be explained by reasons other than copying: commonly used identifier names, common algorithms, common author, automatically generated code, and third-party code. Any correlation that cannot be explained by one of these five reasons must have been copied. It is important to remember that any of these five kinds of correlation could also have been due to copying, but in those cases copying cannot be reasonably proven. If some correlation can only be reasonably explained by copying, then that is proof of copying, and it makes sense to go back and look at other correlation that had previously been filtered out, to determine the extent of the copying.
The steps in the procedure are:
1) Use the FileIdentify™ function of CodeSuite to search the source code directories for source code files and determine the programming languages used.
2) Load the source tree into the Understand tool from Scientific Toolworks and review for errors and warnings to determine that the code is not corrupted and to determine whether files and functions are missing.
3) Perform global searches within the source code files for the following terms:
a) The string copyright.
b) Company names.
c) Author names and initials.
d) Any relevant terms.
4) Run the CodeMatch® function of CodeSuite on all programming language files; export the resulting CodeMatch databases to HTML reports and inspect the most highly correlated file pairs.
5) Run the SourceDetective® function of CodeSuite on the CodeMatch databases to determine the frequency of matching program elements (identifiers, statements, comments, and strings) on the Internet.
6) Produce search spreadsheets showing the number of times matching program elements can be found on the Internet.
7) Filter out the matching program elements with high search counts. Focus on matches with low search count.
8) Filter out any program elements with low but unimportant hit count matches.
9) Inspect the most highly correlated file pairs.
10) Create a spreadsheet of partially matching identifiers to find any unusual ones and examine the surrounding code.
11) Run the CodeCross® function of CodeSuite; export the resulting CodeCross databases to HTML reports and inspect the most highly correlated file pairs.
12) Run the SourceDetective function of CodeSuite on the CodeCross databases to determine the frequency of cross-matching program elements (statements, comments, and strings) on the Internet.
13) Produce search spreadsheets showing the number of times cross-matching program elements can be found on the Internet.
14) Filter out the cross-matching program elements with high search counts. Focus on matches with low search count.
15) Filter out any cross-matching program elements with low but unimportant hit count matches.
16) Inspect the most highly correlated file pairs.
17) Draw conclusions.
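The filtering steps in the procedure above amount to ranking matching elements by how often they appear on the Internet and keeping the rare ones for inspection. The following Python sketch shows that idea only; it is not how SourceDetective works internally, and the function name, the threshold, and the sample hit counts are all hypothetical values chosen for illustration.

```python
def filter_by_hits(matches, max_hits):
    """Keep matching elements whose hit count is at or below max_hits, rarest first.

    `matches` maps a matching program element to the number of times it was
    found by an Internet search (illustrative numbers, not real results).
    """
    rare = [(element, hits) for element, hits in matches.items() if hits <= max_hits]
    return sorted(rare, key=lambda pair: pair[1])


# Invented sample data: element names and counts are placeholders.
sample = {"common_label": 250000, "rare_label": 3, "odd_comment": 40}
for element, hits in filter_by_hits(sample, max_hits=100):
    print(element, hits)
```

The elements that survive this filter are the ones whose surrounding code deserves manual inspection; a low hit count alone does not establish copying.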
2.1. Run FileIdentify
FileIdentify is a function of the CodeSuite program that identifies the number of file types in a folder and reports which programming language is typically associated with each file type. There is nothing to prevent someone from mislabeling a file as a type containing code in one programming language when it really contains code in a different programming language, and FileIdentify does not actually do a semantic analysis to determine the programming language, but in this case, opening the files revealed that the file types are indeed correct. The file types are listed in Table 1 for each version of the CP/M operating system to be compared. The file types are listed in Table 2 for each version of the DOS operating system to be compared.
Table 1. CP/M files.
*These files are not source code as determined by their extensions and opening them up. **These files are assembly code for a Z80-based CP/M simulator developed in 2007, as determined by the code and the comments in the files.
2.2. Run Understand
Understand is a program from Scientific Toolworks that analyzes source code and reports the relationships between functions and files. Understand reported 114 errors in the PL/M code, which seems to be because this code conforms to an older version of PL/M that Understand does not fully recognize. Understand cannot analyze assembly code, so it could not be used on the assembly files.
2.3. Perform Global Searches
I searched the source code files for terms that could be clues to copying.
2.3.1. Search for the String “Copyright”
The CP/M files all had copyright notices for Digital Research and Gary Kildall. The DOS files had copyright notices for Seattle Computer Products, IBM, TeleVideo Systems, or Microsoft.
The Seattle Computer Products copyright notice is found in the file ASM.ASM in the DOS 1.1 source code. The exact code is:
DB 13,10,"Copyright 1979-1983 by Seattle Computer Products, Inc."
Seattle Computer Products was the hardware company that hired Tim Paterson to write an operating system, called QDOS, that was eventually purchased by Microsoft and turned into DOS, so it makes sense for this notice to be in the code.
The TeleVideo copyright notice is found in the file UINIT.ASM in the DOS 2.0 source code. The exact code is:
IF IBM
HEADER DB 13,10,13,10,"TeleVideo Personal Computer DOS Vers. 2.11",13,10
       DB "(C) Copyright TeleVideo Systems, Inc. 1983",13,10
       DB "(C) Copyright Microsoft Corp. 1981, 1982, 1983",13,10,"$"
ENDIF
Table 2. DOS files.
*These files are not source code as determined by their extensions and opening them up.
TeleVideo was a company that manufactured computer terminals. In the early 1980s, it also built CP/M and DOS computers, including the Model TS-1603 that ran both DOS 2.0 and CP/M-86 1.1.
2.3.2. Search for the Company Names
The CP/M files had mentions of Digital Research. The DOS files had mentions of Seattle Computer Products, IBM, TeleVideo Systems, and Microsoft. A case-insensitive search for the following terms in the DOS code did not produce any relevant results:
DRI (searched for whole word only)
Research (found two generic program labels)
2.3.3. Search for Author Names and Initials
The CP/M files had mentions of Gary Kildall while the DOS files had mentions of Tim Paterson. A case-insensitive search for the following terms in the DOS code did not produce any results.
2.3.4. Search for Any Relevant Terms
Interestingly, a search for the terms CP/M and CPM did find some results in the DOS source code.
In file MSDOS.ASM in DOS 1.1:
; 1.12 10/09/81 Zero high half of CURRENT BLOCK after all (CP/M programs don't)
. . .
STOSB ; Set it to zero (CP/M programs set low byte)
In file MSHEAD.ASM in DOS 2.0:
; 1.12 10/09/81 Zero high half of CURRENT BLOCK after all (CP/M programs don't)
And in the file SYSCALL.ASM in DOS 2.0:
STOSB ; Set it to zero (CP/M programs set low byte)
My research on the Internet and my reading of the code led me to believe that the code above has something to do with the file system. Because it discusses differences between DOS and CP/M, it would not be reasonable to interpret this as a clue that the code was copied from CP/M.
I also found the following reference to CP/M in file EXEC.ASM in DOS 2.0:
XOR AX,AX ; zero extent, etc for CPM
And in files PRINT.ASM and PRINT_v211.ASM I found:
DOCHAR: MOV AL,BYTE PTR [BX]
        CMP AL,1AH ; ^Z?
        JZ FILEOFJ ; CPM EOF
        CMP AL,0DH ; CR?
        JNZ NOTCR
        MOV [COLPOS],0
And in file PRINT_v211.ASM I found:
JZ FILEOFJ ; CPM EOF
The CP/M file system used fields called “extents” to keep track of files in directories. The sizes of CP/M files were stored in “sectors” of 128 bytes each. If a file filled up less than the 128 bytes of the last sector, the other bytes were filled with an ASCII Control-Z character as an end-of-file marker (EOF).
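The padding scheme just described can be illustrated in a few lines of Python. This is a minimal sketch of the convention, not code from either operating system; the function names `pad_to_sector` and `read_text` are hypothetical.

```python
SECTOR = 128   # CP/M sector size in bytes, as described above
CTRL_Z = 0x1A  # ASCII Control-Z, used as the end-of-file filler


def pad_to_sector(data: bytes) -> bytes:
    """Pad data with Control-Z so its length is a whole number of sectors."""
    shortfall = -len(data) % SECTOR
    return data + bytes([CTRL_Z]) * shortfall


def read_text(stored: bytes) -> bytes:
    """Return the text up to, but not including, the first Control-Z."""
    end = stored.find(bytes([CTRL_Z]))
    return stored if end == -1 else stored[:end]


stored = pad_to_sector(b"HELLO\r\n")
print(len(stored), read_text(stored))
```

The check for 1AH in the DOS PRINT code quoted above corresponds to the test performed in `read_text`: the reader stops at the first Control-Z because the stored length only says how many whole sectors the file occupies.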
DOS had a different way of keeping track of file information. It recorded file sizes in bytes and so no EOF marker was needed. The code above seems to indicate that DOS could read CP/M files and had special code to do so, but initial research showed that CP/M files were incompatible with DOS. Was this a clue to copying?
Further research showed that very early versions of DOS were designed to read and write CP/M files. The code above confirms that compatibility. Eventually that compatibility was dropped from DOS. The mention of CP/M in DOS makes sense once this purposeful compatibility is recognized. It is not a sign of copying.
2.4. Run CodeMatch and Inspect Most Highly Correlated File Pairs
Because CP/M is written in two different languages, two comparisons needed to be run. First, all DOS assembly code was compared to all CP/M assembly code. Second, all DOS assembly code was compared to all CP/M PL/M code.
2.4.1. DOS Assembly Code to CP/M Assembly Code
Examples and discussions of the matching elements between DOS and CP/M assembly code are given below.
1) Matching statements
Some examples of matching statements are shown in Appendix A. The first example shows that the constant TRUE is set to NOT FALSE. This is logical and would not be a sign of copying, especially since the line above shows that the constant FALSE is set to different values in DOS and CP/M.
In the second example, the label DELIM is found in both programs, which is a common abbreviation for the word “delimiter” that is a common programming term for a character that separates sections of a string of characters. The routines in both programs are examining characters of a string, and comparing them to find specific characters, but the routines are searching for different characters and thus not an indicator of copying.
In the third example, the statement DW RENAME is found in both programs, which reserves a word in memory for a variable called RENAME. In the CP/M code, this variable is used to store information about one of the operating system commands while in the DOS code it points to one of many DOS system calls. Given the different functionality, this is not an indicator of copying.
In the fourth example, the labels COMERR and COMERR1 are found. Both routines process command errors, but the code can be seen to be significantly different other than these two labels. In fact, the CP/M code has an additional label COMERR0 that is not found in the DOS code. Given the different functionality, this is not an indicator of copying.
In the fifth, sixth, and seventh examples, there are conditional jump instructions (JC, JZ, and JNZ) to identically labelled sections of code (COMERR, GETOP, SE2). However, the code surrounding these instructions is significantly different, and these matching instructions are thus not indicators of copying.
These matching statements, along with others, were examined, and none of them appeared to be correlated for any reason other than common programming terms that could be expected to be found in many programs and are thus not indicators of copying.
2) Matching comments and strings
Some examples of matching comments and strings are shown in Appendix B. The first comment is Get next character. Looking at the surrounding code, the routines are very different, and thus not an indicator of copying.
In the second example, the terms DIR, REN, and TYPE are found in both sets of source code. In both sets of code they are multiple-byte variables. However, in the DOS code, DIR and REN are 4 bytes while TYPE is 5 bytes. In the CP/M code they are all 4 bytes. They are also listed in a different order. When code is copied, it is rarely reordered because there is no need to do so. Both sets of code contain other commands that do not match, and these commands were well-known commands in operating systems at the time. Also note that these commands are the “intrinsic commands” that are processed by the operating system command processor code. Every other command had its own executable file and source code file. For example, the DDT and ED commands in CP/M had source code files DDT.ASM and ED.ASM and executable files DDT.COM and ED.COM respectively. While CP/M 1.3 implemented 5 commands intrinsically¹, DOS 1.1 implemented 11 commands intrinsically. Given the differences, it does not appear that this code was copied.
In the third example, the comment Select disk is found in both sets of source code. In the DOS code, the comment is in code that is outputting to a disk. In the CP/M code the comment is at code that is simply declaring a constant. Given the differences, it does not appear that this code was copied.
In the fourth example, the comment End of file can be found in both programs where a constant is set to 1AH in both files. The DOS constant EOF looks very similar to the CP/M constant EOFILE. However, this is the ASCII Control-Z that CP/M uses to signify the end of file, which we already determined that DOS also uses for compatibility. Interestingly, there is more overlap here. The EOL character in DOS is 0DH, which is the hex equivalent of the carriage return (CR) character, 13 in decimal. But the carriage return character was intended to be used to signal the end of a line, so it is no surprise that both operating systems use the character. This correlation is explained by common identifier names and common algorithms, and is not an indicator of copying.
The fifth example shows the comment Print it in both sets of code. This is a very common expression. Both functions are in debugger code, looping and printing characters, but the surrounding code is significantly different, performing different functions, and thus this is not an indicator of copying.
The matching comments and strings were examined, and none of them appeared to be correlated due to copying.
3) Matching identifiers
Some examples of matching identifiers are shown in Appendix C. In the first example, CRLF is a label in both programs. CRLF is a common abbreviation for the carriage return/linefeed that appears at the end of a string in CP/M and DOS. The rest of the surrounding code is different, and thus this is not an indicator of copying.
In the second example, renam is an identifier in both programs. In DOS it is a label whereas in CP/M it is a constant. Given that it is used differently in each program, it is not an indicator of copying.
In the third example, BLKSIZ is a constant in both programs. In DOS it is equal to 512 and is used for printing I/O blocks. In CP/M it is equal to 2048 and is a disk block. Given that it is used differently in each program, it is not an indicator of copying.
In the fourth example, FLGTAB is a variable of 4 bytes in both programs. In DOS, it is the ASCII bytes for the letters t, l, s, w, and b. In CP/M it is the numbers 1, 7, 8, 3, and 5. Given that it is used differently in each program, it is not an indicator of copying.
In the fifth example, RDLOOP is a label in the code. In both programs, it marks the beginning of a loop that ends in a conditional jump back to the beginning of the loop using the instruction JNZ RDLOOP. However, other than those instructions, the loops are very different. Given the differences in surrounding code in each program, it is not an indicator of copying.
In the sixth example, LSTFCB is a variable in the CP/M code while it is a constant in the DOS code. Given this difference, it is not an indicator of copying.
The matching identifiers were examined, and none of them appeared to be correlated due to copying.
4) Partially matching identifiers
Appendix D shows some examples of identifiers in DOS and CP/M that partially match. This means that the identifiers have a sequence of characters in common. This can help find identifiers that have been changed to hide copying. The leftmost column shows the identifier in DOS, the middle column shows the identifier in CP/M, and the rightmost column shows the overlap.
Examining partially matching identifiers requires looking at the common part and finding something unusual that would indicate copying. For example, the identifiers variableOne and variable1 might seem suspicious because they are identical except that the number 1 appears in one identifier where the word “one” appears in the other. Or the identifiers ZeidmanIndex and ZeidmanCount might seem like an attempt to disguise copying. Reviewing the partially matching identifiers, I found no such signs of copying.
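One simple way to compute the kind of overlap shown in the rightmost column is the longest common substring of two identifiers. The Python sketch below uses that standard technique as an assumption; it is not a description of how CodeMatch determines partial matches, and the function name is hypothetical.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest run of characters shared by a and b, compared case-insensitively."""
    a_low, b_low = a.lower(), b.lower()
    best_len, best_end = 0, 0
    prev = [0] * (len(b_low) + 1)
    for i in range(1, len(a_low) + 1):
        curr = [0] * (len(b_low) + 1)
        for j in range(1, len(b_low) + 1):
            if a_low[i - 1] == b_low[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


print(longest_common_substring("variableOne", "variable1"))
print(longest_common_substring("ZeidmanIndex", "ZeidmanCount"))
```

A long shared run is only a prompt for closer inspection; whether the shared part is unusual enough to suggest copying remains a judgment made by examining the surrounding code.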
5) Matching instruction sequences
If code has been extensively scrubbed to hide all signs of copying, there would still be instruction sequences that matched. If the code was modified so much that all of the algorithms were changed, then what was the justification for copying? So the final test is to look for instruction sequences that match.
Appendix E gives an example of one of the very few instruction sequences that matched in DOS and CP/M. As can be seen, this is a simple jump table that is a commonly known algorithm and not a sign of copying.
2.4.2. DOS Assembly Code to CP/M PL/M Code
It is unlikely that a high-level programming language such as PL/M would be copied to low-level assembly language because it would require manual translation or compilation and disassembly of the PL/M code, which could introduce errors. However, for completeness I compared the DOS assembly code to the CP/M PL/M code.
Examples and discussions of the matching elements between DOS assembly code and CP/M PL/M code are given below.
1) Matching statements
There were few matching statements, but two examples are given in Appendix F. In both cases, routines in both programs had an identical name but the algorithms being implemented in each case were significantly different. The few statement matches are not indications of copying.
2) Matching comments/strings
There were few matching comments and strings, but some examples are given in Appendix G. The comment RUBOUT is not unusual given that the ASCII delete character 7FH was also commonly called the rubout character.
In the second example, the comment get next character can be found in both sets of code. This is not an unusual comment and the surrounding code in both routines is very different.
In the third example, the comment Return current drive number can be found in both sets of code. Although this is a very uncommon phrase when searched on the Internet, as I will discuss in Section 2.5.2, the surrounding code in both routines is very different.
The few comment and string matches are not indications of copying.
3) Matching identifiers
There were some matching identifiers in both sets of code, examples of which are shown in Appendix H. The abbreviation FCB means file control block, a term used by both operating systems to keep track of files, so it is not unusual to find the terms PUTFCB and SETFCB in both sets of code.
More interesting, perhaps, is the use of the term SETDMA throughout both sets of code. In the CP/M code, SETDMA is the name of similar procedures in many files. In DOS, SETDMA is a constant in most files but a simple routine in the file MSDOS.ASM. Notice that while the code is very different in the two programs, the number 26 is associated with all of the SETDMA code. I will address this in Section 3.2, System Calls.
The few identifier matches are not indications of copying.
4) Partially matching identifiers
Appendix I shows some examples of identifiers in DOS and CP/M that partially match. The leftmost column shows the identifier in DOS, the middle column shows the identifier in CP/M, and the rightmost column shows the overlap. Reviewing the partially matching identifiers, I found no signs of copying.
5) Matching instruction sequences
There were no matching instruction sequences in the two sets of code.
2.5. Run SourceDetective for Identifiers, Statements, and Comments
The next step is to run SourceDetective to determine the number of times each matching code element (statements, comments and strings, and identifiers) can be found on the Internet. In a typical code comparison, this focuses attention on those elements that can be found in both programs but cannot be found, or are rarely found, on the Internet. These are much more likely to be smoking guns. In this case, however, since CP/M source code has been available online for several decades, running SourceDetective was not as helpful as it would otherwise be, which is why I examined nearly all cases of matching code elements. However, the rarely found elements may still be important and are described below.
2.5.1. DOS Assembly Code to CP/M Assembly Code
Table 3 shows the number of hits for the rarest matching comments and strings in the DOS and CP/M assembly code. All the matches are fairly common and provide no signs of copying.
Table 3. Matching DOS and CP/M assembly code comments and strings with hits on the internet.
Table 4. Matching DOS and CP/M assembly code identifiers with hits on the internet.
Table 4 shows the number of hits for the rarest matching identifiers in the DOS and CP/M assembly code. All the matches are fairly common except for the first one, lstfcb, and provide no signs of copying. The identifier lstfcb can be seen in Appendix C and was already determined not to be an indicator of copying.
Table 5 shows the number of hits for the rarest matching statements. The top of the table shows statements that are fairly rare, which could indicate copying. However, as shown in Appendix A, when the surrounding code is examined, these statements are found in very different routines in the two programs, indicating that they are not signs of copying.
2.5.2. DOS Assembly Code to CP/M PL/M Code
Table 6 shows the number of hits for the rarest matching comments and strings in the DOS assembly code and CP/M PL/M code. Only the first listed match is rare. Examining the procedures in which the comment is found, shown in Appendix G, the code is different in both programs and thus not a sign of copying.
Table 7 shows the number of hits for the rarest matching statements in the DOS assembly code and CP/M PL/M code. There are a few rare matches, as already described and already shown in Appendix F, which are not signs of copying as determined by the surrounding code. All the other matches are fairly common and provide no signs of copying.
Table 8 shows the number of hits for the rarest matching identifiers in the DOS assembly code and CP/M PL/M code. All the matches are fairly common and provide no signs of copying.
2.6. Examine Partial Identifiers
Reviewing the list of partially matching identifiers, I found that none of them stood out as unusual or indicated copying.
2.7. Run CodeCross
Table 5. Matching DOS and CP/M assembly code statements with hits on the internet.
Table 6. Matching DOS assembly code and CP/M PL/M code comments and strings with hits on the internet.
Table 7. Matching DOS assembly code and CP/M PL/M code statements with internet hits.
Table 8. Matching DOS assembly code and CP/M PL/M code identifiers with internet hits.
CodeCross compares functional code in one set of source code to nonfunctional comments in another set of source code. In many cases, when a programmer copies code, he or she will paste the original code into a file, comment it out, and begin writing new code using the old code as a guide. CodeCross finds this very strong indicator of copying.
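The idea of comparing functional code in one program with commented-out code in another can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the CodeCross algorithm: it assumes assembly-style comments introduced by a semicolon, ignores semicolons inside string literals, and uses a hypothetical function name.

```python
def commented_out_matches(functional_lines, other_lines, comment_char=";"):
    """Return statements in functional_lines that appear as comments in other_lines."""

    def normalize(text):
        # Ignore case and collapse runs of whitespace before comparing.
        return " ".join(text.upper().split())

    comments = set()
    for line in other_lines:
        _, sep, comment = line.partition(comment_char)
        if sep and comment.strip():
            comments.add(normalize(comment))

    hits = []
    for line in functional_lines:
        code = normalize(line.partition(comment_char)[0])
        if code and code in comments:
            hits.append(code)
    return hits


program_a = ["MOV AL,0 ; clear", "JMP DONE"]
program_b = ["; mov al,0", "CALL NEXT ; jmp done", "RET"]
print(commented_out_matches(program_a, program_b))
```

Short statements of one or two words will match this way in almost any pair of assembly programs, which is why the rarity of each match still has to be assessed.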
2.7.1. DOS Assembly Code to CP/M Assembly Code
The code was compared and the matches were found to consist of one- or two-word statements that were commented out. SourceDetective was run to determine whether these commented-out statements were rare, and they were determined to be extremely common, as shown in Table 9.
2.7.2. DOS Assembly Code to CP/M PL/M Code
The code was compared and the matches were found to consist of one- or two-word statements that were commented out. SourceDetective was also run to determine whether these commented-out statements were rare, and they were determined to be extremely common, as shown in Table 10.
2.8. Comparing DOS 1.0 Binary
The DOS source code from Microsoft is for version 1.1. No source code was supplied for version 1.0, and the binary files for version 1.0 are also difficult to find. I received a copy of the DOS 1.0 binary code from Daniel B. Sedory that appears to be valid. I compared this version to both the DOS version 1.1 source code and to the CP/M source code using the BitMatch function of CodeSuite, which compares binary code to binary code or to source code.
Table 9. DOS and CP/M assembly code commented-out statements and internet hits.
Table 10. DOS assembly code and CP/M PL/M code commented-out statements and internet hits.
2.8.1. Microsoft 1.0 Binary Code to Microsoft 1.1 Assembly Code
When source code is converted to binary code, much of the human-readable information is lost. Strings such as error messages are not lost, and some words also remain. The strings that were found in both versions of DOS are given in Table 11 while the words that were found in both versions of DOS are given in Table 12.
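Strings that survive in a binary can be recovered by scanning for runs of printable characters, in the manner of the Unix `strings` utility. The Python sketch below illustrates that general technique; it is not a description of how BitMatch works, and the function name, the minimum length, and the sample bytes are hypothetical.

```python
import re


def printable_strings(blob: bytes, min_len: int = 4):
    """Return runs of at least min_len printable ASCII characters found in blob."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(blob)]


# Invented sample bytes standing in for part of a binary file.
blob = b"\x00\x01Disk error\x00\xffAB\x00Insert disk\r\n$"
print(printable_strings(blob))
```

Strings extracted this way from two binaries, or from a binary and the string literals of a source file, can then be intersected to produce a list of matches like those in the tables that follow.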
Table 11. Matching strings in DOS 1.0 binary code and DOS 1.1 source code.
Table 12. Matching words in DOS 1.0 binary code and DOS 1.1 source code.
The fact that a relatively large number of strings and words were found in both versions indicates that the version 1.0 binary is probably a legitimate version of DOS.
2.8.2. Microsoft 1.0 Binary Code to CP/M Assembly Code
There was only one string that could be found in both programs. The words that can be found in both operating systems are common words, most of which are simple English language words. This comparison gives no indications of copying.
Table 13. Matching strings in DOS 1.0 binary code and CP/M assembly code.
Table 14. Matching words in DOS 1.0 binary code and CP/M assembly code.
2.8.3. Microsoft 1.0 Binary Code to CP/M PL/M Code
The only matching strings and words are common words, most of which are simple English language words. This comparison gives no indications of copying.
Table 15. Matching strings in DOS 1.0 binary code and CP/M PL/M code.
Table 16. Matching words in DOS 1.0 binary code and CP/M PL/M code.
3. Other Possible Copying
In addition to code, I examined whether the DOS commands were copied from CP/M and whether the DOS system calls were copied from CP/M.
3.1. Commands
The commands for DOS and CP/M are given in Table 17 along with those of OS/8, the operating system from Digital Equipment Corporation for the PDP-8 computer that was released before CP/M in 1974.
As can be seen, there is overlap between the commands, which I will discuss in my conclusions.
3.2. System Calls
System calls are the way that a computer program requests a service from the underlying operating system. Examples of early system calls included rebooting the system, outputting text to a console or a printer, determining the amount of memory that is installed in the system, or reading/writing data from/to a hard disk.
The DOS source code and CP/M source code for implementing the system calls are shown in Appendix J. Programs running on DOS and CP/M used different software code to perform system calls, and the code to implement the system calls was written very differently. However, at least 22 system calls (numbers 0 through 5, 9 through 11, 13 through 23, 25, and 26) use identical numbers for identical functions2. I will discuss the implications of this in my conclusions.
2This count is based on the code comments and research into DOS and CP/M. It is possible that other system calls also use identical numbers, but the functions of those system calls are not clearly described.
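The count of 22 can be checked by expanding the number ranges listed above. The short Python sketch below does only that arithmetic; the variable and function names are mine and are not part of the comparison tools.

```python
# Inclusive ranges of system call numbers reported as identical in the text:
# 0 through 5, 9 through 11, 13 through 23, 25, and 26.
IDENTICAL_RANGES = [(0, 5), (9, 11), (13, 23), (25, 25), (26, 26)]


def expand(ranges):
    """Expand inclusive (first, last) pairs into a flat list of numbers."""
    numbers = []
    for first, last in ranges:
        numbers.extend(range(first, last + 1))
    return numbers


identical_calls = expand(IDENTICAL_RANGES)
print(len(identical_calls))  # 6 + 3 + 11 + 1 + 1 = 22
```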
4. Conclusions
Here are my conclusions about copying. Because many people are interested in whether DRI could have brought a copyright lawsuit against Microsoft, I will tie my conclusions to that possibility. Keep in mind that while I have extensive experience in copyright law, I am not a lawyer and the law is constantly changing.
4.1. Software Source Code
There is no indication of copying of software source code. The small number of correlations between DOS source code and CP/M source code can all be explained by reasons other than copying.
4.2. Commands
The command names are descriptive of the functionality, which would preclude copyrightability because only creative expression that is not descriptive or functional can be copyrighted. Also, DOS commands have more in common with OS/8 commands than with CP/M commands, and even many CP/M commands appear copied from OS/8, so it would be difficult to claim that DOS copied CP/M. A claim of copyright infringement of the commands would probably not hold up.
Table 17. DOS, CP/M, and OS/8 commands.
4.3. System Calls
The DOS system calls were definitely copied from the CP/M system calls. Given the quantity of identical numbers representing identical functions, it is clear that Tim Paterson referenced the CP/M manual when writing DOS.
So the question of copyright infringement of system calls remains. While a list of numbers is not by itself creative and thus not copyrightable, a list of numbers that arbitrarily express specific functions is creative and thus copyrightable. Furthermore, DRI appears to have indicated its copyright by putting a copyright notice on the CP/M Interface Guide that describes the system calls. Had DRI brought a copyright infringement case against Microsoft, it would have had to show that it guarded its system calls from copying.
On the other hand, Microsoft could have prevailed by showing that it was a fair use to copy the system calls. According to copyright law, fair use is determined by the following factors:
1) The purpose and character of the use, including whether such use is for nonprofit educational purposes.
2) The nature of the copyrighted work, especially whether it benefits the public.
3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole.
4) The effect of the use upon the potential market for or value of the copyrighted work.
It is clear that the copying did not pass the first two factors. DOS was a commercial product sold at a profit, and it would be hard to argue that the copying served a public benefit. Therefore, to defeat a copyright infringement charge, Microsoft would have had to show that the amount of copyrighted material copied into DOS was minimal and that copying the CP/M system calls did not, by itself, cause DRI any financial harm.
It is my opinion that DRI could have brought a legitimate copyright claim against Microsoft for copying a substantial number of system calls. Furthermore, it is my belief that Microsoft could have claimed a fair use defense because using the same system call numbers did not reduce the market for CP/M. In other words, no one bought DOS over CP/M solely because many of the system calls used the same numbers.
I further believe that had DRI brought a copyright case against Microsoft, Microsoft would have won using the fair use argument.
5. Download Full Results and Tools
The detailed results are too extensive to be included in their entirety in this paper. The custom scripts and code comparison results can be downloaded in a zip file at http://www.ZeidmanConsulting.com/DOS_comparisons.
I would like to thank Len Shustek and John Hollar at the Computer History Museum for pointing me to the DOS code and encouraging me to do another comparison. I would also like to thank Daniel B. Sedory for providing me with a rare copy of PC DOS 1.0 binary code. I would like to thank Clement Cole for pointing me to the DEC OS/8 handbook and pointing out the similarities to CP/M commands. And I would like to thank Tom Rolander, employee number one at Digital Research, who was always happy to answer my questions.
1The USER command is actually a way for CP/M to access extrinsic commands and is not an actual intrinsic command.
 Zeidman, B. (2014) A Code Correlation Comparison of the DOS and CP/M Operating Systems. Journal of Software Engineering and Applications, 7, 513-529.