Hello,
I implemented a similar project a few years ago, though it was somewhat more complex: it processed thousands of text documents on a cluster of distributed machines. So I have solid experience with Information Retrieval techniques.
Based on your description, I would first automatically clean up each document (remove punctuation, etc.) and then extract the plain words. These words are combined into n-grams (for n = 1, ..., m, with a user-defined m), weighted with "term frequency - inverse document frequency" (TF-IDF), and finally the documents are compared with cosine similarity. This produces a score from 0 (not similar) to 1 (identical).
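To give you an idea, here is a minimal sketch of that pipeline with scikit-learn; the sample documents and the choice of m = 2 are just placeholders for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for your real documents.
docs = [
    "The quick brown fox jumps over the lazy dog.",
    "A quick brown dog jumps over a lazy fox!",
    "Completely unrelated text about databases.",
]

m = 2  # user-defined upper bound for the n-gram length
# TfidfVectorizer lowercases and strips punctuation during tokenization,
# builds all n-grams for n = 1..m, and applies TF-IDF weighting.
vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, m))
tfidf = vectorizer.fit_transform(docs)  # rows: documents, columns: n-grams

# Pairwise cosine similarity; entries lie in [0, 1] for TF-IDF vectors.
scores = cosine_similarity(tfidf)
print(scores.round(2))
```

The diagonal is 1 (each document compared with itself), and the two fox/dog sentences score noticeably higher with each other than with the unrelated third document.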
Based on the score, it is possible to identify all documents DS which are similar to D by thresholding the score. And of course it is also possible to identify the k most similar documents.
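Both retrieval modes are a few lines on top of the similarity scores. A sketch, with a made-up corpus, query, threshold, and k purely for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "apples and oranges are fruit",
    "apples are a popular fruit",
    "cars and trucks are vehicles",
    "oranges are citrus fruit",
]
query = "which fruit is similar to apples"  # stands in for document D

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)
query_vec = vectorizer.transform([query])  # reuse the fitted vocabulary

# One similarity score per corpus document.
scores = cosine_similarity(query_vec, doc_matrix).ravel()

# Mode 1: all documents DS above a similarity threshold.
threshold = 0.1  # would be tuned on your data
similar = [i for i, s in enumerate(scores) if s >= threshold]

# Mode 2: the k most similar documents, best first.
k = 2
top_k = np.argsort(scores)[::-1][:k]
```

The vehicle document shares no terms with the query, so its score is 0 and it never appears in the top-k results.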
I would implement it in Python with the scikit-learn package (BSD license).
If you have any questions, do not hesitate to send me a message.
Sincerely,
Sebastian