Computational Methods for Detecting Text Reuse in Multilingual Manuscript Traditions
Experimental Version | ISO/IEC 10646:2017 Compliant | Based on Smith-Waterman Algorithm Variants
ABSTRACT
This digital instrument implements computational algorithms for the detection and analysis of textual parallels across manuscript witnesses. The system supports Unicode blocks for Coptic (U+2C80–U+2CFF), Greek (U+0370–U+03FF, U+1F00–U+1FFF), and Latin scripts, employing n-gram analysis, Jaccard coefficients, and normalized edit distance metrics. Applications include stemmatic analysis, intertextuality studies, and dialectal variation mapping in ancient and medieval texts.
1. METHODOLOGY
The collation algorithm slides a window over each witness, with a configurable n-gram size (2 ≤ n ≤ 20), to identify textual correspondences. Similarity is calculated using: (i) token-based overlap coefficients, (ii) character-level edit distances normalized by the length of the longer string, and (iii) hybrid approaches that account for orthographic variation. The tokenization module applies language-specific morphological segmentation rules; diacritical marks are preserved in the stored text and normalized only for comparison (see note 1).
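The metrics named above can be illustrated with a minimal sketch. It assumes whitespace tokenization, a comparison form that strips combining marks and lowercases, and the Jaccard coefficient as the token-based overlap measure; the function names are hypothetical and the morphological segmentation rules are not reproduced.

```python
import unicodedata


def normalize(text: str) -> str:
    """Comparison form (assumed policy): strip combining marks, then lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).lower()


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Slide a window of n tokens over the sequence and collect the n-grams."""
    if not 2 <= n <= 20:
        raise ValueError("n must satisfy 2 <= n <= 20")
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance, keeping only the previous row of the table."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_similarity(s: str, t: str) -> float:
    """1 - distance / length of the longer string; two empty strings score 1.0."""
    longest = max(len(s), len(t))
    return 1.0 - edit_distance(s, t) / longest if longest else 1.0


# Illustrative input only: two variant readings compared on token bigrams.
alpha = normalize("ἐν ἀρχῇ ἦν ὁ λόγος").split()
beta = normalize("ἐν ἀρχῇ ἦν ὁ θεός").split()
bigram_overlap = jaccard(ngrams(alpha, 2), ngrams(beta, 2))
```

A hybrid metric in the sense of (iii) could combine the two scores, for instance by applying the edit similarity to tokens that fail an exact match; the weighting is left open here because it is not specified.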
2. CORPUS SELECTION
3. DATA INPUT
4. ANALYSIS PARAMETERS
Parameter | Value | Description
N-gram Size (n) | | Token window for sequence matching
Threshold (τ) | | Minimum similarity coefficient (%)
Algorithm | | Similarity metric calculation
Script Mode | | Character set processing mode
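The four parameters can be pictured as a validated configuration object. The sketch below is an assumption throughout: the class, field, and mode names are hypothetical, the Greek and Coptic ranges are those quoted in the abstract, and the Latin ranges are supplied only for illustration.

```python
from dataclasses import dataclass

# Greek and Coptic ranges as quoted in the abstract; Latin ranges are an assumption.
# Note: the Coptic letters at U+03E2-U+03EF sit inside the Greek and Coptic block,
# so these ranges alone would label them "greek".
SCRIPT_RANGES = {
    "coptic": [(0x2C80, 0x2CFF)],
    "greek": [(0x0370, 0x03FF), (0x1F00, 0x1FFF)],
    "latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
}


@dataclass(frozen=True)
class AnalysisParameters:
    ngram_size: int    # n: token window for sequence matching
    threshold: float   # τ: minimum similarity coefficient, in percent
    algorithm: str     # similarity metric label (hypothetical values)
    script_mode: str   # a key of SCRIPT_RANGES, or "auto" (hypothetical)

    def __post_init__(self):
        if not 2 <= self.ngram_size <= 20:
            raise ValueError("ngram_size must satisfy 2 <= n <= 20")
        if not 0.0 <= self.threshold <= 100.0:
            raise ValueError("threshold is a percentage between 0 and 100")
        if self.script_mode != "auto" and self.script_mode not in SCRIPT_RANGES:
            raise ValueError(f"unknown script mode: {self.script_mode}")


def script_of(ch: str) -> str:
    """Classify a single character by code point range."""
    code_point = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(low <= code_point <= high for low, high in ranges):
            return script
    return "unknown"


def dominant_script(text: str) -> str:
    """Most frequent script among the alphabetic characters of the text."""
    counts: dict[str, int] = {}
    for ch in text:
        if ch.isalpha():
            script = script_of(ch)
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "unknown"


# Example values only; 60 echoes the threshold shown in the caption of Table 2.
params = AnalysisParameters(ngram_size=3, threshold=60.0,
                            algorithm="jaccard", script_mode="auto")
```

An "auto" mode of this kind could call `dominant_script` on each witness before tokenization; whether the instrument does so is not stated.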
5. RESULTS
5.1 Statistical Summary
Table 1. Quantitative Analysis of Textual Correspondence
Metric | Value | Unit
Total Alignments | 0 | count
Mean Similarity | 0.00 | %
Coverage Ratio | 0.00 | %
Unique N-grams | 0 | count
Figure 1. Morphological Alignment Matrix
Witness α | Witness β | Similarity Index | Script Classification
Legend:
Exact (σ = 1.0) | Regular | Coptic
High (σ > 0.8) | Italic | Greek
Medium (0.6 ≤ σ ≤ 0.8) | Roman | Latin
Low (0.4 ≤ σ < 0.6) | — | —
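The similarity bands of Figure 1 translate directly into a lookup. In the sketch below the function name is hypothetical, and the label for scores under 0.4 is an assumption, since the figure defines no band for that range.

```python
def similarity_band(sigma: float) -> str:
    """Map a similarity index in [0, 1] to the band labels used in Figure 1."""
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    if sigma == 1.0:
        return "Exact"
    if sigma > 0.8:
        return "High"
    if sigma >= 0.6:
        return "Medium"
    if sigma >= 0.4:
        return "Low"
    return "Unclassified"  # assumption: the figure lists no band below 0.4
```

Testing for the exact case first keeps σ = 1.0 out of the High band, which the figure's open-ended bound (σ > 0.8) would otherwise include.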
Figure 2. Similarity Matrix Visualization
Figure 3. Text Reuse Network Graph
5.2 Detected Parallel Passages
Table 2. Aligned Text Segments (Threshold τ ≥ 60%)
ID | Witness α | Witness β | σ
No alignments detected
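The subtitle names Smith-Waterman variants as the basis for alignment without saying which variant is used. As an illustration only, the sketch below gives the textbook local alignment (Smith & Waterman 1981) over token sequences with a linear gap penalty; the scoring values and the function name are assumptions, not the instrument's settings.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Gaps in either aligned sequence are represented by None.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(None)
            i -= 1
        else:
            out_a.append(None)
            out_b.append(b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]


# Illustrative input only: the shared run "in principio erat" is recovered.
witness_alpha = "in principio erat verbum et verbum".split()
witness_beta = "et in principio erat sermo".split()
result = smith_waterman(witness_alpha, witness_beta)
```

Converting the raw score into a similarity index σ comparable with the threshold τ would need a normalization step, which is not specified here and is left out.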
1 For theoretical foundations, see Jockers, M. L. (2013). Macroanalysis: Digital Methods and Literary History. University of Illinois Press; Smith, D. A., et al. (2015). Computational Methods for Uncovering Reprinted Texts in Antebellum Newspapers. American Literary History, 27(3), E1-E15.
REFERENCES
Bostock, M., Ogievetsky, V., & Heer, J. (2011). D3: Data-Driven Documents. IEEE Transactions on Visualization and Computer Graphics, 17(12), 2301-2309.
ISO/IEC 10646:2017. Information technology — Universal Coded Character Set (UCS). International Organization for Standardization.
Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197.
Zanetti, F. (2022). Digital Codicology and Medieval Manuscripts. In Handbook of Stemmatology (pp. 456-489). De Gruyter.