So Miyagawa Computational Linguistics, Digital Humanities, & Egyptology Computing Laboratory at University of Tsukuba
Main homepage: https://somiyagawa.com/ / Contact: miyagawa.so.kb at u.tsukuba.ac.jp

Intertextuality Collation Machine (ICoMa)

Computational Methods for Detecting Text Reuse in Multilingual Manuscript Traditions

Experimental Version | ISO/IEC 10646:2017 Compliant | Based on Smith-Waterman Algorithm Variants
ABSTRACT
This digital instrument implements computational algorithms for the detection and analysis of textual parallels across manuscript witnesses. The system supports Unicode blocks for Coptic (U+2C80–U+2CFF), Greek (U+0370–U+03FF, U+1F00–U+1FFF), and Latin scripts, employing n-gram analysis, Jaccard coefficients, and normalized edit distance metrics. Applications include stemmatic analysis, intertextuality studies, and dialectal variation mapping in ancient and medieval texts.
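The block ranges above can be turned into a simple script detector. The following is a minimal sketch, not the tool's own code: the names `SCRIPT_RANGES`, `script_of`, and `dominant_script` are hypothetical, and the Latin range (up to U+024F, Latin Extended-B) is an assumption, since the abstract gives no code points for Latin.

```python
from collections import Counter

# Ranges copied from the abstract; the dictionary name is hypothetical.
SCRIPT_RANGES = {
    "coptic": [(0x2C80, 0x2CFF)],
    "greek": [(0x0370, 0x03FF), (0x1F00, 0x1FFF)],
}

# Assumption: "Latin" means Basic Latin through Latin Extended-B.
LATIN_MAX = 0x024F


def script_of(ch: str) -> str:
    """Return the script label for a single character."""
    cp = ord(ch)
    for name, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    if ch.isalpha() and cp <= LATIN_MAX:
        return "latin"
    return "other"


def dominant_script(text: str) -> str:
    """Return the most frequent script among the alphabetic characters."""
    counts = Counter(script_of(c) for c in text if c.isalpha())
    return counts.most_common(1)[0][0] if counts else "other"
```

One caveat under these ranges: the Coptic letters derived from Demotic (U+03E2–U+03EF) sit inside the Greek and Coptic block, so this sketch labels them "greek" unless that sub-range is handled separately.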

1. METHODOLOGY

The collation algorithm slides a window over each witness, with a configurable n-gram size (2 ≤ n ≤ 20), to identify textual correspondences. Similarity is calculated using (i) token-based overlap coefficients, (ii) character-level edit distances normalized by the length of the longer string, and (iii) hybrid approaches that account for orthographic variation. The tokenization module applies language-specific morphological segmentation rules; diacritical marks are preserved in the stored text and normalized only for comparison.¹
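The three ingredients named above (n-gram windows, a token-based overlap coefficient, and a length-normalized edit distance) can be sketched as follows. This is an illustrative sketch under stated assumptions: all function names are hypothetical, the overlap coefficient is taken to be Jaccard as mentioned in the abstract, and stripping combining marks stands in for the tool's language-specific normalization, which is not documented here.

```python
import unicodedata


def normalize(token: str) -> str:
    """Comparison key: drop combining marks and case (assumed normalization)."""
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped).casefold()


def ngrams(tokens, n):
    """All contiguous windows of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard(a, b):
    """Jaccard coefficient of two n-gram collections, in [0, 1]."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_edit_similarity(s: str, t: str) -> float:
    """1 minus the edit distance divided by the longer string's length."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest
```

In this sketch the original tokens are never modified; `normalize` is applied only when building the keys that are compared, so accented and unaccented spellings of the same word yield the same key.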

2. CORPUS SELECTION

3. DATA INPUT

4. ANALYSIS PARAMETERS

Parameter | Value | Description
N-gram Size (n) | | Token window for sequence matching
Threshold (τ) | | Minimum similarity coefficient (%)
Algorithm | | Similarity metric calculation
Script Mode | | Character set processing mode
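The four parameters could be grouped and validated as shown below. This is a hedged sketch: the class name, field names, the allowed values for Algorithm and Script Mode, and the default n of 3 are all assumptions made for illustration. Only the bound 2 ≤ n ≤ 20 and the 60% threshold (from the Table 2 caption) come from this page.

```python
from dataclasses import dataclass

# Hypothetical option lists; the page does not enumerate the real choices.
ALGORITHMS = ("jaccard", "edit_distance", "hybrid")
SCRIPT_MODES = ("auto", "coptic", "greek", "latin")


@dataclass(frozen=True)
class AnalysisParameters:
    ngram_size: int = 3               # n, assumed default
    threshold_percent: float = 60.0   # τ, as a percentage
    algorithm: str = "jaccard"
    script_mode: str = "auto"

    def __post_init__(self):
        if not 2 <= self.ngram_size <= 20:
            raise ValueError("ngram_size must satisfy 2 <= n <= 20")
        if not 0.0 <= self.threshold_percent <= 100.0:
            raise ValueError("threshold_percent must lie in [0, 100]")
        if self.algorithm not in ALGORITHMS:
            raise ValueError(f"unknown algorithm: {self.algorithm!r}")
        if self.script_mode not in SCRIPT_MODES:
            raise ValueError(f"unknown script mode: {self.script_mode!r}")

    @property
    def threshold(self) -> float:
        """τ as a fraction in [0, 1], comparable with σ."""
        return self.threshold_percent / 100.0
```

Storing τ as a percentage mirrors the table, while the `threshold` property gives the fraction needed when comparing against a similarity σ in [0, 1].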

5. RESULTS

5.1 Statistical Summary

Table 1. Quantitative Analysis of Textual Correspondence
Metric | Value | Unit
Total Alignments | 0 | count
Mean Similarity | 0.00 | %
Coverage Ratio | 0.00 | %
Unique N-grams | 0 | count
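The page does not define how the four metrics in Table 1 are computed, so the following is only one plausible reading. It assumes that Coverage Ratio is the share of Witness α tokens covered by at least one alignment and that Unique N-grams counts distinct n-grams across both witnesses. The `Alignment` record and `summarize` function are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Alignment:
    start_a: int       # token offset in Witness α
    start_b: int       # token offset in Witness β
    length: int        # number of aligned tokens
    similarity: float  # σ in [0, 1]


def summarize(alignments, tokens_a, tokens_b, n):
    """Compute Table 1 style metrics under the assumptions stated above."""
    covered = set()
    for al in alignments:
        covered.update(range(al.start_a, al.start_a + al.length))

    grams = set()
    for toks in (tokens_a, tokens_b):
        grams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    return {
        "total_alignments": len(alignments),
        "mean_similarity_pct": (
            100.0 * mean(al.similarity for al in alignments) if alignments else 0.0
        ),
        "coverage_ratio_pct": (
            100.0 * len(covered) / len(tokens_a) if tokens_a else 0.0
        ),
        "unique_ngrams": len(grams),
    }
```

With no input and no alignments, every metric is zero, which matches the empty state shown in the table.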
Figure 1. Morphological Alignment Matrix (panels: Witness α, Witness β)

Similarity Index: Exact (σ = 1.0); High (σ > 0.8); Medium (0.6 ≤ σ ≤ 0.8); Low (0.4 ≤ σ < 0.6)
Script Classification: Regular = Coptic; Italic = Greek; Roman = Latin
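The similarity index bands in the Figure 1 legend can be expressed as a small lookup. This is a sketch with a hypothetical function name. Because the legend lists Exact separately, σ = 1.0 is tested first. The label for values below 0.4 is an assumption, since the legend does not name that range.

```python
def similarity_band(sigma: float) -> str:
    """Map a similarity σ in [0, 1] to the legend's band name."""
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    if sigma == 1.0:
        return "Exact"
    if sigma > 0.8:
        return "High"
    if sigma >= 0.6:
        return "Medium"   # the legend puts the boundary 0.8 in this band
    if sigma >= 0.4:
        return "Low"
    return "Unclassified"  # assumed label; below the legend's range
```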
Figure 2. Similarity Matrix Visualization
Figure 3. Text Reuse Network Graph

5.2 Detected Parallel Passages

Table 2. Aligned Text Segments (Threshold τ ≥ 60%)
ID | Witness α | Witness β | σ
No alignments detected
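Aligned segments of the kind Table 2 is meant to list could come from a local alignment in the Smith-Waterman family, which the page header names as the basis of the tool. The sketch below is the textbook algorithm applied to token lists, not the tool's own variant. The scoring values (match 2, mismatch -1, linear gap -1) and the function name are assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment of token lists a and b.

    Returns (score, segment_a, segment_b); None marks a gap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    seg_a, seg_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            seg_a.append(a[i - 1])
            seg_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            seg_a.append(a[i - 1])
            seg_b.append(None)
            i -= 1
        else:
            seg_a.append(None)
            seg_b.append(b[j - 1])
            j -= 1
    return best, seg_a[::-1], seg_b[::-1]
```

For example, aligning the tokens of "in the beginning was the word" against "and in the beginning there was the word" yields one segment in which the extra token "there" is paired with a gap in the first witness.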
¹ For theoretical foundations, see Jockers, M. L. (2013) Macroanalysis: Digital Methods and Literary History. University of Illinois Press; Smith, D. A., et al. (2014) "Computational Methods for Uncovering Reprinted Texts in Antebellum Newspapers." American Literary History 27.3: E1-E15.

REFERENCES

Bostock, M. (2011). D3: Data-Driven Documents. IEEE Transactions on Visualization and Computer Graphics, 17(12), 2301-2309.
ISO/IEC 10646:2017. Information technology — Universal Coded Character Set (UCS). International Organization for Standardization.
Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197.
Zanetti, F. (2022). Digital Codicology and Medieval Manuscripts. In Handbook of Stemmatology (pp. 456-489). De Gruyter.