Computational Collation System for Multilingual Manuscript Analysis

ABSTRACT

This digital instrument detects and analyzes textual parallels across manuscript witnesses. It handles text in the Coptic (U+2C80–U+2CFF), Greek (U+0370–U+03FF, U+1F00–U+1FFF), and Latin scripts, and compares witnesses using n-gram analysis, Jaccard coefficients, and normalized edit distance. Applications include stemmatic analysis, intertextuality studies, and the mapping of dialectal variation in ancient and medieval texts.

1. METHODOLOGY

The collation algorithm slides a window of configurable n-gram size (2 ≤ n ≤ 20) across each witness to identify textual correspondences. Similarity is calculated using one of three metrics: (i) token-based overlap coefficients, (ii) character-level edit distances normalized by the length of the longer string, and (iii) hybrid approaches that account for orthographic variation. The tokenization module applies language-specific morphological segmentation rules; it preserves diacritical marks in the text and normalizes only for comparison purposes.¹
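The metrics described above can be illustrated with a minimal Python sketch. It is not the tool's own implementation: the function names are hypothetical, tokenization is reduced to whitespace splitting rather than morphological segmentation, and the normalization step (stripping combining marks and case-folding to build a separate comparison key) is only one plausible reading of "normalizing for comparison purposes".

```python
import unicodedata


def normalize_for_comparison(text: str) -> str:
    """Build a comparison key; the original text is left untouched.

    Assumption: normalization means removing combining marks and case-folding.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).casefold()


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Slide a window of n tokens across the sequence."""
    if not 2 <= n <= 20:
        raise ValueError("n must satisfy 2 <= n <= 20")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; two empty sets score 0.0 by convention here."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def levenshtein(s: str, t: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_similarity(s: str, t: str) -> float:
    """1 - distance / length of the longer string; two empty strings score 1.0."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest
```

Under these assumptions, two witnesses could be compared by normalizing each text, splitting it into tokens, and passing `set(ngrams(tokens_a, n))` and `set(ngrams(tokens_b, n))` to `jaccard`.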

2. CORPUS SELECTION

3. DATA INPUT

Witness α (Primary)

Witness β (Comparandum)

4. ANALYSIS PARAMETERS

Parameter	Value	Description
N-gram Size (n)		Token window for sequence matching
Threshold (τ)		Minimum similarity coefficient (%)
Algorithm		Similarity metric calculation
Script Mode		Character set processing mode
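
As an illustration only, the four parameters in the table could be held in a small validated configuration object. Everything named here is hypothetical: the class and field names, the algorithm and script mode labels, and the default values (the 60% threshold is borrowed from the caption of Table 2; the default n is arbitrary). The Coptic and Greek ranges are those given in the abstract, while the Latin ranges are an assumption, since the abstract lists none.

```python
from dataclasses import dataclass

# Coptic and Greek ranges come from the abstract; the Latin ranges are assumed.
# Note that U+0370–U+03FF also contains a few Coptic letters (U+03E2–U+03EF),
# which this simple lookup would count as Greek.
SCRIPT_RANGES = {
    "coptic": [(0x2C80, 0x2CFF)],
    "greek": [(0x0370, 0x03FF), (0x1F00, 0x1FFF)],
    "latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
}

ALGORITHMS = ("jaccard", "edit_distance", "hybrid")    # hypothetical labels
SCRIPT_MODES = ("auto", "coptic", "greek", "latin")    # hypothetical labels


def detect_script(text: str) -> str:
    """Return the script with the most alphabetic characters in text, or 'unknown'."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in text:
        if not ch.isalpha():
            continue
        cp = ord(ch)
        for name, ranges in SCRIPT_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[name] += 1
                break
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"


@dataclass(frozen=True)
class AnalysisParameters:
    ngram_size: int = 3               # token window n, 2 <= n <= 20
    threshold_percent: float = 60.0   # minimum similarity coefficient τ (%)
    algorithm: str = "jaccard"        # similarity metric
    script_mode: str = "auto"         # character set processing mode

    def __post_init__(self):
        if not 2 <= self.ngram_size <= 20:
            raise ValueError("ngram_size must satisfy 2 <= n <= 20")
        if not 0.0 <= self.threshold_percent <= 100.0:
            raise ValueError("threshold_percent must lie in [0, 100]")
        if self.algorithm not in ALGORITHMS:
            raise ValueError(f"algorithm must be one of {ALGORITHMS}")
        if self.script_mode not in SCRIPT_MODES:
            raise ValueError(f"script_mode must be one of {SCRIPT_MODES}")
```

With `script_mode="auto"`, a caller might run `detect_script` on each witness before choosing tokenization rules; validating on construction keeps out-of-range values from reaching the comparison step.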

5. RESULTS

5.1 Statistical Summary

Table 1. Quantitative Analysis of Textual Correspondence
Metric	Value	Unit
Total Alignments	0	count
Mean Similarity	0.00	%
Coverage Ratio	0.00	%
Unique N-grams	0	count

Figure 1. Morphological Alignment Matrix

Witness α

Witness β

	Script Classification
Exact (σ = 1.0)	Regular	Coptic
High (0.8 < σ < 1.0)	Italic	Greek
Medium (0.6 ≤ σ ≤ 0.8)	Roman	Latin
Low (0.4 ≤ σ < 0.6)	—	—
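
The similarity bands in the legend above amount to a simple threshold lookup, sketched below in Python. The function name and the "Below scale" label for σ < 0.4 are hypothetical; the legend itself defines no band under 0.4. Exact matches are tested first so that σ = 1.0 is not also reported as High.

```python
def classify_similarity(sigma: float) -> str:
    """Map a similarity coefficient in [0, 1] to the legend's band name."""
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    if sigma == 1.0:
        return "Exact"
    if sigma > 0.8:
        return "High"
    if sigma >= 0.6:
        return "Medium"
    if sigma >= 0.4:
        return "Low"
    return "Below scale"  # hypothetical label; not part of the legend
```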

Figure 2. Similarity Matrix Visualization

Figure 3. Text Reuse Network Graph

5.2 Detected Parallel Passages

Table 2. Aligned Text Segments (Threshold τ ≥ 60%)
ID	Witness α	Witness β	σ
No alignments detected

¹ For theoretical foundations, see Jockers, M. L. (2013). Macroanalysis: Digital Methods and Literary History. University of Illinois Press; Smith, D. A., et al. (2015). "Computational Methods for Uncovering Reprinted Texts in Antebellum Newspapers." American Literary History 27.3: E1-E15.

REFERENCES

Bostock, M., Ogievetsky, V., & Heer, J. (2011). D3: Data-Driven Documents. IEEE Transactions on Visualization and Computer Graphics, 17(12), 2301-2309.

ISO/IEC 10646:2017. Information technology — Universal Coded Character Set (UCS). International Organization for Standardization.

Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197.

Zanetti, F. (2022). Digital Codicology and Medieval Manuscripts. In Handbook of Stemmatology (pp. 456-489). De Gruyter.

Intertextuality Collation Machine (ICoMa)

Computational Methods for Detecting Text Reuse in Multilingual Manuscript Traditions