A Corpus of Digitally Neglected Texts

Nuṣūṣ is a corpus of digitized Arabic texts designed to fill gaps in extant digital corpora. Originally a collection of early Sufi and Sufi-adjacent texts, nuṣūṣ has since expanded to include early works on kalām, falsafa, and Christian theology. Through this website, users can browse text metadata, including author biographies; read these works online; and, most importantly, search the contents of texts in the corpus. The digitized texts are available to download here on nuṣūṣ for computational textual analysis or other uses. All the data for nuṣūṣ is also available on the project's GitHub.

If you are interested in learning more or potentially contributing to this project, please contact us!

Corpus Details

We initially envisioned nuṣūṣ as a database for early Sufi texts not included in other digitized corpora, but the project has since expanded to include falsafa, kalām, and Christian theological works.

To digitize these texts, we use eScriptorium, a digital palaeography framework that uses the kraken OCR engine. For our project, we use a kraken OCR model developed by the OpenITI team.

Although the goal of nuṣūṣ is to digitize and make available new texts, we have decided to include some works that have already been digitized and are extant in other corpora. We do this to consolidate Sufi, falsafa, kalām, and Christian theological texts in one corpus and to increase the utility of our database and its search functionality. Texts imported from other corpora are marked as such in nuṣūṣ.

We maintain an internal list of new texts to add, but if you have any suggestions for texts to include, please contact us.

Corpus Stats

  • Texts: 91
  • Authors: 34
  • Tokens: 1,083,050
  • Pages: 4,707