Did you watch the volunteers indexing the 1940 Census and wish you could run a project like that for the records you care about? Then read on...
Historic documents often contain handwriting, old fonts, or other text formats that OCR software can't handle. We need humans--from volunteers to paid staff--to read the document images and transcribe what they see into databases that can be searched, analyzed, crawled, and used by researchers. Until now those efforts have required organizations either to outsource indexing to external partners or to cobble together their own offline or on-site systems.
Our goal is to build a tool that libraries, archives, museums, historical sites, and genealogy and heritage societies can use to run their own indexing projects, under their own control.
We invite libraries, archives, and museums, along with historical, genealogy, and heritage societies, to participate in the project. Right now we need advice and examples of indexing projects that real organizations would like to run. These would let us work with an eye on real data beyond the UK parish registers and English census records that have driven our development so far.
What we need from you
Project definitions including:
- Sample image files (around 5 per project in the format you'd use for access copies),
- A maximal spec for the data you'd like to collect,
- A minimal set of required fields you need, and
- A description of the material and goals of the project.
These images and project specs will be published online and shared among collaborators.
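To make the shape of a project definition concrete, here is a minimal sketch of how the four items above might be captured and sanity-checked. It is only an illustration: the `ProjectDefinition` class, its attribute names, and the example project, file names, and field names are all assumptions made for this sketch, not part of the tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProjectDefinition:
    """Hypothetical container for one contributed example project."""

    title: str
    description: str            # the material and the goals of the project
    sample_images: List[str]    # access-copy image files, around 5 per project
    maximal_fields: List[str]   # everything the organization would like to collect
    required_fields: List[str]  # the minimal set it cannot do without

    def problems(self) -> List[str]:
        """Return human-readable gaps in the definition (empty list if none)."""
        issues = []
        if not self.description.strip():
            issues.append("missing description of material and goals")
        if not self.sample_images:
            issues.append("no sample images supplied")
        if not self.required_fields:
            issues.append("no required fields listed")
        # Every required field should also appear in the maximal spec.
        absent = [f for f in self.required_fields if f not in self.maximal_fields]
        if absent:
            issues.append(
                "required fields absent from maximal spec: " + ", ".join(absent)
            )
        return issues


# Invented example: the project, file names, and field names are placeholders.
example = ProjectDefinition(
    title="Example Marriage Register",
    description="Bound register of marriage entries; goal is a searchable name index.",
    sample_images=["page-001.jpg", "page-002.jpg"],
    maximal_fields=["groom_name", "bride_name", "marriage_date", "officiant", "witnesses"],
    required_fields=["groom_name", "bride_name", "marriage_date"],
)

print(example.problems())
```

Keeping the maximal and minimal field lists separate mirrors the request above: the maximal spec shows how far a data-entry template might need to stretch, while the required fields mark what every record must contain.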
We also need
- Commitment that someone from your organization could weigh in on discussions about the example projects and the tool.
Example Project Definitions
The following sample projects have been contributed already (updated April 9):
- Decatur County Marriage Licenses (by TNGenWeb)
- Raleigh News & Observer (by North Carolina State Library)
- Family Group Sheets (by North Carolina State Library)
- Copper King Mine Shareholder Lists (by Northeast Washington Genealogy Society)
- Washington Brick, Lime and Sewer Pipe Company Payroll Ledger (by Clayton/Deer Park Historical Society)
- Stevens County Personal Property Tax Assessment 1898 (by Stevens County Historical Society)
- Colville Volunteer Fire Department 1904-1914 (by Stevens County Historical Society)
- Hotel Lee Ledger (by Stevens County Historical Society)
- Stevens County Road Tax List for District No 2 1886 (by Stevens County Historical Society)
- Terra Cotta Order Book (by Stevens County Historical Society)
- Sépultures à Malte (by Geneanum.com (Généalogie Numérique))
In addition to example indexing project definitions, we need:
- Funding to continue development. Our top priority is building a tool for our funders' indexing projects at FreeREG and FreeCEN. Building features outside of the needs common to those projects will require more funds. Any donations made to FreeBMD (a charity registered in England and Wales, Number 1096940) which are designated for development will be applied to the project. Direct funding of development efforts is an alternative which may be the best way to enhance or customise the tool to support your own project.
- Code contributions and help with design and programming.
- Publicity and endorsement to spread the word about Open Source Indexing.
Many documents contain structured data, so they are more usefully digitized into searchable databases than into text editions. These include:
- Muster rolls
- Account books
- Tax lists
- Obituary clippings
- Census forms
- Church registers
- School enrollments
We're basing our online indexing tool on Scribe, a tool developed by the Citizen Science Alliance from their Old Weather project and deployed by the Bodleian Library for What's the score at the Bodleian. More recently, Scribe has been customized by New York Public Library Labs for their Ensemble database of the performing arts.
We're augmenting the Scribe transcription system by adding a database that allows users to search and view records created by the indexing tool. We're also adding support for offline/legacy transcripts imported via CSV files. Improvements to the data-entry UI and a system for reporting on indexing activity and managing volunteers will round out the effort. (See the data flow diagram.)
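As a rough illustration of the CSV import and search steps, the sketch below loads legacy transcripts from CSV text, sets aside rows that lack required fields, and runs a simple case-insensitive search over the rest. It is not the project's code: the function names, column names, and sample rows are invented for the example, and a real import would write to a database rather than a Python list.

```python
import csv
import io


def import_transcripts(csv_text, required_fields):
    """Parse CSV text into records, separating rows that lack required fields.

    Returns (accepted, rejected); rejected holds (data_row_number, missing_fields).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    accepted, rejected = [], []
    for row_number, row in enumerate(reader, start=1):
        # Short rows yield None values and surplus cells land under a None key,
        # so normalise values and skip the None key before checking anything.
        cleaned = {k: (v or "").strip() for k, v in row.items() if k is not None}
        missing = [f for f in required_fields if not cleaned.get(f)]
        if missing:
            rejected.append((row_number, missing))
        else:
            accepted.append(cleaned)
    return accepted, rejected


def search(records, term):
    """Return records in which any field contains the term, ignoring case."""
    needle = term.casefold()
    return [r for r in records if any(needle in v.casefold() for v in r.values())]


# Invented sample data; the column names are assumptions for this sketch.
SAMPLE = """surname,forename,event_date,parish
Smith,Ann,1801-04-12,Hypothetical Parish
,John,1802-01-03,Hypothetical Parish
Brown,Mary,1803-07-30,Another Parish
"""

accepted, rejected = import_transcripts(SAMPLE, ["surname", "event_date"])
print(len(accepted), "accepted;", len(rejected), "rejected")
print(search(accepted, "smith"))
```

Reporting rejected rows with their row numbers, rather than silently dropping them, gives whoever maintains the offline transcript a way to find and fix the source file.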
The entire system will be released under an Apache license. (In fact, the source code under development already is.)
Open Source Indexing is funded by FreeUKGen (a.k.a. FreeBMD), a UK-based non-profit dedicated to free access to genealogy data. Open Source Indexing is being developed for FreeREG, which offers free indexes to UK parish registers from 1538 to 1837, and FreeCEN, which offers free indexes to the English and Welsh Census. Ben Laurie is the FreeBMD trustee spearheading the effort and advising the project on Open Source and Open Data.
Development and database design are managed by Ben Brumfield (@benwbrum), an independent software engineer with eight years of experience building manuscript transcription tools. Support for offline indexes has been built by Kirk Dawson, PhD, who has volunteered with FreeREG since 1998.
Additional development expertise has been generously donated by Mocavo, the finest genealogy search engine on the planet.
Frequently Asked Questions
Different communities use different terms for transcribing structured data. Editors usually use "transcription", archivists often use "description", and genealogists generally use "indexing". The kinds of transcription activities that result in searchable databases range from abstracts of names and dates from entire articles (themselves not transcribed) to verbatim et literatim copies of every word in every field on a census form. Because "indexing" is the term most common among the most active volunteer communities in the English-speaking world, that's what we've chosen.
Are you going to index the project we give you or just use it for testing?
We're not ready to host indexing projects yet, so that's not why we're asking for the material. Rather, we're trying to test the flexibility of the tool we're building, which means we need realistic examples to test with. We'll probably transcribe the same document over and over again, using different data entry templates and different user interfaces. The resulting index won't be hosted permanently.
Email Ben Brumfield at email@example.com to get involved!