2014/12/31

Analysing 635M lines of GEDCOM

The GEDCOM parser of Genealogie Online needed a rewrite. The code base had grown out of proportion, resulting in inefficient code and cumbersome maintenance.

A big difference between when I started coding the GEDCOM parser and now is the number of GEDCOM files available: nearly 7,000. This gave me the opportunity to do some analysis (and more testing)!

 

Analysis of versions and character sets

First of all, the headers of all these GEDCOM files were examined to get a feel for which GEDCOM grammars and character sets were used.

| GEDCOM version* | Count |
|---|---|
| 5.5 | 6,339 |
| (undefined) | 248 |
| 5.5.1 | 245 |
| v.1.0.01 Beta | 12 |
| 5.3 | 5 |
| 4.0 | 2 |
| 2.0 | 1 |
| 4 | 1 |
| 5.01 | 1 |
| Total | 6,854 |

* The GEDCOM version as presented in HEAD > GEDC > VERS. I did not check whether the content actually conformed to the presented grammar version. I did manually check the files claiming 5.3, 4, etc.; at first glance they appeared to be plain GEDCOM 5.5.
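
A header scan like this can be sketched in a few lines of Python. This is only an illustration, not the code used for Genealogie Online: the function name `read_header_fields` is made up, and it assumes the lines have already been decoded (the header tags themselves are plain ASCII).

```python
def read_header_fields(lines):
    """Return (version, charset) as declared in the HEAD record.

    Hypothetical helper: reads HEAD > GEDC > VERS and HEAD > CHAR and
    stops at the first level-0 record after the header.
    """
    path = []  # tags of the currently open levels, e.g. ["HEAD", "GEDC"]
    version = charset = None
    for raw in lines:
        # Drop surrounding whitespace and a possible byte order mark.
        parts = raw.strip().lstrip("\ufeff").split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        level, tag = int(parts[0]), parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == 0 and tag != "HEAD":
            break  # the header is over
        path = path[:level] + [tag]
        if path == ["HEAD", "GEDC", "VERS"]:
            version = value
        elif path == ["HEAD", "CHAR"]:
            charset = value
    return version or "(undefined)", charset or "(undefined)"
```

Feeding the result for every file into a `collections.Counter` would produce the kind of counts shown in the table. Note that HEAD > SOUR > VERS (the version of the producing program) is deliberately ignored, because only the path through GEDC is matched.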

The fact that only 3.6% of the GEDCOM files identify themselves as 5.5.1 surprised me, as this version is regarded as the current de facto standard.

It must be noted that a large portion of the GEDCOM files were produced by Dutch family tree programs. But, as can be seen on the Used family tree programs page on Genealogie Online (click on a program name to expand its statistics), only Legacy, MacFamilyTree, Ahnenblatt, PhpGedView and RootsMagic advertise their GEDCOM with the 5.5.1 label.

For the GEDCOM parser the conclusion was clear: support 5.5 (and 5.5rev) and 5.5.1 GEDCOM files.

| Character set | Count |
|---|---|
| ANSI | 4,395 |
| UTF-8 | 1,269 |
| ANSEL | 692 |
| ASCII | 312 |
| (undefined) | 95 |
| WINDOWS | 27 |
| IBMPC | 26 |
| MACINTOSH | 21 |
| IBM WINDOWS | 11 |
| UNICODE | 5 |
| windows-1251 | 1 |
| Total | 6,854 |

The number of files claiming to be UTF-8 is funny, because UTF-8 was only introduced in GEDCOM 5.5.1. So 1,269 files claimed to be UTF-8, while only 245 files claimed to be GEDCOM 5.5.1. This puts the low 3.6% in another perspective…

Fortunately, I could re-use the code from the old GEDCOM parser that correctly handles character sets and encoding (it was a solid piece of code).
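
To give an idea of what character set handling involves, here is a rough Python sketch. It is not the re-used parser code. The mapping from declared CHAR values to Python codecs is an assumption (in particular treating ANSI, WINDOWS and IBM WINDOWS as Windows-1252 and IBMPC as code page 437), and ANSEL is left out because the Python standard library has no codec for it.

```python
# Assumed mapping from the declared HEAD > CHAR value to a Python codec.
CODECS = {
    "ANSI": "cp1252",
    "WINDOWS": "cp1252",
    "IBM WINDOWS": "cp1252",
    "UTF-8": "utf-8-sig",
    "ASCII": "ascii",
    "IBMPC": "cp437",
    "MACINTOSH": "mac_roman",
    "UNICODE": "utf-16",
    "WINDOWS-1251": "cp1251",
}


def decode_gedcom(data: bytes, declared: str) -> str:
    """Decode raw GEDCOM bytes, trusting a UTF-8 BOM over the declared value."""
    if data.startswith(b"\xef\xbb\xbf"):
        # A UTF-8 byte order mark is stronger evidence than the header claim.
        return data.decode("utf-8-sig")
    codec = CODECS.get(declared.strip().upper())
    if codec is None:
        # ANSEL and unknown labels need a dedicated converter.
        raise LookupError(f"no codec known for character set {declared!r}")
    # Replace undecodable bytes instead of rejecting the whole file.
    return data.decode(codec, errors="replace")
```

Using `errors="replace"` fits a forgiving parser: a few damaged characters are shown as replacement characters rather than causing the upload to fail.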

Note: Tim Forsythe publishes similar stats from GigaTrees, which paint a more American picture (for example: 14.4% GEDCOM 5.5.1).

 

Analysis of actual use

The old GEDCOM parser also included support for invalid GEDCOM tags and custom GEDCOM tags. Although I wrote the article GEDCOM files which don’t adhere to the GEDCOM standard shouldn’t be allowed to be called GEDCOM, for Genealogie Online I’m more forgiving. I want to present the genealogical data of my users and don’t want to bother them too much with the fact that their family tree program isn’t producing valid GEDCOM. But which of the invalid and custom tags should the new GEDCOM parser support?

I decided to read all the GEDCOM files and count the tag-sequence uses. This resulted in a CSV file which looks like:

INDI-BIRT-AGE,45
INDI-BIRT-AGNC,1820
INDI-BIRT-DATE-ANC,162
INDI-BIRT-DATE-NOTE,172764
INDI-BIRT-DATE-NOTE-CONT,11311
INDI-BIRT-DATE-SOUR,39752
INDI-BIRT-DATE-SOUR-DATE,15951
INDI-BIRT-DATE-SOUR-ITEM,16825
INDI-BIRT-DATE-SOUR-PAGE,486
INDI-BIRT-DATE-SOUR-ROLE,36055
INDI-BIRT-DOCTOR,1
INDI-BIRT-EMAIL,1
INDI-BIRT-FAMC,1223
INDI-BIRT-LABL,4092
INDI-BIRT-LATI,49730
INDI-BIRT-LONG,49730
INDI-BIRT-MOON,37
INDI-BIRT-NOTE,720806
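
Counting tag sequences comes down to tracking the open levels while reading a file. The Python sketch below shows one way to do it; it is an illustration rather than the script that produced the CSV, and the names `count_tag_sequences` and `to_csv` are made up. It also counts the record tag itself (for example `INDI`), which may differ from the original output.

```python
import csv
import io
from collections import Counter


def count_tag_sequences(lines, counts=None):
    """Add the tag sequences found in one GEDCOM file to `counts`."""
    counts = Counter() if counts is None else counts
    path = []  # tags of the currently open levels
    for raw in lines:
        parts = raw.strip().split(" ", 3)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        level = int(parts[0])
        # Records may carry an xref id before the tag: "0 @I1@ INDI".
        has_xref = parts[1].startswith("@") and len(parts) > 2
        tag = parts[2] if has_xref else parts[1]
        path = path[:level] + [tag]
        counts["-".join(path)] += 1
    return counts


def to_csv(counts):
    """Serialise the counts as 'sequence,count' rows, sorted by sequence."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(sorted(counts.items()))
    return out.getvalue()
```

Passing the same `Counter` to `count_tag_sequences` for every file accumulates the totals across the whole collection before `to_csv` is called once at the end.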

The next step in the analysis was visualisation of this file. I opted for my favourite JavaScript module, D3.js, which provides a cool collapsible tree. The result is available to all those interested on the GEDCOM tag usage page (also downloadable and re-usable under a CC-BY license).

The colour of the node indicates whether the tag-sequence is valid under the GEDCOM 5.5 grammar (red: 83.7%) or not (grey: 16.2%). This aspect of the visualisation is not completely accurate, as not all GEDCOM files are version 5.5 (the actual version wasn't taken into account).
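
A tree visualisation needs nested data, while the CSV is flat. The Python sketch below shows one possible conversion into the `name`/`children` shape that D3's tree examples commonly use. The function name, the root label "GEDCOM" and the `count` field are assumptions, not the format of the published JSON.

```python
import json


def build_tag_tree(rows):
    """Turn (sequence, count) rows into a nested name/children structure."""
    root = {"name": "GEDCOM", "count": 0, "children": {}}
    for sequence, count in rows:
        node = root
        for tag in sequence.split("-"):
            # Create intermediate nodes on demand; dicts keep insertion order.
            node = node["children"].setdefault(
                tag, {"name": tag, "count": 0, "children": {}}
            )
        node["count"] = int(count)
    return _as_lists(root)


def _as_lists(node):
    """Convert the helper dicts of children into lists, as a tree layout expects."""
    out = {"name": node["name"], "count": node["count"]}
    children = [_as_lists(child) for child in node["children"].values()]
    if children:
        out["children"] = children
    return out


def tree_to_json(rows):
    """Serialise the tree so a browser script can load it."""
    return json.dumps(build_tag_tree(rows))
```

A validity flag per node (for the red/grey colouring) could be added in the same pass by looking up each sequence in a set of sequences allowed by the grammar.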

These tag trees give a good picture of usage. If an invalid or custom tag is used a lot, I would look into implementing it in the GEDCOM parser.

For fun I also made selections for the top-10 programs used by Genealogie Online users. This way, you can see which program has more or fewer invalid/custom tags…

For my own reference I also made tag trees for GEDCOM 5.5 (the “2 January 1996” version) and GEDCOM 5.5.1. Building the 5.5 tree was hindered by the fact that «NOTE_STRUCTURE» references «SOURCE_CITATION» and vice versa, thus introducing a loop!
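
Such a loop makes a naive expansion of the grammar run forever. One common way around it is to remember which structures are already open on the current branch and stop when one comes back. The Python sketch below demonstrates this on a tiny grammar. The `GRAMMAR` dictionary is a heavily simplified, hypothetical excerpt, and this is not necessarily how the published tag trees were built.

```python
# Hypothetical, heavily simplified grammar excerpt:
# structure name -> list of (tag, substructures allowed below that tag).
GRAMMAR = {
    "BIRTH": [("BIRT", ["NOTE_STRUCTURE", "SOURCE_CITATION"])],
    "NOTE_STRUCTURE": [("NOTE", ["SOURCE_CITATION"])],
    "SOURCE_CITATION": [("SOUR", ["NOTE_STRUCTURE"])],
}


def expand(structure, prefix=(), open_structures=()):
    """Yield each tag sequence below `structure`, cutting off grammar loops."""
    if structure in open_structures:
        return  # already being expanded higher up this branch: stop here
    for tag, substructures in GRAMMAR[structure]:
        path = prefix + (tag,)
        yield "-".join(path)
        for sub in substructures:
            yield from expand(sub, path, open_structures + (structure,))
```

With this guard each structure appears at most once along any branch, so the NOTE and SOUR nesting is shown one level deep in both orders and then stops.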

The data used for all of these tag trees is also downloadable in CSV and JSON format under a CC-BY license.

The end result, besides nice visualizations, is a leaner, more robust and more complete GEDCOM parser for Genealogie Online! Users will notice better support for and presentation of sources and notes, and for some programs the use of RIN for identification of persons.