Tuesday, 11 February 2020

How big is the FamilyTreeDNA database?

This guest post from Martin McDowell describes a new method for estimating the size of FTDNA's autosomal DNA database based on a clever analysis of kit number prefixes. The estimated database size based on this analysis is between about 1.76 and 1.88 million, much higher than previous estimates ...

Family Tree DNA Database Size

As Family Tree DNA traditionally does not release a figure on the size of its autosomal DNA database, I decided to look at the various kit numbering systems to see if I could come up with an assessment of database size that takes into account its predominance in some countries around the world (such as the North of Ireland).

Luckily, kit numbers at Family Tree DNA are consecutive, and we also know which prefixes they use ... https://isogg.org/wiki/FTDNA_kit_prefixes

New estimates for the FTDNA database size are larger than previous estimates

Any attempt to estimate database size based solely on a comparison of matches across the various companies isn’t going to be representative, because FTDNA has a large international component that may not show up in any individual’s list of matches. FTDNA also has a different way of working out exactly what constitutes a match. Another factor that needs to be taken into consideration is that many people who transferred from another company did not receive their full list of matches, as for a period of time those testers only received matches up to a 3rd-5th cousin level.

So looking at tests in the system is a much more accurate way of estimating how many people are in the database. An additional complication is that FTDNA has some people in its database who have only taken a Y-DNA or mtDNA test, but luckily they do report these numbers so we can take this into account.

I have used the kit prefixes to calculate a database size that takes into account autosomal testers around the world as well as in the US market. I used the highest kit ID numbers I could find for each prefix in the North of Ireland DNA Project (n = 4629). Using this method, I found kit numbers in excess of ...
  1. 925,000 (non-prefix kits) 
  2. 84,000 (IN kits) ... International - a test kit that was ordered through the FTDNA website alone (not with other kits) that is being shipped internationally 
  3. 67,000 (MK kits) ... Multi Kit - a test ordered through the FTDNA website at the same time as several other kits, all of which are being shipped domestically 
  4. 54,000 (MI kits) ... Multi Kit International - a test ordered through the FTDNA website at the same time as several other kits, all of which are being shipped internationally 
  5. 32,000 (AM kits) ... test was ordered through Amazon.com 
  6. 27,000 (BP kits) ... Basic Packaging. Kits sent out in the basic plastic packaging rather than the colourful cardboard box 
  7. 271,000 (N kits) ... transfer from the National Geographic Genographic Project 
  8. 612,000 (B kits) ... transfer of Y-DNA or autosomal results through a lab transfer program (i.e. from AncestryDNA, 23andMe, or MyHeritage)
  9. 71,300 (all other prefixes) ... I searched the public Results Pages of a variety of haplogroup & geographic projects to try to identify the highest kit number for the remaining (19) prefixes. Those for which no kits could be found were assigned a value of zero.
    • A kits ... highest number > A2700 (in Jewish DNA project)
    • E kits ... highest number > E37900 (in Europe East project)
    • K kits ... highest number > K2400 (in Kazakhstan DNA project)
    • M kits ... highest number > M11400 (in Arab DNA project) 
    • T kits ... highest number > T1900 (in Libya DNA project)
    • U kits ... highest number > U4000 (in British Isles by county project)
    • V kits ... highest number > V7400 (in Jewish DNA project)
    • Z kits ... highest number > Z3600 (in Brazil DNA project)
    • all others ... zero
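
The "highest kit number per prefix" step can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for this post: the function name `highest_per_prefix` and the kit ID pattern (optional letters followed by digits) are assumptions, and the input is simply the eight rounded lower bounds quoted in item 9 above.

```python
import re
from collections import defaultdict

# Assumed kit ID shape: an optional letter prefix followed by digits (e.g. "E37900").
KIT_ID = re.compile(r"^([A-Z]*)(\d+)$")

def highest_per_prefix(kit_ids):
    """Return {prefix: highest kit number seen}; '' stands for non-prefix kits."""
    highest = defaultdict(int)
    for kit_id in kit_ids:
        match = KIT_ID.match(kit_id.strip().upper())
        if match is None:
            continue  # skip anything that does not look like a kit ID
        prefix, number = match.group(1), int(match.group(2))
        highest[prefix] = max(highest[prefix], number)
    return dict(highest)

# Rounded lower bounds quoted in item 9 of the list above.
observed = ["A2700", "E37900", "K2400", "M11400", "T1900", "U4000", "V7400", "Z3600"]

other_prefixes = highest_per_prefix(observed)
other_total = sum(other_prefixes.values())
print(f"{other_total:,}")  # 71,300
```

Because kit numbers within a prefix are consecutive, the highest number seen is treated as a lower bound on the number of kits issued under that prefix. Prefixes with no kits found contribute zero.
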
First I added up the totals for items 1-6 and 9 in the list above. I then reduced the total by 20% to take into account those who neither took a Family Finder test nor migrated atDNA results from another company to an existing kit. In other words, this sum total was reduced to 80% of its value. In my experience well over 80% of test-takers have autosomal results - probably closer to 90% - but I am taking the conservative figure of 80% in order to reduce the risk of overestimating the database size.

I then added in transfer kits from other companies (all of which are autosomal) - these are the B kits in item 8. It is important to include transfers from other companies as they form part of the FTDNA database. This is also how other companies that accept transfers, such as MyHeritage and, of course, Gedmatch, report their database sizes.

Lastly, I added in the N kits (item 7) but applied a more conservative reduction of 50% (instead of the 20% reduction used for items 1-6 and 9).

Thus, the actual numbers were as follows:
  • sum of items 1-6 and 9 = 1,260,300
  • 80% of above total = 1,008,240
  • plus item 8 (612,000) = 1,620,240
  • plus 50% of item 7 (50% of 271,000 = 135,500) = 1,755,740 (sum total)

So based on these kit numbers, and the conservative approach, my estimated total database size for January 2020 comes to 1,755,740. If a 90% figure is used instead of 80%, the total estimate would be 1,881,770. Both these estimates are a lot higher than previous estimates of the FTDNA database size.
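
For anyone who wants to rerun the arithmetic with different assumptions, here is a minimal Python sketch of the calculation. The figures are those quoted in this post; the function and variable names are my own, and the percentages are passed as whole numbers so the sums stay in integer arithmetic.

```python
# Highest kit numbers quoted in the post, by item in the list.
ITEMS_1_TO_6_AND_9 = [925_000, 84_000, 67_000, 54_000, 32_000, 27_000, 71_300]
B_KITS = 612_000  # item 8: transfers from other companies, counted in full
N_KITS = 271_000  # item 7: Genographic Project transfers

def estimate_database_size(autosomal_pct, genographic_pct=50):
    """Estimate the autosomal database size from the kit-number totals."""
    base = sum(ITEMS_1_TO_6_AND_9)                 # 1,260,300
    autosomal = base * autosomal_pct // 100        # share with autosomal results
    genographic = N_KITS * genographic_pct // 100  # share of N kits counted
    return autosomal + B_KITS + genographic

for pct in (80, 90):
    print(f"{pct}% autosomal share: {estimate_database_size(pct):,}")
# 80% autosomal share: 1,755,740
# 90% autosomal share: 1,881,770
```

Changing `autosomal_pct` or `genographic_pct` shows how sensitive the total is to the two reductions. The B kits are not scaled at all.
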

Whilst this estimate still places Family Tree DNA below the big three, it does show its importance in the marketplace, particularly in the countries and regions where its kits make up a sizeable proportion of DNA tests taken (such as the North of Ireland).

Martin McDowell
NIFHS, Feb 2020

Martin McDowell is Project Administrator for the North of Ireland DNA Project

1 comment:

  1. Just a quick note about the Kit Prefix numbers. My kit for example is a B prefix because I had uploaded my Ancestry results, but later decided to do the FF test at FTDNA (I'm an adoptee and wanted to test with all companies). When they sent the kit and posted my results, they posted them under the original B prefix and simply replaced the Ancestry results with my FF results. Just an f.y.i.

    Additionally, since that time I've done Y37, mt, and upgraded the 37 to 67, 111, and recently Y-700, all under the same B prefix kit number.