Guessing Genders

Screen Shot 2013-05-17 at 1.33.33 AM

I wanted to make a reasonable guess of gender for approximately 6,000,000 names. My first thought was to seek out a ready-made API, but the first few prices I found were astronomical, for a one-off personal project.  As such, I decided to hack something together myself.

Step 1: Get Name Frequency Data

I looked at data from both the US Census Bureau and the Social Security Administration. The SSA had better (more recent and more detailed) data, so I went with the SSA's data, available at http://www.ssa.gov/oact/babynames/limits.html.

Step 2: Create a Database

I decided to use SQLite for this one, since it's super-lightweight and doesn't require ongoing infrastructure.

sqlite3 namedb
CREATE TABLE fnames (name TEXT, gender CHAR(1), cnt INT, yob INT);

Step 3: Populate the database

The SSA distributed this as a ZIP file containing 133 CSV files in a pleasantly straightforward format, so dropping them into the SQLite database was dead easy.

#!/usr/bin/env ruby 
require 'sqlite3'

begin
  dbh = SQLite3::Database.open './namedb'
  Dir.glob('names/yob*.txt') do |fh|
    yob = fh.match(/[12][890][0-9][0-9]/).to_s
    File.open(fh).each_with_index do |l,i|
      print '.' if i % 100 == 0
      f_name, gender, cnt = l.chomp.split(',')
      dbh.execute("INSERT INTO fnames (name, gender, cnt, yob) VALUES (?, ?, ?, ?)", [f_name, gender, cnt, yob])
    end
    puts yob
  end
rescue SQLite3::Exception => e
  puts "SQLITE Exception: #{e}"
ensure
  dbh.close if dbh
end

Step 4: Create Indexes

The above created a usable database, but I had a couple million queries to do, so I needed to make sure it was reasonably well indexed. I created a simple index on the relevant columns, and used EXPLAIN to ensure that it was actually being used as expected.

CREATE INDEX ng_idx ON fnames (name, gender);

Step 5: Calculate a gender probability score

For my actual purpose, there's no benefit to adjusting the names by the probability that each cohort of women and men are still alive, so I'm just doing a straight ratio to calculate a probability that the individual is a woman.

SELECT 
    xx.n * 1.0 / pop.n AS p_woman
FROM 
    ( 
        SELECT 
            SUM(cnt) AS n 
        FROM 
            fnames 
        WHERE 
            gender='F' AND
            name=?  
    ) AS xx, 
    ( 
        SELECT 
            SUM(cnt) AS n 
        FROM 
            fnames 
        WHERE 
            name=?) AS pop;