Does anyone know of a good library for mapping a person's name to his or her gender?

Database, Language Agnostic

Database Problem Overview


I am looking for a library or database that can provide guesses about whether a person is male or female based on his or her name or nickname. Something like

john => "M",
mary => "F",
alex => "A", #ambiguous

I am looking for something that supports names other than English names (such as Japanese, Indian, etc.).

Before I get another answer along the lines of "you are going to offend people by assuming their sex/gender", let me be clear: my application does not interact with anyone. It does not send emails or contact anyone in any way. There are no users to ask. In many cases, the person in question is dead, and the only information I have is name, birth date, and date of death. The reason I want to know the sex of the individual is to make the grammar of the output nicer and to aid in possible searches that may come later.

Database Solutions


Solution 1 - Database

gender.c is an open source C program that does a good job. It comes with data for 44568 first names from all around the world. There is good documentation and a description of the file format (basically plain text), so it should not be too difficult to read it from your own application.

Here is what the author says:

> A few words on quality of data
>
> The dictionary of first names has been prepared with utmost care.
> For example, the Turkish, Indian and Korean names in this dictionary
> have all been independently classified by several native speakers.
> I also took special care to list only those names which can currently
> be found.
>
> The lesson from this?
>
> Any modifications should be done very cautiously (and they must also
> adhere to the sorting required by the search algorithm).
> For example, knowing that "Sascha" is a boy's name in Germany,
> the author never assumed the English "Sasha" to be a girl's name.
> Knowing that "Jan" is a boy's name in Germany, I never assumed it to be
> also a English short form of "Janet".
> Another case in point is the name "Esra". This is a boy's name in
> Germany, but a girl's name in Turkey.

The program calculates a probability for the name being male or female. It can do so with the name alone as input or with the name and country of origin, which gives significantly better results.
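To illustrate why the country of origin helps, here is a minimal Python sketch of a country-aware lookup. It is not how gender.c works internally; the table, the country codes, and the function name `guess` are hypothetical, and the few entries are taken from the author's remarks quoted above.

```python
# Hypothetical mini-table keyed by (name, country code); entries follow the
# examples in the quoted text ("Esra", "Jan", "Sascha"), nothing more.
BY_COUNTRY = {
    ("esra", "DE"): "M",
    ("esra", "TR"): "F",
    ("jan", "DE"): "M",
    ("sascha", "DE"): "M",
}


def guess(name, country=None):
    """Return 'M', 'F', 'A' (ambiguous) or '?' (unknown)."""
    key = name.lower()
    if country is not None:
        # With a country, look up the exact (name, country) pair.
        return BY_COUNTRY.get((key, country.upper()), "?")
    # Without a country, collect every verdict recorded for the name.
    verdicts = {g for (n, _), g in BY_COUNTRY.items() if n == key}
    if not verdicts:
        return "?"
    return verdicts.pop() if len(verdicts) == 1 else "A"


if __name__ == "__main__":
    print(guess("Esra"))        # conflicting entries, so ambiguous
    print(guess("Esra", "TR"))  # resolved once the country is known
```

With only the name, "Esra" comes back as ambiguous because the table holds conflicting entries; adding the country resolves it.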

You can download it from the website of the German computer magazine c't: 40 000 Namen (http://www.heise.de/ct/ftp/07/17/182/). The article is in German, but don't worry: all the documentation is in English. If you are not interested in the article, there is a direct FTP link to 0717-182.zip. The zip file contains the source code, a Windows executable, the database, and the documentation.

Solution 2 - Database

The gender of a name is something that cannot be inferred programmatically in the general case. You need a name database. Here is a free name database from the US Census Bureau.

EDIT: The link for the 2010 name data is dead, but there are working links and libraries in the comments.

Solution 3 - Database

"I tell ya, life ain't easy for a boy named 'Sue.'"

...So, why make it any harder? If you need to know the sex, just ask... Otherwise, don't worry about it.

Solution 4 - Database

I've built a free API that gives a probabilistic guess at the gender based on a first name. Instead of using any of the approaches mentioned above, I use a huge dataset of profiles from social networks to provide a probabilistic guess along with a certainty factor. It also supports optional filtering by country or language ID. It's getting better by the day as more profiles are added to the dataset.

It's free to use at http://genderize.io

One thing you should consider is using a tool that takes demographics into account, as naming conventions depend heavily on them.

Example

http://api.genderize.io?name=kim
{"name":"kim","gender":"female","probability":"0.89","count":1440}

http://api.genderize.io?name=kim&country_id=dk
{"name":"kim","gender":"male","probability":"0.95","count":44,"country_id":"dk"}
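A minimal Python sketch of how the requests and responses above could be handled. It only builds the URL and parses a response body shaped like the samples shown; the function names are hypothetical, and the live API's current format may differ from these samples.

```python
import json
from urllib.parse import urlencode

API_BASE = "http://api.genderize.io"  # endpoint as written in the example above


def build_url(name, country_id=None):
    """Build a query URL; country_id is the optional filter from the example."""
    params = {"name": name}
    if country_id is not None:
        params["country_id"] = country_id
    return API_BASE + "?" + urlencode(params)


def parse_response(body):
    """Turn a JSON response body into (gender, probability, count)."""
    data = json.loads(body)
    # The sample responses quote the probability as a string, so convert it.
    return data["gender"], float(data["probability"]), data["count"]


if __name__ == "__main__":
    print(build_url("kim", "dk"))
    sample = ('{"name":"kim","gender":"male","probability":"0.95",'
              '"count":44,"country_id":"dk"}')
    print(parse_response(sample))
```

Keeping the probability and count alongside the gender lets the caller decide when a guess is too weak and should be treated as ambiguous.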

Solution 5 - Database

Here are two oddball approaches that may not even work, and likely wouldn't work en masse without violating the terms of a license:

  1. Use the Facebook API (which I know virtually nothing about, it may not even be possible) to perform two searches: one for FB male users with that first name, and one for female. Use the two numbers to decide the probability of gender.

  2. Much looser but more scalable, use the Google API and search for the name plus the gender-specific pronouns, and compare the numbers. For instance, there are 592,000,000 results for searching for "Richard his" (not as a phrase), but only 179,000,000 for "Richard her".
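Whichever source the two counts come from, turning them into a guess is simple arithmetic. A rough Python sketch, using the "Richard" numbers above; the function names and the 0.7 threshold are arbitrary assumptions, and fetching the counts is left out entirely.

```python
def male_probability(male_hits, female_hits):
    """Share of hits that co-occur with the masculine pronoun."""
    total = male_hits + female_hits
    if total == 0:
        return None  # no evidence either way
    return male_hits / total


def classify(male_hits, female_hits, threshold=0.7):
    """Map the ratio to 'M', 'F' or 'A' (ambiguous); the threshold is arbitrary."""
    p = male_probability(male_hits, female_hits)
    if p is None:
        return "A"
    if p >= threshold:
        return "M"
    if p <= 1 - threshold:
        return "F"
    return "A"


if __name__ == "__main__":
    # Counts quoted above for "Richard his" and "Richard her".
    print(male_probability(592_000_000, 179_000_000))
    print(classify(592_000_000, 179_000_000))
```

For the quoted counts the male share is about 0.77, which clears the threshold chosen here.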

Solution 6 - Database

Given your stated constraints, your best option is to re-phrase whatever it is you're writing to be gender-neutral unless you know what gender they want to be called in each instance.

If writing in English, remember that singular “they” is grammatically fine as a gender-neutral third-person singular pronoun.

A good example is the title of this question. As is currently:

… mapping a person's name to his or her sex?

That would be less awkward if written:

… mapping a person's name to their sex?

Solution 7 - Database

It's also poor practice to assume that users must be male or female. There is a small but significant number of "intersex" people, most of whom are heartily sick of not having a box to tick.

bignose: interesting on the "singular they". I didn't realize it had such a long history.

Solution 8 - Database

It's not a service, but a little app with a database:
http://www.codeproject.com/KB/cpp/genderizer.aspx

And this tool is in German:
http://www.faq-o-matic.net/2011/06/01/zu-einem-vornamen-das-geschlecht-finden/

And another one in VB:
http://www.vbarchiv.net/tipps/tipp_1925-geschlecht-anhand-des-vornamens-ermitteln.html

I think in combination with some "Most used firstname in 2011" lists you should be able to build something decent.

Solution 9 - Database

The Python package SexMachine will do that for you. Given any first name, it returns whether it's male, female or unisex. It relies on the data from the gender.c program by Jorg Michael.

Solution 10 - Database

The only thing you'll get from trying to automate it is a bunch of unhappy users. From that census data:

> JAMES, JOHN, ROBERT, MICHAEL, WILLIAM, DAVID, RICHARD, CHARLES, JOSEPH, THOMAS, CHRISTOPHER, DANIEL, PAUL, MARK, DONALD, GEORGE, KENNETH, STEVEN, EDWARD, BRIAN, RONALD, ANTHONY, KEVIN, JASON, MATTHEW, GARY, TIMOTHY, JOSE, LARRY, JEFFREY, FRANK, SCOTT, ERIC, STEPHEN, ANDREW, RAYMOND, GREGORY, JOSHUA, JERRY, DENNIS, WALTER, PATRICK, PETER, HAROLD, HENRY, CARL, ARTHUR, RYAN, JOE, JUAN, JACK, ALBERT, JUSTIN, TERRY, GERALD, KEITH, SAMUEL, WILLIE, LAWRENCE, ROY, BRANDON, ADAM, FRED, BILLY, LOUIS, JEREMY, AARON, RANDY, EUGENE, CARLOS, RUSSELL, BOBBY, VICTOR, MARTIN, JESSE, SHAWN, CLARENCE, SEAN, CHRIS, JOHNNY, JIMMY, ANTONIO, TONY, LUIS, MIKE, DALE, CURTIS, NORMAN, ALLEN, GLENN, TRAVIS, LEE, MELVIN, KYLE, FRANCIS, JESUS, RAY, JOEL, EDDIE, TROY, ALEXANDER, MARIO, FRANCISCO, MICHEAL, OSCAR, JAY, ALEX, JON, RONNIE, TOMMY, LEON, LEO, WESLEY, DEAN, DAN, LEWIS, COREY, MAURICE, VERNON, ROBERTO, CLYDE, SHANE, SAM, LESTER, CHARLIE, TYLER, GENE, BRETT, ANGEL, LESLIE, CECIL, ANDRE, ELMER, GABRIEL, MITCHELL, ADRIAN, KARL, CORY, CLAUDE, JAMIE, JESSIE, CHRISTIAN, LONNIE, CODY, JULIO, KELLY, JIMMIE, JORDAN, JAIME, CASEY, JOHNNIE, SIDNEY, JULIAN, DARYL, VIRGIL, MARSHALL, PERRY, MARION, TRACY, RENE, FREDDIE, AUSTIN, JACKIE, JOEY, EVAN, DANA, DONNIE, SHANNON, ANGELO, SHAUN, LYNN, CAMERON, BLAKE, KERRY, JEAN, IRA, RUDY, BENNIE, ROBIN, LOREN, NOEL, DEVIN, KIM, GUADALUPE, CARROLL, SAMMY, MARTY, TAYLOR, ELLIS, DALLAS, LAURENCE, DREW, JODY, FRANKIE, PAT, MERLE, TERRELL, DARNELL, TOMMIE, TOBY, VAN, COURTNEY, JAN, CARY, SANTOS, AUBREY, MORGAN, LOUIE, STACY, MICAH, BILLIE, LOGAN, DEMETRIUS, ROBBIE, KENDALL, ROYCE, MICKEY, DEVON, ASHLEY, CAREY, SON, MARLIN, ALI, SAMMIE, MICHEL, RORY, KRIS, AVERY, ALEXIS, GERRY, STACEY, CARMEN, SHELBY, RICKIE, BOBBIE, OLLIE, DENNY, DION, ODELL, MARY, COLBY, HOLLIS, KIRBY, CRUZ, MERRILL, LANE, CLEO, BLAIR, NUMBERS, CLAIR, BERNIE, JOAN, DOMINIQUE, TRISTAN, JAME, GALE, LAVERNE, ALVA, STEVIE, ERIN, AUGUSTINE, YOUNG, JOHNIE, ARIEL, DUSTY, LINDSEY, TRACEY, SCOTTIE, 
SANDY, SYDNEY, GAIL, DORIAN, LAVERN, REFUGIO, IVORY, ANDREA, SANG, DEON, CAROL, YONG, BERRY, TRINIDAD, SHIRLEY, MARIA, CHANG, ROSARIO, DANNIE, FRANCES, THANH, CONNIE, TORY, LUPE, DEE, SUNG, CHI, QUINN, MINH, THEO, LOU, CHUNG, VALENTINE, JAMEY, WHITNEY, SOL, CHONG, PARIS, OTHA, LACY, DONG, ANTONIA, KELLEY, CARROL, SHAYNE, VAL, JUDE, BRITT, HONG, LEIGH, GAYLE, JAE, NICKY, LESLEY, MAN, KASEY, JEWELL, PATRICIA, LAUREN, ELISHA, MICHAL, LINDSAY, and JEWEL

are all names that work for both males and females. If a girl's name is Robert and everyone, including your software, keeps on calling her a man, she'd be rather pissed.

Solution 11 - Database

Although databases are probably the most practical solution, if you want to have some fun maybe you could try writing a neural net (or using a neural net library) that takes in the name and outputs one of those 3 options (F,M,A).

You could train it using the datasets that exist in the databases suggested by other answers, as well as with any other data you have.

This solution would allow you to handle names not specifically categorised previously, and also handle different languages. You might want to pass the language (if you know it) as an input to the neural net as well.

I don't know that I can say neural nets (or any other machine learning) would do a good job of categorising though.
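For the curious, here is a toy Python sketch of the idea: a single logistic unit (about the smallest thing that can be called a neural net) trained on name endings. The feature choice, the training list, and every function name are assumptions made for illustration; a serious attempt would train on one of the name databases mentioned in other answers.

```python
import math

# Tiny hand-made training set (1 = female, 0 = male); purely illustrative.
TOY_DATA = [
    ("maria", 1), ("sandra", 1), ("ivana", 1), ("petra", 1),
    ("john", 0), ("richard", 0), ("robert", 0), ("james", 0),
]


def features(name):
    """Represent a name by its last one, two and three letters."""
    n = name.lower()
    return {"s1=" + n[-1:], "s2=" + n[-2:], "s3=" + n[-3:]}


def score(model, name):
    """Probability of 'female' according to the logistic unit."""
    weights, bias = model
    z = bias + sum(weights.get(f, 0.0) for f in features(name))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, lr=0.5):
    """Fit the weights by stochastic gradient descent on the log-loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for name, label in examples:
            err = label - score((weights, bias), name)
            bias += lr * err
            for f in features(name):
                weights[f] = weights.get(f, 0.0) + lr * err
    return weights, bias


def predict(model, name, margin=0.2):
    """Return 'F', 'M' or 'A' when the score is too close to 0.5."""
    p = score(model, name)
    if p >= 0.5 + margin:
        return "F"
    if p <= 0.5 - margin:
        return "M"
    return "A"


if __name__ == "__main__":
    model = train(TOY_DATA)
    for name, _ in TOY_DATA:
        print(name, predict(model, name))
```

The margin around 0.5 is what produces the third option, "A", for names the model is unsure about.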

Solution 12 - Database

It's culture- and region-dependent. Take Andrea: in Italy it is only a masculine name, while in Sweden it is a female name and Andreas is the male form. Shawn is ambiguous in English. If a language has declension, like Latin or Russian, the final letters will change according to grammatical rules.

Another source of ambiguity is family names that are identical to personal names.

In my opinion it's impossible to solve in general.

Solution 13 - Database

The idea will clearly not work in most languages.

However, if you can tell the nationality beforehand, you may have more luck. In most Slavic languages (e.g. Russian, Polish, Bulgarian) you can safely assume that surnames ending in -va, -cha, -ska (-a in general) are feminine, while those ending in -v, -ch, -shi are masculine.

In fact, every surname has a feminine and a masculine form, depending on the ending. The same names used in other countries (e.g. the US) might use only the masculine form, though.

The same could be said for first names (-a -ya are feminine) but it is not 100% accurate.
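The surname rule above is easy to express in code. A minimal Python sketch, assuming the nationality is already known to be Slavic; the ending lists are copied from the text, and the function name is made up.

```python
# Endings as listed above; "a" covers the "-a in general" case.
FEMININE_ENDINGS = ("va", "cha", "ska", "a")
MASCULINE_ENDINGS = ("v", "ch", "shi")


def guess_from_slavic_surname(surname):
    """Return 'F', 'M' or 'A' (no rule matched) from the surname ending."""
    s = surname.lower()
    # str.endswith accepts a tuple and matches any of the endings.
    if s.endswith(FEMININE_ENDINGS):
        return "F"
    if s.endswith(MASCULINE_ENDINGS):
        return "M"
    return "A"


if __name__ == "__main__":
    for surname in ("Slavcheva", "Slavchev", "Smith"):
        print(surname, guess_from_slavic_surname(surname))
```

Anything that matches no ending falls through to "A", which is the honest answer for surnames from outside these languages.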

But in general you would hardly get a library that is sufficiently accurate.

Solution 14 - Database

I haven't used it, but IBM has a Global Name Analytics library (for a price!) that seems pretty comprehensive.

Solution 15 - Database

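The Z Directory (at vettrasoft.com) has a C-language function that works something like this (called here from C++):

#include <iostream>
// z_guess_sex_byfirstname() is declared in the Z Directory headers
// (the header name is not given here).

void func()
{
    // Returns a one-character code for the guessed sex.
    char c = z_guess_sex_byfirstname("Lon");
    switch (c)
    {
    case 'M': std::cout << "It's a boy!\n"; break;
    case 'F': std::cout << "It's a girl!\n"; break;
    case 'B': std::cout << "this name is for both sexes\n"; break;
    case '?': std::cout << "sex unknown sorry\n"; break;
    }
}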

It's database-driven; the table has something like 10,000+ names, I think, but you need to download and install the Z Directory (which includes many other topo items like countries, geographical landmarks, airports, states, area codes, and postal/zip codes, along with C++ functions and objects to access the data). However, the names are very English-language oriented. The table is a work in progress and is gradually updated.

Solution 16 - Database

Name-gender maps can work, but in multicultural countries it's more like guessing. I can give you one example: Marian is a typical masculine name in Polish, whereas the same name in Great Britain is a female name. In an era of people immigrating all over the world, I'm not sure such a database would be very accurate. Good luck!

Solution 17 - Database

Some cultures have unisex names, like mine. What do you do then? I think the answer is plain and simple: don't assume, because you could cause offence. Just ask if it's needed; otherwise stay gender-neutral.

Solution 18 - Database

Well, not anymore. IBM patented that idea a while ago.

So if you're looking for any level of flexibility (something other than a list of names), you'll either have to (gasp!) ask the user, or simply pay IBM for the rights :)

In any case, such autodetection is annoying for many people who have gender-ambiguous names, or even just mean parents. Let's not make this any harder for them.

Solution 19 - Database

It's not free, but this is a nice library that I have used before:

> NetGender for .NET allows you to quickly and easily build Name Verification, Parsing and Gender Determination into your custom applications. Accurately verify whether a particular field contains a valid individual or company. NetGender uses a 100,000+, ethnically diverse, Name Dictionary in combination with an 8,000+ Company Name Dictionary to ensure precise gender determination.

http://www.softwarecompany.com/dotnet/netgender.htm

Solution 20 - Database

It's interesting that you say you have birth date. That could help. I've seen databases of histories of name popularity.

In the film Splash (1984), it was funny that Daryl Hannah's character chooses the name "Madison" from a Madison Avenue street sign, because obviously "Madison" is not a girl's name.

24 years later, Madison is the 4th most popular name for girl babies!


Name history from the gov't. (Check out Mary's sad decline through the last 100 years.)


When I wrote to the White House as a child, Richard Nixon (or, perhaps, a secretary) responded to me with some photos of the historic place, addressed to "Miss Rhett Anderson." "Miss Rhett?" It doesn't even make sense! Can we REALLY not tell the difference between Clark Gable's Rhett (with a mustache, in Gone With The Wind!) and Vivien Leigh's Scarlett? I shall never forgive him, despite Neil Young's assurance that "even Richard Nixon has got soul."

Solution 21 - Database

I'm pretty sure no such service could exist with an acceptable level of accuracy. Here are the problems which I think are insurmountable:

  • There are plenty of names which are for both men and women.
  • There's a lot of different names in this world, even if you only consider one country.
  • There is the "A Boy Named Sue" issue, raised so eloquently by Johnny Cash :-)

Solution 22 - Database

Solution 23 - Database

You can have a look at my Python gender detection project: https://github.com/muatik/genderizer

It tries to detect authors' genders by looking at their names and/or sample text of theirs (for example, tweets).

It also supports MongoDB and memcached for performance.

Solution 24 - Database

This is not really a programming problem - it comes down to getting a probability table.

AFAIK there are no public databases in distilled forms. You could either build this from census data, or buy the data from someone.

For example, this is someone who sells the probability table for Canada.

Solution 25 - Database

IMHO, it is generally a bad idea to determine sex from an individual's name. A lot of names are intersexual (good grief, is this even a word ?? :-), and they may also belong to one sex in one culture and another in another.

A few stupid examples, just a few that came to mind (from my part of the world, CE)

Vanja - female, in eastern countries from here, mostly male
Alex - intersex (short for Sandra, female, and Sandro, male)
Robin - in western cultures, can be both

In some parts of the world, a person's sex can be determined by looking at how the name ends. For example, Marija, Sandra, Ivana, Petra, Sara, Lucija, Ana: you can see that most of these female names end in "ja" or "ra". There are other examples as well.

Still, I think it's better just to ask the user for sex.

Solution 26 - Database

Got this from a Hacker News discussion about this.

Solution 27 - Database

I know of no such service, however ..

  • you could start with a raw list of person names or
  • guess gender according to some rules (e.g. -o => male, -ela, -a => female)

In some countries (e.g. Germany) the names a person can be given are limited by law. Maybe there are some publications on the matter that could be harvested (but I don't know of any at the moment).

Solution 28 - Database

I know of no such service. You can perhaps find the data you are looking for, however. The US government publishes data about the prevalence of names and the gender of the person they're attached to. The Social Security Administration has such a page, and the census may as well, but I haven't taken the time to look. Perhaps other world governments do similar things.
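As a rough illustration of what could be done with such data, here is a Python sketch that builds a per-name probability table from lines of the form name,sex,count. That line format is an assumption about the published files, and the counts in the sample are invented placeholders, not real statistics.

```python
import csv
from collections import defaultdict

# Placeholder rows in an assumed "name,sex,count" format; counts are made up.
SAMPLE = """Kim,F,75
Kim,M,25
Mary,F,100
"""


def build_table(lines):
    """Sum the counts per name and sex from an iterable of CSV lines."""
    counts = defaultdict(lambda: {"M": 0, "F": 0})
    for name, sex, count in csv.reader(lines):
        counts[name.lower()][sex] += int(count)
    return counts


def female_probability(table, name):
    """Fraction of recorded people with this name who are female, or None."""
    entry = table.get(name.lower())  # .get() does not create missing entries
    if entry is None:
        return None
    total = entry["M"] + entry["F"]
    return entry["F"] / total if total else None


if __name__ == "__main__":
    table = build_table(SAMPLE.splitlines())
    print(female_probability(table, "Kim"))
    print(female_probability(table, "Unknown"))
```

Returning None for names that are not in the data keeps "never seen" separate from "seen and ambiguous".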

Solution 29 - Database

What I would do is make a hack that takes the name and searches for it via the Facebook API, then looks at the resulting users and counts how many of them are female or male. You can then return a percentage. Not so insurmountable anymore. :)

Solution 30 - Database

Just ask people, and if they are nice they will give you their 'M's or 'F's, and if they are not then give 'em an 'A'.

Solution 31 - Database

I think it will be hard to find libraries or databases that cover more than one culture (or a limited few). For instance, here in Sweden the name "Maria" is strongly identified as a female name, while (to the best of my understanding) that connection is not nearly as strong in Spanish-speaking countries.

Solution 32 - Database

Don't bother attempting to determine a person's sex from an entered name. When you get it wrong, you're going to really piss someone off. Letting users enter their sex is simple, so you may as well let them enter it.

Solution 33 - Database

Plus, I don't know, but what about Asian languages, Arabic, and/or any of the thousands of other languages out there?

I mean, sure, I could see a database of names with the percentage that was male versus female, along with additional information such as region, country, and so forth. But ultimately it would have to be quite a complicated and large database to deal with all of the possible names in the world.

Plus, how do you deal with names that aren't in the database? Etc.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Chas. Owens | View Question on Stackoverflow
Solution 1 - Database | Ludwig Weinzierl | View Answer on Stackoverflow
Solution 2 - Database | Ayman Hourieh | View Answer on Stackoverflow
Solution 3 - Database | Shog9 | View Answer on Stackoverflow
Solution 4 - Database | Stromgren | View Answer on Stackoverflow
Solution 5 - Database | richardtallent | View Answer on Stackoverflow
Solution 6 - Database | bignose | View Answer on Stackoverflow
Solution 7 - Database | Karl | View Answer on Stackoverflow
Solution 8 - Database | Remy | View Answer on Stackoverflow
Solution 9 - Database | jm_tagarro | View Answer on Stackoverflow
Solution 10 - Database | nitromaster101 | View Answer on Stackoverflow
Solution 11 - Database | chees | View Answer on Stackoverflow
Solution 12 - Database | Giulio Vian | View Answer on Stackoverflow
Solution 13 - Database | Dimitar Slavchev | View Answer on Stackoverflow
Solution 14 - Database | altan | View Answer on Stackoverflow
Solution 15 - Database | gorth | View Answer on Stackoverflow
Solution 16 - Database | Michal Rogozinski | View Answer on Stackoverflow
Solution 17 - Database | Preet Sangha | View Answer on Stackoverflow
Solution 18 - Database | lfaraone | View Answer on Stackoverflow
Solution 19 - Database | Richard West | View Answer on Stackoverflow
Solution 20 - Database | Nosredna | View Answer on Stackoverflow
Solution 21 - Database | Steve McLeod | View Answer on Stackoverflow
Solution 22 - Database | Chief Kaka | View Answer on Stackoverflow
Solution 23 - Database | Muatik | View Answer on Stackoverflow
Solution 24 - Database | Uri | View Answer on Stackoverflow
Solution 25 - Database | Rook | View Answer on Stackoverflow
Solution 26 - Database | Surya | View Answer on Stackoverflow
Solution 27 - Database | miku | View Answer on Stackoverflow
Solution 28 - Database | rmeador | View Answer on Stackoverflow
Solution 29 - Database | ajayjapan | View Answer on Stackoverflow
Solution 30 - Database | Azder | View Answer on Stackoverflow
Solution 31 - Database | Fredrik Mörk | View Answer on Stackoverflow
Solution 32 - Database | mP. | View Answer on Stackoverflow
Solution 33 - Database | Pharaun | View Answer on Stackoverflow