This is a new version of the genderBR package that includes a new function: get_gender_nn(), which uses a character-level neural network to predict gender from Brazilian first names. This model can generalise to names not present in the IBGE census dataset, so it can be used as a complement to the existing functionality in the package. The release also includes some improvements, tests, and documentation updates.
- `get_gender_nn()` is a new exported function that uses a character-level neural network to predict gender from Brazilian first names. Unlike `get_gender()`, this function can generalise to names not present in the IBGE census dataset.
- Added `clear_nn_cache()` to manage the in-memory model cache.
- Added `download_gender_model()`, an internal function that handles downloading and caching the neural network model weights and vocabulary from Hugging Face.
- Replaced `iconv()` with `chartr()` for stripping accents in name cleaning. The previous approach relied on `iconv(name, to = "ASCII//TRANSLIT")`, which is platform-dependent and returns `NA` on macOS for accented names (e.g., "joão"). The `encoding` argument in `get_gender`, `get_gender_nn`, and `map_gender` is now deprecated and will be removed in a future version.
- Improved test coverage for the new function and edge cases.
- Added `torch` to `Imports`; `luz` and `httr2` to `Suggests`.
- Added support for IBGE's 2022 census data API, updating the default year to 2022 in `get_gender`.
- Internal dataset `nomes` now includes probabilities for 2010 and 2022 (`prob_fem10`, `prob_fem22`) and is used when `internal = TRUE`. This data covers 141,742 unique Brazilian first names.
- Replaced all uses of `%>%` with the base `|>` operator, thus removing the `magrittr` dependency (requires R 4.1.0 or higher).
- Switched data manipulation backend to `data.table` for faster joins and removed `dplyr`/`tibble` dependencies.
- Updated tests to cover new features and changes.
- Added a section on ethical considerations in the README.
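
The switch from `iconv()` to `chartr()` noted in the list above can be illustrated with a minimal sketch. The helper name `strip_accents`, the character table, and the lowercasing step are assumptions made for illustration, not the package's internal code, and the example assumes a UTF-8 session.

```r
# Hypothetical helper (not the package's internal function): maps accented
# Latin characters to ASCII through a fixed one-to-one table, so the result
# does not depend on the platform's iconv implementation.
strip_accents <- function(x) {
  accented <- "áàâãäéèêëíìîïóòôõöúùûüçñ"
  plain    <- "aaaaaeeeeiiiiooooouuuucn"
  # Lowercasing first is an assumption that keeps the table short.
  chartr(accented, plain, tolower(x))
}

cleaned <- strip_accents(c("João", "Conceição", "Maria"))
print(cleaned)
```

Because `chartr()` substitutes character by character from an explicit table, any character missing from the table passes through unchanged instead of turning the whole string into `NA`.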
In this version, a few improvements and bug fixes were introduced. Most importantly, connection errors now return informative messages to users.
- `map_gender` and `get_gender` now return informative error messages when a request times out.
- The `get_gender` function handles non-ASCII characters better.
- Documentation expanded to notify users that IBGE's API does not work with UTF-8 special characters.
- `magrittr`'s pipe exported.
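
As a rough illustration of the informative connection errors mentioned above, the sketch below wraps a request in `tryCatch()` and rethrows a friendlier message. The names `safe_request` and `fake_timeout` and the message wording are hypothetical, and the timeout is simulated rather than produced by a real HTTP call.

```r
# Hypothetical wrapper (not the package's code): runs a request function and
# replaces a low-level failure with a message that suggests what to do next.
safe_request <- function(request_fun) {
  tryCatch(
    request_fun(),
    error = function(e) {
      stop(
        "Could not reach the IBGE API (", conditionMessage(e), "). ",
        "Check your internet connection or try again later.",
        call. = FALSE
      )
    }
  )
}

# Simulated failure standing in for a real HTTP timeout.
fake_timeout <- function() stop("Timeout was reached")

# Capture the rewritten message instead of aborting the script.
result <- tryCatch(safe_request(fake_timeout), error = conditionMessage)
print(result)
```

A successful request passes through untouched, so the wrapper only changes what the user sees when something goes wrong.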
In this minor release, the genderBR package was improved in two ways. First, bugs and some minor issues were fixed, making the package's functions more stable. Second, the package now contains an internal dataset with all the names reported by the IBGE's Census that is used by the get_gender function to predict gender from Brazilian first names. Therefore, classifying a vector with more than 1,000 names takes no more than a few seconds now. Overall, these are the improvements:
- Added a `NEWS.md` file to track changes to the package.
- Added input checks to the `get_gender` function.
- Reduced the time between requests to the IBGE's Census API.
- Fixed a vectorization problem in the internal `round_guess` function.
- Included an internal dataset with all Brazilian first names and their predicted gender extracted from the IBGE.
- Updated the `get_gender` function to work with internal data.
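
To show what a vectorized classification step can look like, here is a minimal sketch. It is not the package's `round_guess`: the function name `round_guess_sketch`, the default threshold of 0.9, and the labels are assumptions chosen for illustration. The point is that logical indexing applies the rule to every element, whereas a scalar `if` condition does not work element-wise.

```r
# Hypothetical sketch (not the package's round_guess): classifies a whole
# vector of female-name probabilities in a single call.
round_guess_sketch <- function(prob, threshold = 0.9) {
  out <- rep(NA_character_, length(prob))
  # Logical indexing handles every element at once; NAs stay NA.
  out[!is.na(prob) & prob >= threshold] <- "Female"
  out[!is.na(prob) & prob <= 1 - threshold] <- "Male"
  out
}

guesses <- round_guess_sketch(c(0.99, 0.02, 0.5, NA))
print(guesses)
```

Probabilities that fall between the two cut-offs are left as `NA`, which in this sketch stands for names too ambiguous to classify.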