Gendermetrics.NET: a novel software for analyzing the gender representation in scientific authoring
© The Author(s). 2016
Received: 5 April 2016
Accepted: 7 September 2016
Published: 15 September 2016
Imbalances in female career promotion are believed to be strong in the field of academic science. A primary parameter to analyze gender inequalities is the gender authoring in scientific publications. Since the presently available data on gender distribution is largely limited to underpowered studies, we here develop a new approach to analyze authors’ genders in large bibliometric databases.
A SQL-Server based multiuser software suite was developed that serves as an integrative tool for analyzing bibliometric data with a special emphasis on gender and topographical analysis. The presented system allows seamless integration, inspection, modification, evaluation and visualization of bibliometric data. By providing an adaptive and almost fully automatic integration and analysis process, the inter-individual variability of analysis is kept at a low level. Depending on the scientific question, the system enables the user to perform a scientometric analysis including its visualization within a short period of time.
In summary, a new software suite for analyzing gender representations in scientific articles was established. The system is suitable for the comparative analysis of scientific structures on the level of continents, countries, cities, city regions, institutions, research fields and journals.
KeywordsGender Gender analysis Scientometric analysis Bibliometric analysis Software tool Science structure
In the media and in scientific discussion, it is not only debated that women earn less payment in equal positions, but also less frequently climb the stairs to high ranked academic positions, e.g. in the field of medicine [8, 15, 19]. This discussion is not only limited to academic medicine, but to almost every area of science, as demonstrated by news, comments, statements or studies and reviews in renowned scientific journals such as the Science, Nature, The New England Journal of Medicine or The Lancet [3, 5, 11, 16]. A special indicator of gender representation is the position of an author in the list of authors with the first authorships and the last, alternatively senior authorships being the most prominent positions; numerous studies have focused on this issue [1, 2, 7]. However, despite the availability of a variety of different sophisticated bibliometric software tools [9, 14, 17, 18], there exists to our knowledge up to now no integrative tool for the fast and large scale analysis of gender distributions in scientific articles. Instead, gender-related analysis is still done by employing cascaded single systems performing separately the different processing steps like integration of raw data, location assignment, sex determination of the authors and statistical analysis including topic related aspects, collaboration analysis and the computation of bibliometric indices, often combined with extensive manual processing steps introducing an undesirable inter-individual variability into the analysis . To close this gap we have developed the software suite Gendermetrics.NET which serves as an integrative tool for analyzing bibliometric data with a special emphasis on gender representation and the incorporation of the article-related geo-coordinates allowing for various topographical analyses. A special design criterion is the combination of both, offline and online services for integrating raw bibliometric data and the authors’ gender. In the following, we present design and key functionality of the software. Subsequently, selected core routines of the software are demonstrated. Finally, to exemplify the working process, a standard procedure of evaluating bibliometric data is outlined. This article serves as a brief guide on how to successfully design and implement a multiuser bilbiometric software tool.
The data is provided to downstream windows clients hosted within the local area network (LAN). These remote Windows clients – representing the working environment for the scientists – include all project-specific data like publications, authors, subject areas, institutions, cities and countries, related bibliometric indices, the statistical evaluation and figure plots. All project data is stored in one file including the statistical evaluation, the figures are redrawn. This allows easy sharing and caching of project data. The intrinsically compressed file has a typical size of about 250–500 bytes per article. Conceptually, the system allows the user to make client-side updates to the database in the sense of a global learning function improving subsequent integration and identification processes for all users. The consistent application of the learning function allows the complete algorithmic reintegration of formerly integrated data. To complement this database supported offline integration yet unresolved items are evaluable by generating automated queries to the integrated online web services like BingMapSearch, GoogleMaps and Google. The software architecture is a simultaneous multi-user system, i.e. a system in which users can connect and make changes to the same database.
The Windows client was written in C# using the Microsoft Visual Studio 2015 (Microsoft, Redmond, USA) extended by the powerful DevExpress WinForms-Extension (DevExpress, Glendale, California, USA). The latter improves and simplifies both inspection and visualization of bibliometric and geographic data, e.g. by providing a powerful and flexible map and chart control. The system was developed on a Lenovo T450S laptop with Windows 10 Pro and an Intel(R) Core (TM) i7-5600U CPU up to 3.2 GHz and 12 GB RAM. The laptop also hosted the SQL-Server instance within the LAN.
The windows client
The Windows client is the working environment for the scientists and divided into the four sections Data View, Plot View, Lab Log and Database View.
According to the strategic focus on geographic information the corresponding geographic location of each localizable object is indicated in a map control below. The map control – based on either BingMaps or OpenStreetMaps offering a classical map view, a satellite view or the combination of both – allows the intuitive inspection of regions at different zoom levels, the localization of cities, the read out of geo-coordinates and the (re)definition of cities. An integrated web browser supports the integration process by generating automatic search queries, e.g. regarding the gender of a given forename or the portrait of a selected author. A process-coordinating to-do-list is given to support the user during the consecutive analyzing steps. This list, which defines a standard operating procedure, is automatically updated as a function of the current process status.
By default, the integration of raw bibliometric data is done by identifying predefined, database-located data objects like city or country names in the particular raw data string. In order to improve the quality features of integration, essentially its adaptively, robustness and error tolerance, a set of dictionaries was introduced for the entity type’s city, country, journal and institution. Mathematically, a dictionary is a collection of key and value pairs in the sense of a partial function. Concerning cities and institutions this classic dictionaries were extended by context-sensitivity, i.e. the key-value-pair is bounded to a particular context, e.g. to a country or a city, respectively. By that, e.g. the city substitution is country-specific.
The most difficult and computationally expensive part of the integration process is the correct identification of the city with a publication. Specifically, given the country, the objective is to determine reliably the correct city name from a raw location string like “Wageningen Univ, Div Human Nutr, NL-6700 EV Wageningen, Netherlands” despite variant or wrong spellings. This step is essentially due to the need of retrieving the corresponding city geo-coordinate for further data processing, e.g. the revision of identical, but variant spelled authors.
To accomplish this task, a lexical (whole-word) search routine tries to identify both, predefined city names and, preferred, keys of the city dictionary in the raw string. The preference of dictionary keys stems from the fact that dictionary keys are per definition the result of a user supervised process and hence more meaningful than predefined city names. Cities and corresponding geo-coordinates are predefined in the SQL-database.
The existence of non-identifiable cities caused by a wrong or variant spelling. Under the assumption of using an unlearned data base approximately 5–10 % of all locations fall into this category. These cities are assigned to a red status (“red cities”). It is important to note, that the use of a learning dictionary reduces this problem after few integration rounds significantly.
The existence of identified, but ambiguous city names which can be caused, e.g. by the non-consideration of additional information given in the raw location string, e.g. the specification of a federal state or a zip code. Such cities are assigned to a green status (“green cities”).
In both cases a user-supervised post processing is required. Its various steps are described in the following.
Integration of red cities
To integrate cities with a red status it is recommended to first apply the BingMapSearch-Service (SOAP-Search-API, Microsoft). Specifically, starting from a given location string, multiple combinatorial queries are generated and executed. In case of a positive identification the identified city name and its geo-coordinates are returned. On the basis of this information, predefined cities located inside a particular radius around the target (typically r = 20 km) are provided and ordered by their lexical similarity (Levenshtein distance) to the target. This process iterates automatically through all red (= unidentified) cities. After this operation the user has the option to select between the suggested cities. If no corresponding predefined city equivalent exists, the user can define a new city. By this procedure about 40 % to 50 % of the previously unidentified cities are resolved.
- (ii)In order to correct a wrong or variant spelling (e.g. “Linkopin” instead of “Linkoepin”, Sweden) a rapid lexical similarity analyzer (based on the Levenshtein distance) suggested the most similar predefined city names for selection as illustrated in Fig. 8.
Finally, the integrated web browser enables the user to generate automatic queries to Google Maps. In case of a positive identification the geo-coordinates of the city – automatically extracted from the address line of the browser – are used to make predefined cities within a given distance (typically r = 20 km) available for selection. In this way, the integration of red cities can be achieved reliably.
It is important to note, that in all three cases the city dictionary is updated accordingly.
Integration of green cities
The definitive assignment of the green cities (= identified, but ambiguous city names) is similar to the identification of the red cities. Again multiple combinatorial queries – including additional local information like the federal state – are sent to the BingMapSearch-Provider. In case of a positive identification (in about 80 % of the cases) the geographically nearest city is selected from the list of possible (i.e. equally named) cities. If the chosen city is outside a defined maximum distance (typically d = 20 km) the user can define a new city. In order to enable the system to learn this decision, the selection is stored automatically in the database for subsequent city integrations.
This latter procedure is legitimated by the paradigm that in common only one city contributes significantly to scientific articles within a set of identical city names in a country (e.g. Berlin, capital town of Germany and Berlin, a small village in Schleswig-Holstein). If this rule is violated (e.g. Arlington in Virginia and Arlington in Illinois) the related federal state, zip code or any other additional local information has to be manually incorporated in the city name.
Spatial clustering of cities
Depending on the scientific question posed, it is recommended to perform a spatial clustering of all cities after the integration process. This operation is realized by implementing the DBSCAN (= Density-Based Spatial Clustering of Applications with Noise) algorithm , a density based cluster algorithm which can find arbitrarily shaped clusters in the data (e.g. a city region along a river) or clusters that are completely surrounded by a different cluster (e.g. exurbs). As a parameter the user has to define the maximum distance ε which specifies how close points should be to each other to be considered a part of a cluster (typically ε = 50 km). In this way city regions are formed, whereby the city with the highest publication count defines the name of the region whereas remaining cities are subsumed as city child’s (e.g. city region ‘Boston’ with child’s ‘Cambridge’).
As a result, the scientific relevance of a versatile and partitioned region is more adequately captured than using a pure city-based approach often underestimating the scientific impact.
The gender of an author is algorithmically determined by its forenames und differentiates between four different groups (male, female, unisex and undefined). The analysis is context-free, i.e. solely based on the forenames of the author, additional information like the related countries are not included. Conceptually, each gender of a forename is determined on the basis of data table which currently contains about 76,600 forenames defining the genders as male, female and unisex. The number of male and female names is approximately equal (Nmale = 30,544, Nfemale = 30,874, Nunisex = 15,149). In the case of multiple forenames for each forename the gender is determined. Subsequently the gender is considered as male (female) if more male (female) forenames exist. Otherwise it is marked as unisex. Undefined authors can be revised by applying a second algorithm which requests predefined webpages specialized in determining the gender by first names. Both algorithms are fully automated, thus no further user interaction is needed. Finally, the user can revise unisex and undefined authors manually by applying the integrated web browser, e.g. to generate automatic queries to Namepedia (www.namepedia.org) or to identify the author by a simple name search. In all cases the gender table is updated accordingly.
The first step within the analysis is the data acquisition which is characterized by the read out of raw bibliometric data from the Web of Science Core Collection (T = 10 min). The data is imported as a tab-delimited UTF-8 (=8-Bit Universal Character Set Transformation Format) formatted text file. Since the number of entries is limited to a maximum of 500 entries per download the whole data is distributed over several files (here N = 50).
After data acquisition the data integration process starts with uploading the raw data into a (not necessarily new) Gendermetrics.NET project (T = 1 min). In this step, the data is reunified within the project and corrected for double entries that are characterized by identical titles, authors, publication dates and journals. The subsequently automatic integration process (T = 90 min, approximately 5–10 items/s) can be interrupted at any time, saved and continued later, as required. Furthermore, due to the multi-user design different integration processes can be performed simultaneously. During the integration process an estimated finish time is given. Additionally, a rapid integration mode on the topographical level of countries instead of cities is provided, increasing the integration speed by a factor of 50.
After the successful process of integration, the results have to be revised by the user. According to their natural dependencies (e.g. a city is identified in a country) the revision order is: Countries, cities and institutions.
Normally, countries, journals and subject areas are integrated flawlessly, in rare cases an unidentified country has to been corrected manually (T = 10 min). In the next revising step, the semi-automatic integration of both, unidentified (red) and ambiguous (green) cities is done supported by diverse algorithms as documented above (T = 2 h, resp. 1 h).
After their correct integration the user has the option to store the associations between institutions and cities (T = 5 min). This step is done before both, substituting the institution abbreviations for better reading and reorganization of the institutions (up to 2 days) with the omission of institution names. To create city regions a spatial clustering of the cities can be performed, optionally (T = 5 min). In the next step a lexical based author gender analysis is performed.
Subsequently, the user has the option to revise the classification of the ‘unisex’ and ‘undefined’ authors by applying web services. The subsequent lexical and topographic check for identical authors marks the last step of the integration process. Here, authors who might be identical (i.e. same name, different or similar first names, different cities) are marked for inspection; authors who are most likely identical (i.e. same name, similar first names and shared cities) are merged automatically. Similar first names are thereby defined as potentially equal by substring containment, e.g. ‘Mark A’, ‘M A’ and ‘Antonio M’.
The following analyze process is characterized by the computation of the bibliometric indices (like publication count, citation-count, topic-specific h-index and topic-specific productivity index) for each object type and the subsequent statistical evaluation including a variety of gender-related computations which can be done with respect to institutions, cities, city regions, countries, subject-areas and journals. One key element is the combined analysis of both, relative frequency and chances of female authors to be a author at a given authorship rank. Finally, all results are illustrated by canonical data plots and are exportable for further analysis and visualization.
The software suite Gendermetrics.NET is a simultaneous multi-user tool to allow seamless integration, inspection, modification, evaluation and visualization of bibliometric data. It extends classical approaches by offering gender and location related analysis tools. By providing an adaptive and almost fully automatic integration and analyze process, the inter-individual variability of analysis is kept at a low level. Gendermetrics.NET is intended for scientists who are interested in bibliometric analysis with a special emphasis on a topographical gender analysis. Depending on the scientific question, it enables the user to perform a scientometric analysis including its visualization within a short period of time.
The server-client architecture reflects the idea of data organization which is characterized by the separation of global and project-specific data. Decentralized (client-side) updates of the database are used to improve subsequent integration, resp. classification processes in the sense of a global learning function. By default, a simultaneous multi-user mode is supported as well as a stand-alone mode after successful data integration.
One key paradigm of analysis is the consideration of the geo-coordinates of each involved city. By this feature, various topographical analyses that are beyond the mere country, state or city level can be realized. Furthermore, the system offers the option to cluster cities spatially by their coordinates using a fast, robust and unsupervised algorithm (DBSCAN, section). This operation results in city regions representing a new spatial level between cities and states, resp. countries. The cluster algorithm can be used to overcome the local fragmentation of a city region which can result in an underestimation of its scientific impact in a solely city-based approach.
Algorithmically, the most challenging part of integration is the correct identification of the city (including its geo-coordinates) as mentioned above. Various methods including both, offline (database-based) and online (web search-based) methods are provided. Among the advantages of offline (database-supported) data integration there are a fast mode of operation and the possibility to update the underlying data in terms of a learning function. The major drawback is the incompleteness of data. By contrast, the major advantage of online (web-based) data integration is the access to large amounts of continuously updated and more comprehensive data. Its main disadvantages are the relative slowness of the query process and the missing capability to update the underlying database individually. Moreover, only a limited number of queries are free of charge (currently the BingMapsSearch-Provider has a limit of 50,000 queries/day, basic & not-for-profit key). For this reason the online-based identification is exclusively made for unidentifiable cities.
By construction, the database-supported city integration is context-free, i.e. additional location-specific information is not considered. For example, the raw location string “Univ Texas Arlington, Dept Bioengn, Joint Grad Program, Arlington, TX 76019 USA” refers unambiguously to Arlington in the U.S. state Texas and not to e.g. Arlington in Virginia. The reason for this critical restriction, which requires at least one further post-processing step, is the enormous complexity required to build an appropriate database. However, the capability of the system to learn the association between institution and city reduces this problem significantly.
Another disadvantage regarding the algorithmic city integration is the usage of the older BingMaps-search version (SOAP-Search-API). In contrast to the currently latest version (REST-API, used in BingMaps 2016) this search API shows a significantly lower hit rate. Unfortunately, the newer REST-API has not yet been integrated in the DevExpress library (personal communication). This will be achieved hereafter.
Apart from the usual bibliometric indices (citation & publication count, h- and productivity-index, author’s ranking)  and classical collaboration analysis , the statistical evaluation of the data includes a variety of gender-related computations which can be done institution-, city-, city region-, country-, subject-area- and journal-specific. The gender of an author is thereby determined by its forenames und differentiates between four different groups (male, female, unisex and undefined). An unisex (epicene) name is a given name that can be used regardless of the person’s sex. Frequently, names have gender connotations that differ from country to country (e.g. ‘Andrea’ in Italy an exclusively male forename, whereas in Germany used only for female persons). Despite that, the algorithmic sex determination of an author is context-free, i.e. without including additional information like the related countries. One reason for that is that due to international mobility it is not quite possible to conclude from the related country to the nationality of the author, particularly since some authors are associated to institutions of different countries.
By manual reconsideration, e.g. by using the integrated web browser and automatically generated queries (e.g. search for an image to determine precisely the author’s gender), a subsequent reclassification of ‘unisex’ or ‘undefined’ authors is possible. However, despite various technical tools this procedure is expensive and depends critically on the preciseness and persistence of the evaluator introducing an undesirable inter-individual variability into the analysis. Moreover, the success of identification depends directly on the Internet presence of the particular author (is there a photo available?). But even if a careful and complete correction is done, due to a lack of information (in about 80 % of undefined authors the first name consists of initials) or the usage of gender-neutral forenames – especially in Asian countries, the gender of many authors remains unclear. For this reason, it is plausible to skip the manual post-processing in favor of a fast (<2 min), fully algorithmic and therefore reproductive detection. For articles later than 2008 the high algorithmic detection rate of about 80 % enables the investigator to make valid propositions about the underlying population using the methods of statistical inference, particularly in comparision to previously published studies with significant lower detection rates .
The extension of the gender-database will result in even higher identification rates. In future, this situation may be tackled more efficiently by using author identification numbers that are associated with additional personal information (e.g. ORCID, UIAN) .
In summary, a new software suite for analyzing gender representations in scientific articles was established. The system is suitable for the comparative analysis of scientific structures on the level of continents, countries, cities, city regions, institutions, research fields and journals.
Availability of data and materials
MHKB and DAG initiated the study, MHBK developed the software, MHKB, DB, NS, and DAG participated in the discussion of the methodology and revision, MHBK and DAG wrote the manuscript, all authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Akhabue E, Lautenbach E. “Equal” contributions and credit. An emerging trend in the characterization of authorship. Ann Epidemiol. 2010;20(11):868–71. doi:10.1016/j.annepidem.2010.08.004.View ArticlePubMedPubMed CentralGoogle Scholar
- Bates T, Anić A, Marusić M, Marusić A. Authorship criteria and disclosure of contributions: comparison of 3 general medical journals with different author contribution forms. JAMA. 2004;292(1):86–8. doi:10.1001/jama.292.1.86.View ArticlePubMedGoogle Scholar
- Boyle PJ, Smith LK, Cooper NJ, Williams KS, O’Connor H. Gender balance. Women are funded more fairly in social science. Nature. 2015;525(7568):181–3. doi:10.1038/525181a.View ArticlePubMedGoogle Scholar
- Brüggmann D, Handl V, Klingelhöfer D, Jaque J, Groneberg DA. Congenital toxoplasmosis. An in-depth density-equalizing mapping analysis to explore its global research architecture. Parasites Vectors. 2015;8(1):292. doi:10.1186/s13071-015-1263-x.View ArticleGoogle Scholar
- Buitendijk S, Corda D, Flodström A, Holdcroft A, Hunter J, Pollitzer E, et al. Women in science and medicine. Lancet. 2011;377(9768):811. doi:10.1016/S0140-6736(11)60305-X.View ArticlePubMedGoogle Scholar
- Falagas ME. Unique author identification number in scientific databases. A suggestion. PLoS Med. 2006;3(5):e249. doi:10.1371/journal.pmed.0030249.View ArticlePubMedPubMed CentralGoogle Scholar
- Filardo, Giovanni; da Graca Briget; Sass Danielle M, Pollock Benjamin D, Smith Emma B, Martinez Melissa Ashley-Marie. Trends and comparison of female first authorship in high impact medical journals. Observational study (1994-2014). BMJ.2016:S. i847. DOI: 10.1136/bmj.i847.
- Flannery AM. Success, women, and academic surgery. Surgery. 2002;131(6):670–1.View ArticlePubMedGoogle Scholar
- Grauwin S, Jensen P. Mapping scientific institutions. Gendermetrics. 2011;89(3):943–54. doi:10.1007/s11192-011-0482-y.Google Scholar
- Larivière V, Ni C, Gingras Y, Cronin B, Sugimoto CR. Bibliometrics: global gender disparities in science. Nature. 2013;504(7479):211–3.View ArticlePubMedGoogle Scholar
- Leslie S-J, Cimpian A, Meyer M, Freeland E. Expectations of brilliance underlie gender distributions across academic disciplines. Science. 2015;347(6219):262–5. doi:10.1126/science.1261375.View ArticlePubMedGoogle Scholar
- Sander J, Ester M, Kriegel H-P, Xu X. Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Disc. 1998;2(2):169–94. doi:10.1023/A:1009745219419.View ArticleGoogle Scholar
- Schöffel N, Kirchdörfer M, Brüggmann D, Bundschuh M, Ohlendorf D, Groneberg D, Bendels M. Eine deskriptive Betrachtung der globalen Forschungslandschaft und der aktuellen Forschungsentwicklungen zum Thema Sarkoidose. Pneumologie. 2016;70(01):28–36. doi:10.1055/s-0041-109339.View ArticlePubMedGoogle Scholar
- Sillet A. Sigaps, un outil d’analyse des publications scientifiques. Soins; la revue de référence infirmière. 2015;794:58–60.PubMedGoogle Scholar
- Tesch BJ, Wood HM, Helwig AL, Nattinger AB. Promotion of women physicians in academic medicine. Glass ceiling or sticky floor? JAMA. 1995;273(13):1022–5.View ArticlePubMedGoogle Scholar
- Urry M. Science and gender: scientists must work harder on equality. Nature. 2015;528(7583):471–3. doi:10.1038/528471a.View ArticlePubMedGoogle Scholar
- van Eck, Jan N, Waltman L. Software survey. VOSviewer, a computer program for bibliometric mapping. Gendermetrics. 2010;84(2):523–38. doi:10.1007/s11192-009-0146-3.Google Scholar
- van Eck, Jan N, Waltman L. CitNetExplorer. A new software tool for analyzing and visualizing citation networks. J Informetrics. 2014;8(4):802–23. doi:10.1016/j.joi.2014.07.006.View ArticleGoogle Scholar
- Zhuge Y, Kaufman J, Simeone DM, Chen H, Velazquez OC. Is there still a glass ceiling for women in academic surgery? Ann Surg. 2011;253(4):637–43. doi:10.1097/SLA.0b013e3182111120.View ArticlePubMedGoogle Scholar