Numero: A statistical framework to define multivariable subgroups in complex population-based datasets

Song Gao, Stefan Mutter, Aaron Casey, Ville Petteri Mäkinen

    Research output: Contribution to journal › Article › peer-review

    17 Citations (Scopus)

    Abstract

    Large-scale epidemiological and population data provide opportunities to identify subgroups of people who are at risk of disease or exposed to adverse environments. Clustering algorithms are popular data-driven tools to identify these subgroups; however, relying exclusively on algorithms may not produce the best results if the dataset does not have a clustered structure. For this reason, we propose a framework (the R-library Numero) that combines the self-organizing map algorithm, permutation analysis for statistical evidence and a final expert-driven subgrouping step. We used Numero to define subgroups in two examples without an obvious clustering structure: a biomedical dataset of kidney disease and another dataset of community-level socioeconomic indicators. We benchmarked the Numero subgroupings against popular clustering algorithms (principal components, K-means and hierarchical clustering). The Numero subgroupings were more intuitive and easier to interpret without losing mathematical quality. Therefore, we expect Numero to be useful for exploratory analyses of population-based epidemiological datasets.
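
    The first two steps of the framework described above (a self-organizing map followed by permutation analysis) can be illustrated with a minimal sketch. This is not the Numero API: Numero is an R library, and every function name, parameter and the synthetic dataset below are assumptions made for illustration only. The test statistic used here (variance of per-unit means) is likewise a simplifying assumption and may differ from what Numero computes. The final expert-driven subgrouping step is a manual judgement and is not coded.

```python
import numpy as np

def train_som(data, rows=4, cols=4, epochs=20, seed=0):
    """Train a small rectangular SOM; returns (rows*cols, n_features) prototypes."""
    rng = np.random.default_rng(seed)
    n_units = rows * cols
    # Grid coordinates of each unit, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise prototypes from randomly chosen data rows.
    protos = data[rng.choice(len(data), n_units, replace=True)].astype(float)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = 0.5 * (1.0 - frac)                          # decaying learning rate
        sigma = max(rows, cols) / 2.0 * (1.0 - frac) + 0.5  # shrinking radius
        for i in rng.permutation(len(data)):
            x = data[i]
            bmu = np.argmin(((protos - x) ** 2).sum(axis=1))  # best-matching unit
            d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))
            protos += lr * h[:, None] * (x - protos)
    return protos

def assign_units(data, protos):
    """Index of the best-matching unit for every data row."""
    d = ((data[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def unit_means(values, units, n_units):
    """Mean of `values` per map unit; empty units get the overall mean."""
    sums = np.bincount(units, weights=values, minlength=n_units)
    counts = np.bincount(units, minlength=n_units)
    return np.where(counts > 0, sums / np.maximum(counts, 1), values.mean())

def permutation_pvalue(values, units, n_units, n_perm=500, seed=1):
    """P-value for 'values vary across the map more than expected by chance'."""
    rng = np.random.default_rng(seed)
    observed = unit_means(values, units, n_units).var()
    null = np.array([
        unit_means(rng.permutation(values), units, n_units).var()
        for _ in range(n_perm)
    ])
    return (1 + (null >= observed).sum()) / (n_perm + 1)

# Synthetic data without a clustered structure (a single Gaussian cloud).
rng = np.random.default_rng(42)
data = rng.normal(size=(300, 3))
signal = data[:, 0] + 0.5 * rng.normal(size=300)  # related to the map inputs
noise = rng.normal(size=300)                      # unrelated to the map inputs

protos = train_som(data)
units = assign_units(data, protos)
p_signal = permutation_pvalue(signal, units, len(protos))
p_noise = permutation_pvalue(noise, units, len(protos))
print(f"signal p = {p_signal:.3f}, noise p = {p_noise:.3f}")
```

    In this sketch, a small p-value indicates that a variable shows regional patterns on the map beyond what random shuffling produces, which is the kind of statistical evidence an analyst could weigh when drawing subgroup boundaries by hand.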

    Original language: English
    Pages (from-to): 369-374
    Number of pages: 6
    Journal: International Journal of Epidemiology
    Volume: 48
    Issue number: 2
    DOIs
    Publication status: Published or Issued - 1 Apr 2019

    Keywords

    • Data-driven subgrouping
    • Multivariable statistics
    • Population data
    • Self-organizing map

    ASJC Scopus subject areas

    • Epidemiology
