Abstracts

Abstracts accepted for the conference are published here in program order. All text is verbatim from authors’ submissions, with only minor format editing.

Day 1

Keynote 1

Chair: Muskaan

A Peek Around the Corner: Some Thoughts About the Next Five Years

Vince Galvin, Chief Methodologist, Stats NZ

I’ll present some thoughts about some of the issues that I expect to dominate the Official Statistics aspects of Statistical practice. As well as highlighting a few issues in delivering accurate measures of things that matter, I’ll offer some more personal views about how some of the big trends will play out.

Session 1.1: Advanced Data and Analytics

Chair: Daniela Vasco

Conceptualised psycho-medical footprint for health status outcomes and the potential impacts for early detection and prevention of chronic diseases in the context of 3P medicine

Eben Afrifa-Yamoah

Early identification of people at risk of cardiometabolic diseases is a major clinical need. Such individuals can benefit from tailored treatments that can potentially reduce their risk, while bringing precision to predictive, preventive and personalised medicine (3PM). Central to preventive medicine is the new concept of suboptimal health status (SHS), which captures individuals with subclinical conditions across five subscales (fatigue, immune system, mental status, digestive system and cardiovascular system) using a 25-item psychometric-like instrument (SHSQ-25). Each of the subscales represents aspects of a person’s health status that could be explored in a disease continuum. This baseline study explores the internal structure of the SHSQ-25, demonstrates its discriminatory power to distinguish optimal from suboptimal health status, and develops visual representations of their distinct relationship patterns via graphical LASSO modelling and multi-dimensional scaling configuration methods. Differences were observed in the structure, node placement and node distances of the domain networks for the optimal and suboptimal populations. A statistically significant difference in connectivity levels was noted between the optimal (58 non-zero edges) and suboptimal (43 non-zero edges) networks (p=0.024). Fatigue emerged as a prominently central subclinical condition within the suboptimal population, whilst the cardiovascular system domain had the greatest relevance for the optimal population. The contrast in connectivity levels and the divergent prominence of specific subclinical conditions across domain networks shed light on potential health distinctions. This study demonstrates the feasibility of creating dynamic visualizers of the evolutionary trends in the relationships between the domains of the SHSQ-25 relative to health status outcomes.

Supersaturated design based statistical methods for variable selection in observational data - mixed data case

Tharkeshi Thauja Dharmaratne

Variable selection is a widely used approach in observational studies for selecting appropriate variables to include in a statistical model. Numerous variable selection methods have been introduced for this purpose, but no single method consistently performs better than the others. Hence, we attempted to design a new method for observational studies by modifying a supersaturated design (SSD)-based factor screening statistical method initially introduced for experimental studies. We conducted a comprehensive simulation study to assess the variable selection performance of our proposed method on observational data and compared it with several popular existing variable selection methods for observational studies. The simulation studies were designed by creating datasets including both numeric and binary variables with differing sample sizes and by allocating both large and small coefficients to the true predictors. The methods were evaluated using Type I and Type II error rates, model selection frequency and variable inclusion frequency (VIF). In our simulation setting, backward elimination (BE) with the BIC selection criterion (BE(BIC)) consistently demonstrated better performance in accurately identifying the true predictors, regardless of the coefficient sizes, in large samples. However, selection of potential true predictors was challenging for all the methods when the sample size was small. The selected SSD-based statistical method showed the possibility of assessing the significance of a predictor by comparing the VIF it generates with that generated by the BE(AIC) method. The SSD method also showed performance comparable to some of the commonly used modern variable selection methods in terms of selecting true predictors with small coefficients. Hence, this study provides promising results for the potential use of SSD-based factor screening statistical methods for variable selection in observational studies. Future research could focus on combining the BE(AIC) method and the SSD-based statistical method for proper identification of the significance of a predictor, particularly in small-sample data.

Some Results Relating the Construction of Frequency Squares from Cellular Automata

Vindya Nishadi Warnakulasooriyage

Frequency squares are a generalization of Latin squares and are widely employed in statistical experimental design due to their inherent flexibility. A frequency square is of type \((n;\lambda)\) if it contains \((n/\lambda)\) symbols, each of which appears \(\lambda\) times per row and \(\lambda\) times per column. In the case when \(\lambda=n\), we refer to the frequency square as trivial, and when \(\lambda=1\), it is a Latin square. Recent research has shown that bipermutive cellular automata generate Latin squares. Building on this, we aim to determine the necessary conditions for generating Latin squares from cellular automata. We prove that the square of order \(2^{d-1}\) generated from a cellular automaton of diameter \(d\) and length \(2(d-1)\) over \(\mathbb{F}_2\) is a Latin square if and only if the local rule is bipermutive. Then, we construct frequency squares based on cellular automata up to diameter \(d=5\) and identify the type of these frequency squares. Further, we prove that the square of order \(2^{d-1}\) generated from a cellular automaton of diameter \(d\) and length \(2(d-1)\) over \(\mathbb{F}_2\) is a trivial frequency square if and only if the local rule of the cellular automaton is a constant function.
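
As a small illustration of the construction described above, the following sketch builds the square of order \(2^{d-1}\) from a no-boundary cellular automaton over \(\mathbb{F}_2\) and checks the Latin square property; the diameter \(d=3\) and the bipermutive rule 150 are illustrative assumptions, not choices from the paper.

```python
# Sketch: build the order-2^(d-1) square from a CA of diameter d over F_2.
# Rows and columns are indexed by the left/right halves of a configuration of
# length 2(d-1); the entry is the CA output read as an integer. Rule 150
# (x1 XOR x2 XOR x3) is bipermutive, so the resulting square should be Latin.
from itertools import product

d = 3                                                   # diameter (assumed)
n = 2 ** (d - 1)                                        # order of the square
rule = lambda cells: cells[0] ^ cells[1] ^ cells[2]     # bipermutive local rule

def ca_output(config):
    """Apply the no-boundary CA once: length 2(d-1) -> length d-1."""
    return tuple(rule(config[i:i + d]) for i in range(len(config) - d + 1))

def to_int(bits):
    return int("".join(map(str, bits)), 2)

square = [[None] * n for _ in range(n)]
for left in product((0, 1), repeat=d - 1):
    for right in product((0, 1), repeat=d - 1):
        square[to_int(left)][to_int(right)] = to_int(ca_output(left + right))

# Check the Latin square property: every symbol once per row and per column.
is_latin = all(sorted(row) == list(range(n)) for row in square) and \
           all(sorted(col) == list(range(n)) for col in zip(*square))
print(square, is_latin)
```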

Model Choices and Variance Estimation Methods for Assessing Healthcare Providers’ Performance in Treating ESKD Patients Close to Home

Solomon Woldeyohannes

In Australia, despite Aboriginal and/or Torres Strait Islander (First Nations) peoples having up to six times the incidence of end-stage kidney disease (ESKD) requiring treatment compared with non-Indigenous Australians, they have less than half the chance of having dialysis treatment close to home. In November 2018, Australian Government policy changed to provide funding for dialysis in very remote areas. However, practice changes occur at the individual treating hospital level. This raises the question: are there hospitals providing more, or less, equitable care?

Using patient-level data from the national ANZDATA Registry for 2005-2022, we used a random effects model to predict treatment at home. We compared the log standardised incidence ratio (log(SIR)) for First Nations and non-Indigenous patients at each hospital within the same model. A funnel plot was used to compare the hospitals’ performance in treating both First Nations and non-Indigenous patients close to home. We used the Akaike information criterion (AIC) to compare modelling approaches. We also varied estimation of the variance of the log(SIR), using the delta method, bootstrapping, and Markov Chain Monte Carlo (MCMC) approaches (over 5000 iterations).

Model output varied considerably with model choice. By AIC, the binomial model performed best (AIC = 5400.21), followed by the logistic (AIC = 6197.17) and Poisson (AIC = 18011.64) models. However, since the logistic model is based on unit-level data, and both the binomial and Poisson models are based on aggregate-level data, we prefer the logistic model to avoid the ‘Ecological Fallacy’ bias due to grouping. Although the MCMC approach was found to be computationally expensive, it provided robust estimates.

We demonstrate that these methods can be used to measure equity for patient-centred outcomes, both within and between service providers simultaneously. Both model choice and variance estimation method choice are critical and heavily affect the interpretation of the performance of health service providers.

Session 1.2: Mathematical Statistics

Chair: Shih Ching Fu

Identifying Team Playing Styles Across Phases of Play: A User-Specific Cluster Framework

Samuel Joseph Moffatt

Investigating team performance at a granular level within matches allows for the analysis of phase-specific team playing styles. The cluster framework described in this paper assists in identifying phase-specific team playing styles within team invasion sports that are vital for guiding match analysis. This paper develops a novel clustering framework, proposing a composite clustering assessment index for selecting the optimal feature transformation technique, clustering algorithm and number of clusters. The proposed composite index allows the integration of subject matter expert knowledge, ensuring that the resulting clusters are chosen optimally in alignment with the analysis goals. The clustering framework is applied in the context of Australian football to identify an interpretable number of clusters that represent the inherent grouping of team playing styles during match phases.

Semi-Supervised Estimation of Marginal Means: An Optimal Constrained Least Squares Approach

Dong Luo

We develop a new semi-supervised method for estimating the marginal mean of a response variable in scenarios where traditional linear models fall short, while generalized linear models (GLMs) such as logistic or Poisson regression can serve as good working models—even if they are misspecified. Our estimator is well suited to binary or strictly positive response variables. While the approach involves fitting a GLM on the labelled data and then supplementing it with the unlabelled data, we show that the optimal asymptotic variance for the estimator of the marginal mean is obtained via a constrained least squares approach to the GLM fit, rather than via conventional maximum likelihood. We demonstrate that our estimator is consistent and that it achieves smaller variance than both the sample mean and the maximum likelihood approach whenever there is a non-zero correlation between the response and some covariate, regardless of correct specification of the GLM form. We provide simulation studies to support our theoretical findings.

Session 1.3: Diverse Applications of Data Science and Statistics

Chair: Lucy Conran

Utilising Machine Learning for Anomaly Detection and Editing in Official Statistics

Adam George Leinweber

The ABS is currently assessing the feasibility of using machine learning to identify and potentially treat anomalies in large administrative datasets for use in official statistics. This would have various benefits, including efficiency, high-quality targeting of anomalies, and the ability to detect emergent anomalies. In this paper, the focus is placed on anomaly detection, emphasising the challenges related to its use in official statistics. Three unsupervised algorithms, local outlier factor, isolation forest, and extended isolation forest, are explored along with some ensemble approaches. We aim to assess performance using visualisations and standard metrics, while managing issues around explainability, interpretability, maintenance, and computational demand.
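
To make the detectors concrete, here is a rough sketch using two of the algorithms named above on synthetic data (not the ABS administrative datasets); the extended isolation forest lives in separate packages and is omitted, and the simple two-detector ensemble rule is an illustrative assumption.

```python
# Two unsupervised anomaly detectors applied to synthetic data, plus a naive
# "agreement" ensemble; thresholds and the 2% contamination guess are made up.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 3)),   # bulk of the records
               rng.normal(8, 1, size=(10, 3))])   # a small anomalous cluster

iso_scores = IsolationForest(random_state=0).fit(X).decision_function(X)
lof_scores = LocalOutlierFactor(n_neighbors=20).fit(X).negative_outlier_factor_

# Flag records ranked anomalous by both detectors (lower scores = more anomalous).
flag = (iso_scores < np.quantile(iso_scores, 0.02)) & \
       (lof_scores < np.quantile(lof_scores, 0.02))
print(f"{flag.sum()} records flagged by both detectors")
```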

Apportioning Economic Data of Multi-location Businesses

Zachary Steyn

Current administrative data sources provide economic quantities at the business level (ABN). However, to construct spatial economic indicators there is a need to allocate the economic quantities of ABNs that have multiple locations. The current method of attributing ABN quantities to locations is Equal Allocation, which splits quantities equally between locations. While such a method may be appropriate for locations that do not have varying levels of economic activity, it can potentially create bias and inaccuracies by oversimplifying the complexities of business operations and the economic activities happening in each location. To address this, we have been investigating methods that can apportion ABN-level data in the presence of differing location sizes, types of activities performed, and hierarchical structures. These methods focus on using administrative data including the Business Longitudinal Analysis Data Environment (BLADE), Census, and Single Touch Payroll (STP). The Average Economic Indicator method uses BLADE data to estimate the average economic metrics of comparable businesses in each location. These metrics are then assumed to be appropriate proxies for the allocation proportions. The use of STP data allows economic quantities to be apportioned based on the number of STP employees in the statistical areas matching known business locations. A challenge with this method is the need to match STP-reported residential locations to business operating locations, which becomes a greater issue at smaller geographies. The Census Place of Work data provides the workplace address of Census respondents, which can be used to estimate the average number of employees per business for each location. These averages are used for assigning allocation proportions. Future work will attempt to link Census, STP, and BLADE sources to potentially create the best available source for validation and to also inform the types of activity performed at each location.
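
As a small illustration of the employee-count apportionment idea, the sketch below computes allocation shares by location and applies them to an ABN-level quantity; the data frame, locations and figures are hypothetical.

```python
# Allocate an ABN-level quantity to locations in proportion to employee counts.
import pandas as pd

stp = pd.DataFrame({
    "abn":       ["A"] * 3 + ["B"] * 2,
    "location":  ["Perth", "Albany", "Broome", "Perth", "Bunbury"],
    "employees": [120, 30, 10, 45, 15],
})
stp["share"] = stp["employees"] / stp.groupby("abn")["employees"].transform("sum")

abn_turnover = pd.Series({"A": 3.2e6, "B": 0.9e6})            # ABN-level quantity
stp["allocated_turnover"] = stp["share"] * stp["abn"].map(abn_turnover)
print(stp)
```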

Autovi: Automated Assessment of Residual Plots Using Computer Vision

Weihao Li

Visual assessment of residual plots is crucial for evaluating linear regression model assumptions and fit, but accurately interpreting these plots can be challenging. The ‘autovi’ package provides an automated solution in R by leveraging computer vision models. Taking a residual plot as input, ‘autovi’ estimates a distance measure that quantifies the divergence of the actual residual distribution from the reference distribution expected under correct model specification. This estimated distance enables formal statistical tests and provides a holistic approach to collectively assess different model assumptions. This talk will introduce the functionality of ‘autovi’, demonstrate its performance across diverse regression scenarios, and discuss opportunities to extend the package.

Enhancing Stability Selection to Identify Important Variables Using Bayesian Approaches

Mahdi Nouraie

Variable selection is a critical challenge in statistical analysis, as identifying the optimal subset of variables becomes exponentially complex with an increasing number of variables. Numerous methods have been developed to address this issue, focusing on improving the efficiency of searching through a large number of candidate variables. One widely recognized framework is stability selection, which employs resampling techniques to apply variable selection methods to various resamples of the data. The frequency with which a variable is selected across different resamples, termed the inclusion probability, serves as an indicator of the variable’s importance. A higher inclusion probability suggests that a variable is consistently selected across different data resamples, indicating its stability and significance.

This presentation demonstrates how to integrate Bayesian analysis into the inference of inclusion probabilities. In many real-world applications, prior knowledge about the importance of certain variables is available. The Bayesian framework allows us to incorporate this prior information to enhance our understanding of inclusion probabilities. Additionally, Bayesian methods provide credible intervals and posterior distributions for inclusion probabilities, offering a more comprehensive and informative approach compared to solely using the average selection frequency. This integration of Bayesian analysis with stability selection aims to improve scientific and informed decision-making by providing a reliable framework for variable selection that leverages prior knowledge and offers detailed probabilistic insights.
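
As a minimal sketch of the Bayesian treatment of a single inclusion probability (an illustration, not the presenter’s implementation), one can treat the number of times a variable is selected across B resamples as Binomial and place a Beta prior on the inclusion probability; the counts and prior below are made up.

```python
# Beta-Binomial update for one variable's inclusion probability.
from scipy import stats

B, k = 100, 62    # resamples and number of times the variable was selected
a, b = 2, 2       # Beta prior; an informative prior would encode domain knowledge

posterior = stats.beta(a + k, b + B - k)
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```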

MM Algorithms for Lasso Feature Selection

Anant Mathur

The popular Lasso regression is widely used for model selection on a set of predictors. The Group Lasso regression extends the vanilla Lasso by enabling model selection on groups of predictors that exhibit a natural structure. Under some basic assumptions, both the vanilla and Group Lasso can be solved using a simple coordinate descent (CD) algorithm, which minimizes over one variable at a time. The CD algorithm is the foundation of many popular Lasso feature selection implementations, including GLMnet.

Despite its simplicity, the CD algorithm converges slowly when the feature matrix is poorly conditioned. This article presents a novel iterative algorithm for both group and vanilla Lasso based on the majorize-minimize (MM) principle. Using real data and simulations, we demonstrate that the MM algorithm converges up to an order of magnitude faster than the classical CD algorithm when computed over a regularization path.
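
For context, here is a minimal coordinate-descent sketch for the vanilla Lasso, the classical baseline described above, assuming a centred response; the proposed MM algorithm itself is not reproduced here.

```python
# Cyclic coordinate descent for (1/(2n))||y - Xb||^2 + lam * ||b||_1.
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]          # partial residual excluding feature j
            beta[j] = soft_threshold(X[:, j] @ r_j / n, lam) / (X[:, j] @ X[:, j] / n)
    return beta
```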

Traversing Truck Telematics: Insights from Telstra’s MTData using R

Andrew Grose

This talk presents an overview of MTData, an innovative platform designed for fleet management. MTData provides a wealth of operational data on vehicles, with real-time GPS tracking being a key component. This system offers continuous updates on vehicle location and status, which are crucial for enhancing operational efficiency. The intricacies of the system’s data generation and its integration with R will be explored.

Session 1.4: Computational Statistics

Chair: Daisy Evans

Air-HOLP: Adaptive Regularized Feature Screening for High Dimensional Data

Ibrahim Joudah

Handling high-dimensional datasets presents substantial computational challenges, particularly when the number of features far exceeds the number of observations and when features are highly correlated. A modern approach to mitigate these issues is feature screening. In this work, we build on the High-dimensional Ordinary Least-squares Projection (HOLP) feature screening method by employing adaptive ridge regularization. We examine the impact of the ridge penalty on the Ridge-HOLP method and propose Air-HOLP, a data-adaptive extension of Ridge-HOLP in which the ridge-regularization parameter is selected optimally for better feature screening performance. Air-HOLP is evaluated using simulated data and a prostate cancer genetic dataset. The empirical results demonstrate that Air-HOLP has improved performance across a large range of simulation settings.
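
The following sketch shows the ridge projection that Ridge-HOLP screening is built on in the p >> n setting; the fixed penalty r and screening size k are placeholder values, and the adaptive, data-driven choice of the penalty that defines Air-HOLP is not reproduced here.

```python
# Ridge-HOLP screening: keep the k features with the largest |beta| from the
# ridge projection X'(XX' + rI)^{-1} y, which only requires an n x n solve.
import numpy as np

def ridge_holp_screen(X, y, r=10.0, k=20):
    n, p = X.shape
    beta = X.T @ np.linalg.solve(X @ X.T + r * np.eye(n), y)   # p-vector, O(n^2 p)
    return np.argsort(np.abs(beta))[::-1][:k]                  # indices of top-k features
```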

Sequential Monte Carlo for the discretely observed multivariate Hawkes process with an application to terrorist activity modelling

Jason James Lambe

Patterns of terrorist activity in a given geographic region commonly exhibit temporal clustering. The multivariate Hawkes process (MHP) is a popular point process model of terrorism across multiple geographic regions, as it incorporates both self-excitation and mutual excitation of events from finitely many processes. When event times are precisely known, likelihood-based inference can be used to fit the MHP. However, as is typically the case with terror attack data, only the total event counts in disjoint observation windows may be observed. When the MHP is discretely observed, the likelihood function of the MHP is analytically intractable, rendering likelihood-based inference unavailable. We design an unbiased estimate of the intractable likelihood function using sequential Monte Carlo (SMC), based on a representation of the unobserved event times as latent variables in a state-space model. The unbiasedness of the SMC estimate allows for its use in place of the true likelihood in a Metropolis-Hastings algorithm, from which we construct a Markov Chain Monte Carlo sample of the distribution over the parameters of the MHP. Using simulated data, we assess the performance of our method and demonstrate that it outperforms an alternative method in the literature, based on mean squared error. The proposed estimation method is illustrated in an application to recent data on terrorist activity in Afghanistan and Pakistan.
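
A generic pseudo-marginal Metropolis-Hastings sketch of the idea described above is given below: the intractable likelihood is replaced by an unbiased SMC estimate. Here `smc_loglik_estimate` is a hypothetical placeholder for the particle filter over the latent event times, and the random-walk proposal and flat-looking prior interface are illustrative assumptions.

```python
# Pseudo-marginal Metropolis-Hastings using an unbiased log-likelihood estimate.
import numpy as np

def pseudo_marginal_mh(theta0, smc_loglik_estimate, log_prior, n_iter=5000, step=0.1):
    rng = np.random.default_rng(1)
    theta = np.asarray(theta0, dtype=float)
    loglik = smc_loglik_estimate(theta)                 # unbiased SMC estimate at the start
    samples = []
    for _ in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)
        loglik_prop = smc_loglik_estimate(prop)         # fresh estimate at the proposal
        log_alpha = loglik_prop + log_prior(prop) - loglik - log_prior(theta)
        if np.log(rng.uniform()) < log_alpha:
            theta, loglik = prop, loglik_prop           # keep the estimate with the accepted state
        samples.append(theta.copy())
    return np.array(samples)
```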

Evaluating the cross-panel transferability of machine learning models for predicting panel nonresponse

John Collins

Nonresponse is a critical issue for data quality in panel surveys. Many researchers have demonstrated the potential of machine learning models to predict nonresponse, which would then allow survey managers to pre-emptively intervene with low-propensity participants. Typically, modelers fit their machine learning models to panel data that has accumulated over several waves and report which algorithm and variables yielded the best predictive results. However, these studies do not tell a manager of a yet-to-commence panel survey which technique is best for their own context (e.g., annual vs. quarterly waves, household vs. individual sampling). Studies have shown mixed results regarding the performance of nonresponse prediction in different panel contexts. In addition, there is considerable variation in which prediction technique (e.g., algorithm and variables) performs best across survey settings. It is thus unclear under which conditions predictive models successfully identify nonresponders and which techniques are best suited to which contexts.

To address the question of cross-panel generalizability, we compare machine learning-based nonresponse prediction across five panel surveys of the general German population: the Socio-Economic Panel (SOEP), the German Internet Panel (GIP), the GESIS Panel, the Mannheim Corona Study (MCS), and the Family Demographic Panel (FREDA). We evaluate how differences in the design of the surveys and differences in the sample composition (e.g., average sample age and income) impact the characteristics of the best-performing machine learning model (e.g., the best algorithm, accuracy scores, and the most predictive variables). We compare which (types of) variables and algorithms are the most predictive across these contexts. We also evaluate how well techniques from one survey transfer to a different survey context. Our analysis shows the extent to which practitioners can expect the modeling techniques of one survey to generalize to their own context and the factors which might inhibit generalizability.

Maintenance Optimization for Latent Degradation Systems

Connor Maurice John Stewart-Green

Few issues within industry are as ubiquitous as the costs involved in degradation and repair; whether upkeep of an assembly line, the changing of a light bulb, or the calibration of equipment, maintenance is constantly ongoing. Consequently, maintenance is often the largest source of costs that many industries must deal with. To address this, various strategies have been developed to guide maintenance decisions, aiming to minimize costs while extending the lifespan of equipment. Common strategies include those based on a system’s age, a fixed schedule, or observable conditions.

Of the relevant strategies, Condition-Based Maintenance (CBM) stands out for its cost-effectiveness, making it highly desirable in many industrial settings. However, its reliance on observable conditions significantly restricts its applicability. In real-world industrial contexts, breakdowns often occur with no prior warning, rendering CBM unreliable. Such sudden breakdowns are typically caused by the accumulation of latent damages within the system, a phenomenon known as latent degradation. Since this degradation is not directly observable, analysts rely on characteristics associated with system performance, known as markers, to infer its latent state. Past studies have attempted to explore this type of degradation through a CBM lens, utilizing a gamma process model with perfect repair actions. In this study, we build upon these efforts by incorporating imperfect repair actions into a bivariate gamma process model to minimize maintenance cost by strategically determining inspection times and preventive thresholds as key decision variables.
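
As a toy illustration of the latent-degradation setting, the sketch below simulates a gamma degradation process that is observed only through a noisy marker at inspection times, with a preventive threshold on the marker; the univariate process, the simple marker model and all parameter values are assumptions for illustration, not the bivariate model of the study.

```python
# Latent gamma degradation observed via a noisy marker at periodic inspections.
import numpy as np

rng = np.random.default_rng(0)
dt, shape_per_unit, scale = 1.0, 0.5, 1.0     # inspection interval and gamma increment parameters
prev_threshold, fail_level = 6.0, 10.0        # preventive (marker) and failure (latent) levels

degradation, t = 0.0, 0.0
while degradation < fail_level:
    t += dt
    degradation += rng.gamma(shape_per_unit * dt, scale)   # latent damage accumulates
    marker = degradation + rng.normal(0, 0.5)              # observable marker of the latent state
    if marker > prev_threshold:                            # condition-based preventive repair
        print(f"preventive maintenance at t={t:.0f} (marker {marker:.2f})")
        break
else:
    print(f"failure reached at t={t:.0f} (degradation {degradation:.2f})")
```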

Careers Session 1

Chair: Muskaan

Day 2

Keynote 2

Chair: Shih Ching Fu

Solving Crime Faster with Data and Technology

Arlene Mavratsou APM, Assistant Commissioner, WA Police Force

In this presentation I will focus on leveraging data and technology to assist with crime-solving capabilities as well as ensuring community safety. It contributes to and works in partnership with current police practices to improve outcomes in identifying, locating and associating people of interest. The future of crime-solving is reliant on a collaborative model with frontline police officers, tactical analysts and data specialists working together to enable a rapid response to help solve crime faster.

Session 2.1: Bayesian Modelling

Chair: Lucy Conran

Examining the relationship between age and Plasmodium falciparum Parasite Rate (PfPR) across sub-Saharan Africa

Yuval Berman

Background: The Plasmodium falciparum Parasite Rate (PfPR) is a commonly reported metric of malaria transmission, defined as the proportion of the population carrying asexual blood-stage parasites. Although used as a proxy for transmission, in reality the transmission-PfPR relationship is strongly moderated by age: rising after birth to a plateau in late childhood, before declining as immunity develops. To be meaningful, observations of PfPR must be placed in their demographic context.

Methods: We compiled 218,635 observations of PfPR in children aged 6 to 59 months from 50 cross-sectional surveys in Africa, spanning 2010-2022. For each survey, stratified by rurality, we used Bayesian MCMC methods to fit an established two-parameter mechanistic model describing PfPR as a function of age in young children. For a select number of countries, we also parameterised an agent-based model, OpenMalaria, to interrogate the role of intervention scale-up in changing the age structure of PfPR.

Results: We find empirical and modelled evidence that the age structure of malaria infection varies with geography and endemicity, and that this structure has changed through time. A biological interpretation of our model suggests the intensity of transmission is linked to the steady state (i.e. plateau) reached by the population: we recovered this relationship in rural settings, but found systematic deviations from expected behaviour in urban and low-endemicity areas. We show our setting-specific parameterisation outperformed generic algorithms 89% of the time.

Conclusion: This study provides the first comprehensive analysis of age patterns in PfPR collected in young children across malaria-endemic Africa. Our findings demonstrate that the age-PfPR relationship is context-dependent – meaning that care should be taken when comparing PfPR observations. Where possible, using setting-specific standardisation provides superior accuracy. Agent-based models provide useful tools for understanding the impact of malaria control on the age-structure of malaria infection.

Missed Adventures into Missing Data

Zoe Coffa

Missing assessment grade data is an accepted reality for most high school teachers across Australia. Due to minimal guidance and policy in this area, ad-hoc imputation methods are often used to impute missing data without any consideration of why the data are missing in the first place. This research essay reframes mark estimation as a missing data problem that requires understanding of the three missing data mechanisms - MCAR, MAR and NMAR. Through a simulation study of n = 1000 simulated teacher markbooks, various imputation methods were compared for performance and ease of use. The study found that, without truly knowing the missingness mechanism, a definitive imputation procedure could not be recommended, but it offers a few suggested approaches that could be adopted across school systems.

Gaussian mixture models via histogram-valued data

Hakiim Jamaluddin

Gaussian Mixture Models (GMMs) are extensively used in various fields due to their flexibility in modeling complex data distributions. Despite their popularity, GMMs encounter numerical challenges, particularly with large datasets. This study proposes a solution by incorporating missing data structures via auxiliary variables, which is crucial for mitigating these numerical issues.

Frequentist and Bayesian methodologies provide ways to handle GMMs with auxiliary variables. The Expectation-Maximisation (EM) algorithm represents the frequentist approach, while Bayesian techniques use Markov chain Monte Carlo (MCMC) methods. However, the computational burden of large datasets remains a significant challenge for both approaches.

To address this, we propose transforming classical data into symbolic data, such as histogram-valued data. This transformation reduces computational complexity while preserving essential statistical properties. Grouping observations into bins assumes a uniform within-group distribution, which must be reflected in the auxiliary variables’ values.

We introduce an MCMC approach tailored for GMMs using histogram-valued data with missing structures. This method effectively samples auxiliary variables while maintaining distributional assumptions. Inspired by symbolic likelihood functions for univariate random-bin histograms, our approach significantly reduces the computational complexity of GMM via MCMC from \(\mathcal{O}(nK)\) to \(\mathcal{O}(BK)\), where \(n \gg B\). Simulations show that our method achieves inference accuracy comparable to classical GMMs but at a much lower computational cost, validated through density estimation inference.

Moreover, the symbolic GMM approach demonstrates improved inference accuracy as the number of bins increases. Random-bin histograms provide more accurate inferences with fewer bins than fixed-bin histograms. Despite the disadvantage in MCMC chain mixing—where symbolic GMMs require Metropolis-Hastings sampling, leading to slower mixing—the computational efficiency of symbolic GMMs makes them a superior alternative for large datasets. This study underscores the potential of symbolic GMMs to enhance computational efficiency without compromising inference accuracy, paving the way for more efficient handling of large-scale data in statistical modeling.

Session 2.2: Career Experiences and Journeys

Chair: Daisy Evans

Playing in the Cosmic Backyard: A Statistician’s Journey into Astronomy

Shih Ching Fu

You may have heard that famous quotation from John Tukey: “The best thing about being a statistician is that you get to play in everyone’s backyard.” or another in a similar vein by David R. Brillinger: “Don’t forget that statisticians are the free-est of all scientists, they can work on anything.” Yet, a statistician at any particular time or place must choose a particular backyard or domain of science to work in. During this current season of my career, I have chosen to work in so-called astrostatistics. This resurgent field sees the revival of an ancient partnership between astronomy and statistics. Astronomers have always been great at collecting data, but now, this data collection proceeds at a rate that far outstrips their capability to process, analyse, and interpret their results without the help of statistical expertise.

In this talk, I want to persuade the journeying statistician that a cross-disciplinary career in astronomy, astrostatistics, or simply working with astronomical data is worth considering. I will briefly outline the kinds of astronomical problems, data, and community that an applied statistician might encounter, pointing out both the pros and cons of working in this re-emergent field. I shall draw mostly from my own experience as a graduate student navigating a cross-disciplinary research agenda and refer interested readers to Feigelson (2016), Feigelson et al. (2021), Siemiginowska et al. (2019), and Eadie et al. (2019).

Astronomy has always been a field of science that captures the public interest, whereas statistics (sadly) does not garner similar levels of popular appeal. Increased collaboration between our fields may yield benefits for both sides.

The Influence of Statistics Anxiety on Academic Learning

Kunj Guglani

Imagine a classroom where the mere mention of statistics sends a wave of anxiety across the room, with students grappling with feelings of tension and apprehension. This scenario is all too common in statistics education, where statistics anxiety significantly impedes learning and academic performance. This study explores the sources and manifestations of statistics anxiety, its effects on students’ cognitive processes, and the subsequent impact on their performance and engagement in statistics courses.

Let’s talk about Kunj, a smart student who does well in her classes until she takes her first statistics course. Even though she tries her hardest, the worry and avoidance behaviors tied to stats anxiety start to affect her thinking and how well she does overall. Kunj’s story isn’t unique; it shows a common problem that needs new fixes beyond the usual mental health support.

Through the experiences of students like Kunj, this paper pinpoints the main reasons why people get anxious about statistics. These include negative past experiences, lack of confidence in mathematical skills, and the perceived complexity of statistical concepts. This paper highlights the detrimental effects of statistics anxiety on students’ learning outcomes, including decreased motivation, lower academic performance, and reduced ability to apply statistical knowledge in practical contexts. What’s more, it digs into what can happen if this anxiety isn’t dealt with over time. This can hold students back in both their studies and their jobs later on.

To mitigate the impact of statistics anxiety, the paper proposes several innovative strategies beyond traditional psychological support and counseling services. These advanced solutions include gamified learning modules, personalized learning paths, and AI-powered personalized learning assistants, among others. The findings suggest that addressing statistics anxiety will ultimately contribute to the development of statistically proficient and confident individuals.

PhD to Postdoc: Tips for taking the next step

Melissa Middleton

Reaching the end of your PhD is a momentous time, filled with a whirlwind of emotions—exhaustion from the final push, excitement as your years of hard work come together, and a profound sense of accomplishment. However, as you look ahead and ask, “What comes next?”, the transition to a postdoc can feel both thrilling and daunting.

In this talk, I’ll discuss the key similarities and differences between PhD and postdoc positions, offering practical strategies to ease this transition. I’ll share personal insights and strategies that helped me navigate this phase, along with the challenges I encountered and how I overcame them.

Session 2.3: Biostatistics

Chair: Johnny Lo

DORMOUSE - Development of a novel perioperative respiratory risk management optimisation tool using statistical evidence

Daisy Evans

16,000-18,000 children undergo general anaesthesia at Perth Children’s Hospital each year, of whom approximately 15% experience perioperative respiratory adverse events (PRAE). Although most PRAE pass without long-term consequences, some may lead to catastrophic injuries or even death. We aim to make anaesthesia safer for children through the development and validation of a clinical decision support tool. This tool will assist anaesthetists to make appropriate perioperative patient management decisions, including the level of required care, and interventions to reduce the incidence of PRAE.

Clinical risk prediction tools estimate the risk an individual patient has of an outcome based on predictors, including patient characteristics. They can be used to assist clinical decision making and thus improve patient care. Barriers to uptake of clinical risk prediction tools into practice include lack of awareness or understanding of the tool, negative preconceptions, and lack of organisational support. Tools can also be too complicated to implement (e.g. if the required data is difficult or time-consuming to access) or lack generalisability to specific clinical contexts.

Current risk prediction tools for paediatric PRAE lack prediction accuracy and generalisability to the Australian context. Previous tools were built using logistic regression modelling; however, there are other methods for developing clinical prediction and decision support tools. Bayesian network models have the advantage that they can be easily visualised in an interpretable graphic, and possible interventions can be incorporated into their design, which is ideal for clinical decision support.

We propose the development of a risk prediction tool prioritising clinical utility and rigorous statistical methods. Different procedures (generalised linear modelling, Bayesian network analysis, and machine learning methods) will be compared for performance and interpretability. By considering alternative statistical methods to build the clinical decision support tool, we will optimise its translatability and reliability.

Causes and Consequences of White Matter Hyperintensities in the Brain

Marnie Petrucci

White matter hyperintensities (WMHs) are important markers of cerebral small vessel disease (CSVD), associated with heightened risks of stroke, cognitive decline, gait impairment, and dementia. This study explores the distinct clinical and genetic associations of two regional subtypes of WMHs—periventricular WMHs (PVWMHs) and deep WMHs (DWMHs)—using imaging data from around 67,000 participants from the UK Biobank (UKB). By conducting genome-wide association studies (GWAS) on PVWMHs and DWMHs individually, we seek to expand on previous research analysing WMHs as a single phenotype and to identify specific genetic variants potentially overlooked in earlier studies due to smaller sample sizes. Furthermore, this study investigates the relationships between PVWMHs and DWMHs, their genetic basis, and clinical outcomes, including cognitive function, stroke risk, and gait impairment. Post-GWAS analysis, including fine mapping and functional annotation of identified WMH-associated genomic regions, will enable the investigation of the underlying biological pathways. Mendelian Randomisation (MR) studies will investigate links between risk factors and regional WMH formation and their impact on stroke and dementia. This research is expected to provide crucial insights into the pathophysiology of CSVD, offering opportunities to refine risk prediction models and develop targeted interventions to mitigate the adverse health outcomes associated with WMHs. Finally, this study addresses the ever-increasing public health challenges CSVD poses in aging populations, potentially providing strategies to improve health outcomes through precision medicine approaches focused on modifiable risk factors.

Automating Statistical Methods for Detecting Problematic Data in Randomized Controlled Trials in Medical Research

Ling Shan Au

Background: Recent estimates suggest that nearly a third of published randomized controlled trials (RCTs) in medical research rely on data whose integrity is questionable, partly due to “zombie” RCTs where underlying data do not exist. Since this compromises subsequent meta-analyses and the resulting clinical guidelines, there is an urgent need for stricter data validation. This presentation provides an overview of established statistical methods for detecting problematic data in peer-reviewed papers, along with artificial intelligence (AI)-based approaches we’ve tested to automate and expedite this process. It offers biostatisticians valuable insights into how they can easily contribute to ensure the trustworthiness and authenticity of the data they handle.

Methods: We explored AI-based approaches to speed up and automate (1) the TRACT checklist for trustworthiness of RCTs; (2) Carlisle’s method; (3) Benford’s Law; and (4) the test for over-representation of even numbers to identify problematic data, focusing on the data table extraction and statistical programming analysis phases. We sourced and evaluated open-source AI data extraction tools and commercialized large language models (LLMs) to automate the three mentioned statistical methods.

Anticipated Outcomes: We tested ten different open-source AI data extraction tools and LLMs. Among these, GPT-4o, Claude 3.5 Sonnet, PDFlux, and TableGPT demonstrated promising performance and potential for further refinement to enhance data table extraction results. Additionally, we have developed statistical programs using R and Python to automate the statistical methods. This will allow us to combine and streamline these processes into a single, fully automated workflow driven by machine learning, resulting in a statistical report template that provides sufficient information for biostatisticians and researchers to assess the likelihood of data falsification in the evaluated publications. This will allow data integrity assessment as a mandatory step in the research workflow to ensure valid and reliable analytical results.

Implementing the Estimand Framework in Physiotherapy Trials

Peixuan Li

Background: Events that occur after randomisation, such as the use of rescue treatments, additional treatments or therapies (co-interventions), treatment discontinuation or disease progression are common in physiotherapy trials. These intercurrent events can affect the interpretation or the existence of measurements associated with the outcome of interest. However, they are often inadequately addressed, leading to misinterpretation of trial findings. Using the estimand framework introduced in the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) E9 (R1) Addendum, we discuss common intercurrent events in physiotherapy trials with strategies for handling them to answer different research questions.

Methods: We describe how the estimand framework can be used in physiotherapy trials to precisely define the treatment effect to be estimated using the five key attributes of an estimand: treatment conditions, population, variable, summary measure and intercurrent event handling, and demonstrate the application of the estimand framework using a physiotherapy trial in knee osteoarthritis.

Anticipated Outcomes: Implementing the estimand framework in physiotherapy trials will help align the design, conduct, analysis and interpretation of results with the trial objectives. This approach enhances the clarity of what is being estimated and its clinical relevance, leading to trials accurately answering the clinical questions of interest.

Comparison of methods for estimating treatment efficacy under incomplete compliance data

Cameron James Patrick

Background: Randomised controlled trials (RCTs) often target a treatment policy (or intention-to-treat) estimand. For many interventions there is not perfect compliance with the randomised treatment assignment. In these cases it can also be of interest to estimate the effect that would have been observed had participants complied with the intervention, sometimes called the hypothetical or per-protocol estimand. The literature so far has focussed on the case where compliance is fully observed for every participant. However, it is common for RCTs to feature missing data, and often the compliance variable is not fully observed.

Methods: A simulation study will be conducted to examine the performance of common methods for handling missing data when used in conjunction with common causal inference methods targeting the hypothetical estimand, under various data generating processes and missingness mechanisms. The missing data methods considered are complete case analysis, multiple imputation of compliance as a binary variable, multiple imputation of compliance as a continuous variable, and inverse probability weighting. The estimators considered are the parametric g-formula and inverse probability of treatment weighting. The commonly used, but invalid, “naive per-protocol” estimator will also be compared.

Anticipated Results: Preliminary results suggest that, in combination, these methods will perform as expected from their individual properties: complete case analysis shows bias under Missing At Random (MAR) missingness, which is corrected using multiple imputation. Standardisation-based (parametric g-formula) estimates may have better precision than weighting-based estimates. Care may need to be taken if participants can have partial compliance, as commonly used methods target different estimands in this case.
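
As a rough sketch of the inverse probability weighting idea on simulated data (an illustration only, not the simulation study described above), compliance in the treatment arm is modelled given a baseline covariate and treated compliers are re-weighted to resemble the full arm:

```python
# Toy IPW estimate of the hypothetical (full-compliance) effect; true effect = 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)                               # baseline covariate
z = rng.integers(0, 2, size=n)                       # randomised assignment
comply = np.where(z == 1, rng.binomial(1, 1 / (1 + np.exp(-(0.5 + x)))), 0)
y = 1.0 * comply + 0.8 * x + rng.normal(size=n)      # outcome

# Probability of compliance given x, fitted within the treatment arm.
ps = LogisticRegression().fit(x[z == 1].reshape(-1, 1), comply[z == 1])
p_comply = ps.predict_proba(x.reshape(-1, 1))[:, 1]

treated_compliers = (z == 1) & (comply == 1)
effect = np.average(y[treated_compliers], weights=1 / p_comply[treated_compliers]) \
         - y[z == 0].mean()
print("IPW estimate of the hypothetical effect:", round(effect, 2))
```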

Session 2.4: Official Statistics

Chair: Trent Piccicacco

Modelling detailed food and non-alcoholic beverages expenditures: investigating possible methodologies for future expenditure surveys

Matthew Xu

Household expenditure surveys provide valuable microdata for estimating spending patterns, CPI weighting, and producing key indicators of Australians’ living standards and economic wellbeing. However, these surveys often require detailed household expenditure diaries which can be burdensome for respondents. In this investigation, we develop a prototype methodology that can provide modelled 8-digit Household Expenditure Classification (HEC) expenditures should future Household Expenditure Surveys (HES) cease collecting these details.

Taking patterns from historical HES and Survey of Income and Housing (SIH) data, we establish a modelling approach for the case of food and non-alcoholic beverages expenditures. We employ a combination of logistic regression, linear regression, Monte Carlo methods, as well as deflation and normalisation techniques to produce simulated micro datasets at the 8-digit HEC level. To assess its robustness, we perform backtesting on historical HES and SIH data, comparing our modelled estimates with actual survey data. While our focus was on food and non-alcoholic beverages, this methodology is adaptable to other household expenditure categories and can be used in other scenarios that require imputation based on historical data.

Should the HES maintain the collection of detailed 8-digit HEC data through diaries, our prototype methodology could still be valuable for enhancing the quality of survey collections, particularly for lower-quality reported household expenditures.
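
A toy sketch of the two-part modelling idea is shown below, with hypothetical variables rather than HES/SIH items: a logistic model for whether a household spends on an item, a linear model for the log amount among spenders, and Monte Carlo draws to simulate expenditure for a new household.

```python
# Two-part (logistic + linear) model with Monte Carlo simulation of expenditure.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
income = rng.lognormal(11, 0.5, n)
X = sm.add_constant(np.log(income))
spends = rng.binomial(1, 1 / (1 + np.exp(-(0.3 * np.log(income) - 3))))
log_amount = 0.6 * np.log(income) + rng.normal(0, 0.4, n)

part1 = sm.Logit(spends, X).fit(disp=0)                        # incidence of spending
part2 = sm.OLS(log_amount[spends == 1], X[spends == 1]).fit()  # amount given spending

# Simulate item expenditure for a new household with a given income.
x_new = np.column_stack([np.ones(1), np.log([75000.0])])
sim = rng.binomial(1, part1.predict(x_new), 5000) * \
      np.exp(part2.predict(x_new) + rng.normal(0, np.sqrt(part2.scale), 5000))
print("simulated mean expenditure:", sim.mean())
```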

Managing Workload Challenges in Household Surveys: Strategies for Effective PSU Reformation and Overlap Control

Muskaan

In household surveys, regions are typically divided into Primary Sampling Units (PSUs), each containing a predetermined number of dwellings to be surveyed. Following each census, PSUs are restructured to accommodate housing growth. However, rapid and disproportionate increases in the number of dwellings can occur before reformation, leading to an overwhelming rise in the number of houses requiring surveys. This surge in workload poses significant challenges for the collections team, particularly when growth exceeds manageable limits.

This presentation outlines the context of these challenges and presents our work plan for addressing them. We focus on strategies for managing the collections team’s workload while ensuring continuity with previous survey quarters. Additionally, we discuss our approach to overlap control during the PSU reformation process, ensuring that newly created PSUs are primed for selection and have not previously been used in surveys. By implementing these strategies, we aim to enhance the efficiency and effectiveness of household surveys, ultimately contributing to more reliable data collection in rapidly evolving housing environments.

Singapore’s Second National Suicide Survey

Min Han Ho

In 2024, Singapore’s second countrywide survey on suicide, following the first study conducted in 2022, captured the voices of 5,274 local citizens, covering the rich tapestry of the diverse, multi-racial country. There was a need for this, given that the year before, Singapore recorded its highest number of suicides in 20 years. Serving non-profit Samaritans of Singapore, my 139 fellow statistical peers and I, led by Professor-general Rosie Ching, engaged families, friends and the public with various degrees of connection to suicide. We gathered primary data on local attitudes towards those directly, indirectly, or not at all affected by suicide and, based on the many questions, built the Suicide Stigma Index for the second time in two years. Our statistical analysis showed, most alarmingly, that together with the majority who believe that raising the subject of suicide could cause a person to think about it, 8 in 10 think that when someone does talk about suicide, that person could take their life. There is an actual rise in those who believe that most suicides happen suddenly without warning and that a person dying by suicide was one who was unwilling to seek help. The silver lining is drawn from the 90% who believe that suicide can be prevented. Yet, two in three would not support someone in a crisis, with 71% citing their fear of worsening the situation with their lack of knowledge and ability to provide support.

Unlike Australia, Singapore has never had a national suicide prevention strategy, and the sentiment towards such a need in fast-paced Singapore is very strong now. With a prevailing 81% who still believe in the existence of stigma associated with suicide in Singapore, we need to dismantle this with statistics to shatter the taboo and pave the way for a more compassionate and informed society.

Session 2.5: Data Management and Economic Analysis

Chair: Eben Afrifa

Automating workflows and the benefits of working as a data analyst within a clinical research team

Anwen Brooke Taplin

Since starting as a data analyst within a team of clinical and medical research staff, I have been asked to analyse data from various spreadsheets with hidden calculations, shortened variable names and non-uniform formatting. These spreadsheets were often put together by experienced clinicians or research assistants who spent hours manually transcribing data from multiple sources (e.g. medical charts, equipment monitoring reports, or test results) into a single document with little explanation as to how calculations were done or the sources of each measurement. Along with taking much more time than automated data collection processes, manual transcription makes random human errors all but inevitable, and these are difficult to track without good documentation.

Being embedded within the research team instead of as an external data analyst or statistician means I have had the opportunity to understand and optimise manual data entry processes. This has allowed me to implement changes that have led to refinement of workflows which has saved my team hours of work and made tracking of errors much easier. Consequently, we have seen improvements in data quality and integrity.

Forming connections with members of my research team has been an integral part of my job, as it gives each member the confidence to bring their own expertise to problems. It is easy for both me and the team to get advice from each other at every stage of a project so that we end up with the best solution from both a clinical and data perspective. A large part of any applied statistician’s or data analyst’s job is understanding what a researcher wants, and this is made a lot easier when you are a part of the team.

A trend-cycle analysis of the evolutionary trajectory of Australian housing prices

Prince Osei Mensah

Time-frequency domain analysis of housing prices can provide insights into significant periodic patterns in the pricing dynamics for modelling and forecasting purposes. This study applied wavelet and information entropy analyses to examine the periodic patterns and evolution of housing prices in Australia’s eight capital cities from 1980 to 2023, using quarterly median house pricing data. Our findings revealed consistent patterns of higher variability in housing prices at high frequencies corresponding to scales up to four quarters and at different segments of the study period across all the cities, indicating that short-term price fluctuations were more significant than long-term changes. Notably, Melbourne and Darwin exhibited high price volatility in the early part of the study period between the mid-1980s and 1990. House prices in Brisbane and Perth exhibited cyclical patterns with periodicities lasting for three and a half quarters between the mid-2000s and 2010 for Perth and up to seven and a half quarters between early 2000 and 2010 for Brisbane, indicating that the two cities experienced recurring periods of growth and decline. Coherence analyses revealed strong dynamic lead-lag positive and negative relationships, especially between the housing prices of Sydney and its pairings with Melbourne, Brisbane and Canberra, suggesting the prices are interconnected but not always synchronised. These findings provide insight into the dynamic nature of the interdependencies among the housing markets in Australia’s major cities, which can aid policymakers, investors, and other stakeholders in making informed decisions related to economic forecasting, real estate investment portfolio diversification and strategic planning within these markets.
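
As a brief sketch of the time-frequency machinery described above, the following applies a continuous wavelet transform (PyWavelets) to a synthetic quarterly series; the series, the Morlet wavelet and the range of scales are assumptions for illustration, not the study’s data or settings.

```python
# Continuous wavelet transform of a synthetic quarterly price-like series.
import numpy as np
import pywt

rng = np.random.default_rng(0)
quarters = np.arange(176)                                  # 1980-2023, quarterly
series = 0.02 * quarters + np.sin(2 * np.pi * quarters / 14) + rng.normal(0, 0.3, 176)

scales = np.arange(1, 33)                                  # roughly up to 8-year cycles
coeffs, freqs = pywt.cwt(series, scales, 'morl', sampling_period=1.0)
power = np.abs(coeffs) ** 2                                # wavelet power by scale and quarter
dominant_scale = scales[power.mean(axis=1).argmax()]
print("dominant scale (quarters):", dominant_scale)
```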

Careers Session 2

Chair: Lucy Conran

Data Science and Statistics in Industry: Career Journeys

Panelists:

  • Fiona Evans, Senior Managing Consultant Statistician, Data Analysis Australia.
  • Alex Jenkins, Director, WA Data Science Innovation Hub.
  • Natalia Kacperek, Chief Data Officer, WA Public Sector Commission.

Day 3

Keynote 3

Chair: Daniela Vasco

Highlights from a career in applied statistics

Russell Thomson

In this talk, I will describe some of the more interesting projects from a 25-year career in applied statistics. Some projects are interesting for the methods they combine, such as random forests, structural equation modelling, mixed effects models and modern multivariate techniques. Other projects are interesting because of the problems behind them, such as automated watering systems, evaluating marine parks and childhood predictors of adverse adult health conditions. I will also talk about the intersection between statistics and my other favourite topic, music.

Session 3.1: Consulting and the Environment

Chair: Claudia Rivera

Outlier Detection in Multivariate Environmental Statistical Data

Dila Ram Bhandari

Environmental statistics outline environmental trends and conditions. They aim to provide high-quality statistical information to improve knowledge of the environment, to support evidence-based policy and decision-making, and to inform the public as well as specific user groups. The sources of environmental statistics are statistical surveys, administrative records, remote sensing and thematic mapping, monitoring systems, and scientific research. Ensuring data quality and reliability in environmental studies is contingent upon the detection of outliers in multivariate environmental statistical data. Environmental datasets frequently display complicated multivariate relationships and are prone to outliers that can skew statistical analyses and modelling results. This abstract highlights the challenges presented by high-dimensional and correlated data structures and examines several statistical and machine-learning techniques used for outlier detection in environmental data. Modern approaches including isolation forests, local outlier factor (LOF), and one-class support vector machines (SVM) are explored alongside classical methods such as robust statistical measures, distance-based techniques, and clustering algorithms. The effectiveness of these methods is evaluated in the context of real-world environmental datasets, considering factors such as interpretability, computational efficiency, and scalability. Furthermore, the integration of domain knowledge and the adaptation of outlier detection techniques to specific environmental contexts are discussed as essential strategies for improving detection accuracy and relevance. Ultimately, the abstract highlights the importance of reliable outlier detection techniques in improving the accuracy and usefulness of environmental statistical analyses.

Ecological impact of climate change and heatwaves on fish resources: a review of evidence to guide the development of a scalable predictive modelling framework for adaptive management along Western Australian coastline.

Yaw Kwaafo Awuah-Mensah

Climate change and associated marine heatwaves pose significant threats to ocean biodiversity, with global temperatures rising 1°C since pre-industrial times. This critical review synthesizes current knowledge on climate change’s ecological impacts on fish resources, examining physiological effects, habitat alterations, distribution shifts, food web dynamics, and synergistic effects with other stressors. Evidence indicates continued ocean warming, projected to reach 4.8°C by 2100 under high emission scenarios (RCP8.5), increasing metabolic demands and breaching thermal limits of fish, affecting growth and reproduction. Community compositions shift due to poleward migrations, while ocean warming and acidification exacerbate coral bleaching, and coastal salinity changes promote ecosystem disruptions. Climate change’s synergistic effects with deoxygenation, pollution, and habitat destruction are expected to amplify impacts on fish resources, leading to significant socioeconomic implications and underscoring the urgent need for adaptive management strategies to ensure marine ecosystem and fisheries sustainability.

Venturing into Statistical Consulting for a Large Research Team

Fiona J McManus

“Can you estimate a sample size for our new trial?” “We’re starting recruitment next week. Can you send across a randomisation list?” “How should I analyse the data for my study?” These are some of the many requests I have received as a consultant biostatistician since graduating with a Master of Biostatistics in 2020. For several years, I have provided biostatistical support for a multidisciplinary centre renowned globally as leaders in their specialist area of research. With over 20 staff, there are numerous competing demands to juggle in a limited amount of time. This can leave little room for progressing my own career goals. However, consulting for multiple trials running simultaneously and responding to queries from researchers is part of the job description for many statistical consultants. So, in this role, it has been necessary to find a balance between providing support and growing professionally as a biostatistician.

In this presentation, I will share my experience of working in a high-performance team and the processes and strategies I have found useful in managing my responsibilities more efficiently and proactively while still finding scope to develop professionally.

I’ll present examples of how I handled the challenge of multiple trials requiring analysis at the same time, how I’ve managed repetitive statistical queries across the team, and how I turned a journal reviewer’s request into a learning opportunity and a statistical conference presentation.

Session 3.2: Environment and Agriculture

Chair: Shamali Sujeewa Kumari Pradana Mudiyanselage

Towards an AI Agronomist: Fast and Efficient Methods for Predicting Crop Growth

Andrea Powell

Statistical emulation has been used in a range of physical and environmental disciplines to replace computationally taxing process models with efficient surrogates. In this work we present a prototype emulator that uses a hybrid statistical and deep learning approach, combining the flexibility and power of deep neural networks with a statistical approach to uncertainty quantification. The emulator is applied to the problem of forecasting wheat above ground biomass simulated by the Agricultural Production Systems sIMulator (APSIM), a state-of-the-art biophysical crop model. Widely used in agricultural research, APSIM requires the user to specify calibrated parameters to control the environmental and management settings of the crop and is driven by variable climate inputs. Generation of uncertainty estimates requires perturbation of the simulator inputs and parameters which increases the computational cost associated with forecasting. Our emulator consists of an informative dimension reduction to summarise the wheat biomass time-series as a set of parameters, a feed forward neural network to learn the relationship between exogenous meteorological and management variables and the wheat biomass summary parameters, and a sampling process to generate empirical confidence intervals for the forecasted crop growth. We observed good out-of-sample performance from the emulator, both in the forecasts of the wheat biomass time series and the coverage of the confidence intervals. Our results demonstrate the utility of our hybrid statistical and deep learning emulator and provide an alternative means of quickly and efficiently predicting simulated wheat biomass from APSIM.
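
The general hybrid idea described above (dimension reduction of the simulated output curves, a feed-forward network from exogenous drivers to the summary, and resampling for empirical intervals) can be sketched roughly as follows; all arrays are synthetic placeholders and this is not the APSIM emulator itself.

```python
# Simplified sketch of a hybrid emulator: PCA summary of simulated biomass curves
# + a feed-forward network from drivers to the summary, with residual resampling
# for rough pointwise intervals.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
n_runs, n_times, n_drivers = 400, 60, 8
drivers = rng.normal(size=(n_runs, n_drivers))           # stand-in climate/management inputs
biomass_curves = np.cumsum(rng.gamma(2.0, 5.0, size=(n_runs, n_times)), axis=1)

pca = PCA(n_components=3).fit(biomass_curves)            # low-dimensional summary of each curve
summary = pca.transform(biomass_curves)

net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
net.fit(drivers[:300], summary[:300])                    # train on 300 simulator runs

# Rough empirical intervals by resampling training residuals (one simple option).
resid = summary[:300] - net.predict(drivers[:300])
preds = net.predict(drivers[300:])                       # summaries for 100 held-out runs
draws = preds[:, None, :] + resid[rng.integers(0, 300, size=200)][None, :, :]
curves = pca.inverse_transform(draws.reshape(-1, 3)).reshape(len(preds), 200, n_times)
lower, upper = np.percentile(curves, [2.5, 97.5], axis=1)  # pointwise 95% bands per run
```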

Where is the best habitat for endangered black cockatoos?

Fiona Scarff

Understanding where different organisms can live is a central preoccupation of ecology, which has gained urgency with the need to set aside areas for conservation, and as climate change imposes rapid shifts in where it is possible for species to thrive. Information on the locations at which different species have been observed has become widely available, together with gridded spatial data on a wide range of environmental characteristics. Habitat suitability models are trained on these data to capture the association between an organism and its environment, and then projected to maps to identify valuable habitat now and in the future. We present habitat suitability models for the forest red-tailed and Carnaby’s black cockatoos, two iconic Western Australian species of conservation concern, based on 12 years of satellite tracking of tagged birds and a database of black cockatoo sightings. We used GLM, GAM, maximum entropy models, random forests, boosted regression trees and Bayesian additive regression trees to develop a picture of what the birds need from their environment, and how their distribution is likely to change in the future.
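
As a hedged sketch of one of the listed approaches, the snippet below fits a random forest to hypothetical presence/background points with environmental covariates and returns a suitability score per location; the column names are invented and this is not the authors’ full model ensemble.

```python
# Minimal habitat-suitability sketch: random forest on presence/background data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training table: one row per location, 1 = cockatoo record,
# 0 = background point, plus gridded covariates extracted at that location.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "presence": rng.integers(0, 2, size=1000),
    "annual_rainfall": rng.normal(700, 150, size=1000),
    "max_summer_temp": rng.normal(33, 3, size=1000),
    "tree_cover": rng.uniform(0, 1, size=1000),
})
covariates = ["annual_rainfall", "max_summer_temp", "tree_cover"]

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(df[covariates], df["presence"])

# Predicted habitat suitability (probability of presence) for new grid cells
# can then be written back to a raster to map current or future habitat.
suitability = model.predict_proba(df[covariates])[:, 1]
```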

Representing Bushfires with Data: Bridging the Gap Between Statistics and the Real World

Jason Rennie

The analysis and models we produce in government play an important role in informing the way that the organisation operates and makes decisions. To facilitate a meaningful outcome, our work needs to be accurate, explainable, and reproducible. One of the biggest challenges in modelling and making predictions around bushfires is deciding how best to represent the real-world complexities that govern bushfire behaviour in a way that is also practical to model.

An example of this problem arose during the development of a Bushfire First Attack Model which predicts the probability of controlling a fire within a certain time or area threshold. An important factor to consider when fighting a bushfire is traversability of the fire ground. Traversability describes how easy it is for the firefighting crew and their vehicles to access and manoeuvre around the fire ground. It’s easy to understand why this factor is important, but it’s much more difficult to represent the problem numerically. Fires can occur in flat grasslands or steep, rugged mountain forests or anything in between, a factor not adequately described by simply considering the elevation of the site. Additionally, reported fire locations are not precise to the same degree as the spatial datasets used, and if a fire spreads, it will spread into surrounding land. These factors make it necessary to consider the broader area surrounding the fire ignition location rather than a single point. To bridge this gap in information, we implemented several “traversability” features with the guidance of domain experts.

This presentation will show how we implemented these features to represent traversability using existing datasets. This demonstrates the importance of choosing features based on an understanding of the system being modelled.
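
A hypothetical illustration of the kind of neighbourhood-based feature involved is sketched below: a terrain-ruggedness value computed over a buffer around the reported ignition cell rather than at a single point. The feature, window size and data are assumptions for illustration, not the features used in the Bushfire First Attack Model.

```python
# Hypothetical "traversability"-style feature: standard deviation of elevation
# within a (2*radius+1)-cell window around the reported ignition cell.
import numpy as np

def ruggedness(dem: np.ndarray, row: int, col: int, radius: int = 5) -> float:
    """Std dev of elevation in a window around (row, col), clipped at the grid edges."""
    r0, r1 = max(row - radius, 0), min(row + radius + 1, dem.shape[0])
    c0, c1 = max(col - radius, 0), min(col + radius + 1, dem.shape[1])
    return float(np.std(dem[r0:r1, c0:c1]))

# Example with a synthetic elevation grid standing in for a real DEM raster.
dem = np.random.default_rng(3).normal(300, 40, size=(200, 200))
print(ruggedness(dem, row=100, col=100, radius=5))
```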

Impact of Intimate Partner Violence on mental health among married women in Sri Lanka: a study based on Women’s Wellbeing Survey-2019

Lakma Lakshani Gunarathne

The prevalence of intimate partner violence (IPV) against women is a public health problem with serious consequences for mental health wellbeing, particularly in low- and middle-income countries (LMICs). Sri Lanka, a nation classified as a LMIC, has a high prevalence of IPV among married women, but its impact on their mental health has not been adequately examined. The aim of this study was to examine the effects of IPV on mental health outcomes among married women in Sri Lanka, including psychological distress and suicidal ideation, while accounting for sociodemographic risk factors.

We analysed the data from 1,611 married women participating in the 2019 Sri Lankan Women’s Well-being Survey. A bivariate analysis was conducted to examine associations between mental health indicators, IPV experiences, and sociodemographic factors. Using survey-weighted logistic regression models, we evaluated the association between IPV and mental health outcomes while controlling for factors such as education levels, decision-making autonomy, and marriage age.

Findings revealed that 26% of married Sri Lankan women experienced IPV, 30% reported mental health issues, and 14% reported suicidal thoughts. Logistic regression models showed that women who experienced IPV had a significantly higher risk of poor mental health (AOR=2.88, 95% CI: 2.20–3.78) and suicidal ideation (AOR=5.84, 95% CI: 4.10–8.32) than non-victims. Education levels, decision-making autonomy, and older age at marriage were identified as protective factors against adverse mental health outcomes.

This study highlights the adverse impact of IPV on the mental health of married women in Sri Lanka, emphasising the need for interventions targeting gender-based violence and women’s empowerment. National-level interventions and policies promoting education, women’s empowerment, decision-making autonomy, and legal assistance, along with mental health support services like counseling and trauma-informed treatment, are essential to mitigate the adverse effects of IPV on the mental health of married women in Sri Lanka, while preventing IPV.

Session 3.3: Applied Statistics

Chair: Phoebe Fitzpatrick

Clustering approaches for linking large administrative datasets

Aymon Wuolanne

Probabilistic record linkage is a framework for identifying the same individuals on different datasets by producing a set of pairwise match probabilities informed by a statistical model. However, once we have pairwise match probabilities, how do we group the original records into clusters that (hopefully) refer to the same individual?

Pairwise match probabilities between records can be considered as the edges of a weighted graph. The simplest approach to forming clusters is to keep all edges of the graph where the match probability is above a given threshold, then form the connected components of the graph. However, the connected components approach suffers from a tendency to link together chains of records, which are then all grouped together, even if some record pairs within the cluster have a low match probability. In some applications, we may want to assume that there are no duplicates within the input datasets. In these cases, we can restrict the set of edges so that only records which are each other’s mutual best match are linked together - this enforces a one-to-one merge. This one-to-one approach has higher precision, but it cannot identify any duplicate records that may be present within the input datasets.

We’ve been investigating a two-stage approach which incorporates both connected components clustering and one-to-one clustering, with some variations. This aims to capture the benefits of both approaches: a very high threshold for the connected components stage still allows us to identify duplicates while avoiding false links, while the second stage allows us to identify extra links that fall below the threshold.
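
The two clustering rules described above can be sketched with networkx as follows; the record identifiers and pairwise probabilities are hypothetical, and this is only a compact illustration rather than the production linkage pipeline.

```python
# Sketch of connected-components clustering versus mutual-best-match (one-to-one) clustering.
import networkx as nx

pairs = [  # (record_a, record_b, match_probability) - hypothetical values
    ("a1", "b1", 0.99), ("b1", "c1", 0.85), ("a2", "b2", 0.60), ("b2", "a3", 0.58),
]

def connected_component_clusters(pairs, threshold):
    """Keep edges with probability above the threshold and return connected components."""
    g = nx.Graph()
    g.add_weighted_edges_from((a, b, p) for a, b, p in pairs if p >= threshold)
    return list(nx.connected_components(g))

def one_to_one_clusters(pairs, threshold):
    """Link two records only if each is the other's best match above the threshold."""
    best = {}
    for a, b, p in pairs:
        if p >= threshold:
            for u, v in ((a, b), (b, a)):
                if u not in best or p > best[u][1]:
                    best[u] = (v, p)
    return [{u, v} for u, (v, _) in best.items()
            if best.get(v, (None,))[0] == u and u < v]

print(connected_component_clusters(pairs, threshold=0.5))  # chains records together
print(one_to_one_clusters(pairs, threshold=0.5))           # enforces a one-to-one merge
```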

Investigating the effect of repeated patients in surgical outcomes research

Soyoon Annie Park

Evaluating post-operative outcomes using retrospectively collected health data often involves patients with multiple eligible operations. Including all operations can violate statistical independence assumptions. A common solution is to include only the first operation.

In a retrospective audit of postoperative outcomes before and after implementing the WHO surgical safety checklist (Moore et al., 2022), we found significant variation in measured mortality rates based on how repeat patients were accounted for. Specifically, our cohort in the 18 months before had a 90-day mortality rate of 3.6% when selecting the first operations compared to 4.2% when operations were randomly selected. The same cohort in the 18 months after had a 90-day mortality rate of 2.8% when selecting the first operations compared to 3.4% when operations were randomly selected.
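
For illustration, the two selection rules compared above might be implemented as below, using hypothetical column names for an operations table flagged with 90-day mortality; this is a sketch, not the audit’s analysis code.

```python
# Sketch of "first operation" versus "random operation" index-event selection.
import pandas as pd

def mortality_rate(ops: pd.DataFrame, rule: str = "first", seed: int = 0) -> float:
    """90-day mortality after keeping one index operation per patient."""
    if rule == "first":
        index_ops = ops.sort_values("operation_date").groupby("patient_id").head(1)
    elif rule == "random":
        index_ops = ops.groupby("patient_id").sample(n=1, random_state=seed)
    else:
        raise ValueError(rule)
    return index_ops["died_within_90_days"].mean()

# Hypothetical usage:
# ops = pd.read_csv("operations.csv", parse_dates=["operation_date"])
# print(mortality_rate(ops, "first"), mortality_rate(ops, "random"))
```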

Accurate reporting and analysis of post-operative outcomes is highly important in the field of surgical and peri-operative medicine, with mortality serving as a gold-standard objective measure for assessing performance. However, in longitudinal data sets where individuals may be exposed to an operation of interest multiple times, there is no clear guidance on how to select an index event.

Major surgical outcome studies often use the first operation selection approach (e.g., Jerath et al., 2020; Liew et al., 2020), possibly to minimise bias. This approach may underestimate mortality risk, as patients undergoing frequent surgeries are less likely to die during the first contact. This effect may vary with the study period, leading to inconsistent bias.

To our knowledge, no research has examined the impact of different methods for handling repeat patients on operative mortality. This study investigates these methods using health data from the Ministry of Health and Te Whatu Ora Te Toka Tumai. Our goal is to refine methods to accurately capture the true mortality risk associated with surgical procedures.

A ruler detection method for auto-adjusting scales of shoeprint images

Zhijian Wen

Digital shoeprint comparison often requires the calibration of the image resolution so that features, such as patterns in shoeprints, can be compared on the same scale. To enable scaling, a shoeprint photograph can be taken with a forensic ruler in the same frame to obtain the pixel distance between two nearby graduations. However, manually measuring the number of pixels is a time-consuming process. Additionally, the measurement process might not be conducted accurately when the image is noisy or there is distortion in the ruler. In this study, we present an automated ruler detection method for adjusting the image scale. We show that this method can accurately estimate the image scale with a mean absolute percentage error of 3%. We also conducted automated shoeprint retrieval experiments on scale-unadjusted shoeprint images to show how the automated image scaling might be used in a common forensic process. Our results from these experiments show an increase in the retrieval performance from 0.735 to 0.929 at \(S_1\) by employing this approach to adjust the shoeprint image scales.
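
As a hedged sketch of the scaling step only (the ruler-detection method itself is the contribution of this work and is not reproduced here), an image can be resampled to a common resolution once the pixel distance between adjacent 1 mm graduations has been estimated; OpenCV is assumed for the resize and the values shown are illustrative.

```python
# Rescale a shoeprint image to a common target resolution given an estimated
# pixels-per-millimetre value from the detected ruler graduations.
import cv2

def rescale_to_target(image, px_per_mm: float, target_px_per_mm: float = 10.0):
    """Resize the image so that 1 mm spans `target_px_per_mm` pixels."""
    factor = target_px_per_mm / px_per_mm
    return cv2.resize(image, None, fx=factor, fy=factor, interpolation=cv2.INTER_AREA)

# Hypothetical usage:
# img = cv2.imread("shoeprint.png", cv2.IMREAD_GRAYSCALE)
# scaled = rescale_to_target(img, px_per_mm=8.3)
```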

Advancing Dynamic Point Processes Modelling in Ecology

Shamali Sujeewa Kumari Pradana Mudiyanselage

This study develops and analyses spatial birth-death processes in continuous and discrete time that model the distribution of trees in plant ecology. These models differ from traditional approaches that use cardinality-based birth and death rates: they account for spatial structure by considering the distances between individuals within a specified interaction distance.

For these models, we propose two methods for estimating the demographic rates: a parametric and a non-parametric approach. In the former, we derive a likelihood function and obtain inferences for the parameters through maximum likelihood estimation (MLE), which guarantees the efficiency and asymptotic properties of these estimators. In the latter, we use a kernel intensity estimator to estimate the birth and death intensities. The non-parametric method offers flexibility and robustness across various ecological contexts due to its independence from any specific parametric form. The reliability and consistency of these intensity estimators are evaluated in both continuous-time and discrete-time settings, allowing them to be applied in a wide range of settings and bringing them closer to real-world scenarios.
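
A minimal sketch of a non-parametric kernel intensity estimate of this kind is given below, with a Gaussian kernel and illustrative bandwidth, grid and event locations rather than the study’s actual estimator settings or data.

```python
# Gaussian kernel estimate of a spatial birth intensity from observed event locations.
import numpy as np

def kernel_intensity(events: np.ndarray, grid_x: np.ndarray, grid_y: np.ndarray,
                     bandwidth: float = 2.0) -> np.ndarray:
    """Smoothed intensity (expected events per unit area) evaluated on a regular grid."""
    gx, gy = np.meshgrid(grid_x, grid_y)
    intensity = np.zeros_like(gx, dtype=float)
    for x, y in events:
        d2 = (gx - x) ** 2 + (gy - y) ** 2
        intensity += np.exp(-d2 / (2 * bandwidth**2)) / (2 * np.pi * bandwidth**2)
    return intensity

# Illustrative usage with random locations standing in for observed tree births.
events = np.random.default_rng(4).uniform(0, 50, size=(100, 2))
grid = np.linspace(0, 50, 101)
birth_intensity = kernel_intensity(events, grid, grid)
```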

We intend to rigorously validate these models and methodologies through an extensive simulation study and apply them to real-world plant ecology data. Through this study, we offer a detailed framework for spatial birth-death process analysis with the power to connect dynamic point process modelling and practical biodiversity management. Our work highlights the benefits of dynamic point process modelling and provides insights and tools that help ecologists and conservationists better manage and understand biodiversity and population dynamics under conditions of environmental change.

Investigating Students Performances in Mathematics through PISA using Machine Learning

Nur Insani

PISA (Programme for International Student Assessment) is an international assessment conducted every three years by the Organisation for Economic Co-operation and Development (OECD). It evaluates the abilities of 15-year-old students, focusing on mathematics, with creative thinking as the innovative area of assessment. The assessments explore how well students can apply their knowledge and skills to real-world problems and aim to provide insights into the effectiveness of education systems worldwide and to inform educational policy and practice. Australia participated for the first time in PISA in 2000. This ongoing research seeks to investigate the relationships between student performance in Australia and various contextual factors, including student background, school and learning environments, and the broader education system, using machine learning methodologies. The findings are then compared with those from other education systems to enhance educational and student performance outcomes.

Simulation of conservation management strategies

Habtu Kiros Nigus

In efforts to prevent endangered species extinction, supplementation of individuals from insurance populations into endangered populations has been practised to restore genetic diversity and reduce inbreeding depression, thereby increasing the survival probability of endangered populations. However, there are concerns that, in the long term, it can instead decrease fitness and potentially increase extinction risk, hindering its wider application. Moreover, its success is more controversial if the recipient population is threatened by infectious disease and is evolving against it through beneficial mutations. Here, we use simulations under a non-Wright-Fisher model to test whether the supplemented individuals improve the population’s long-term viability or exacerbate the situation by failing to adapt to the disease and decreasing the average fitness of the wild population.

Session 3.4: Biostatistics

Chair: Hanna Choi

Continuum of Maternal Healthcare Services in Low-and Middle-Income Countries: A Multi-Level Analysis

Abdul Baten

Background: In 2020, there was one maternal death every two minutes, totalling about 800 daily, with 95% occurring in low- and middle-income countries (LMICs). Effective and comprehensive utilization of maternal healthcare services (MHS) can reduce maternal mortality and morbidity. The care includes at least four antenatal care visits by skilled providers (ANC4+SP), skilled birth attendance (SBA), and postnatal care (PNC) across prepartum, intrapartum and postpartum periods.

Objectives: To estimate the prevalence of the continuum of MHS and the magnitude of differences across LMICs and to investigate the determinants of the continuum of MHS in LMICs.

Methods: This study analyzed data from Demographic and Health Surveys of 35 LMICs on 246,272 women aged 15-49 with recent live birth five years preceding the surveys, using a multilevel model by considering individual and country-level factors.

Results: Despite ANC4+SP coverage of 55.8%, only 47.0% of mothers received SBA assistance at delivery after receiving ANC4+SP, and 36.69% received all three components of care (ANC4+SP, SBA, PNC), with high variation across countries. Women who received ANC4+SP were 2.8 times (AOR=2.83; 95% CI=2.76–2.90) more likely to utilize SBA and women who utilized both ANC4+SP and SBA were 1.6 (AOR=1.57; 95% CI=1.53–1.61) and 19.2 (AOR=19.20; 95% CI=18.61–19.80) times more likely, respectively, to utilize PNC compared to their counterparts. The continuum of MHS was significantly associated with factors such as education, wealth, residence, distance to health facilities, employment, decision-making power, and media exposure.

Conclusion: A multi-sectoral approach integrating health and non-health initiatives, including targeted interventions for women with little or no education, unemployed women, and women in rural areas, is essential. Increasing the density of health providers and adapting successful interventions from other countries will improve MHS availability and accessibility, enhancing completion rates and achieving related SDGs by 2030.

Interpret the estimand framework from a causal inference perspective

Jinghong Zeng

The estimand framework proposed by ICH in 2017 has brought fundamental changes in the pharmaceutical industry. It clearly describes how a treatment effect in a clinical question should be precisely defined and estimated, through attributes including treatments, endpoints and intercurrent events. However, ideas around the estimand framework are commonly expressed in text, and different interpretations of this framework may exist. This article aims to interpret the estimand framework through its underlying theory, the causal inference framework based on potential outcomes. The statistical origin and formula of an estimand are given through the causal inference framework, with all attributes translated into statistical terms. How the five strategies proposed by ICH to analyze intercurrent events are incorporated into the statistical formula of an estimand is described, and a new strategy to analyze intercurrent events is also suggested. The roles of target populations and analysis sets in the estimand framework are compared and discussed based on the statistical formula of an estimand. This article recommends continued study of the causal inference theories behind the estimand framework and improving the framework with greater methodological comprehensibility and availability.
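
To make the potential-outcomes reading concrete, a standard notation (illustrative only, not necessarily the article’s own formulation) writes estimands under a few of the ICH strategies as follows, with Y(a) the potential outcome under treatment a and E(a) an indicator of the intercurrent event under treatment a:

```latex
% Illustrative notation only; the article's own formulation may differ.
\begin{align*}
  \theta_{\text{treatment policy}}  &= \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] \\
  \theta_{\text{hypothetical}}      &= \mathbb{E}[Y(1, E(1)=0)] - \mathbb{E}[Y(0, E(0)=0)] \\
  \theta_{\text{principal stratum}} &= \mathbb{E}[Y(1) - Y(0) \mid E(1) = E(0) = 0]
\end{align*}
```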

Sample size calculations for partially clustered clinical trials

Kylie Lange

Background: Partially clustered clinical trials are defined as trials where some observations belong to a cluster and others are independent. For example, neonatal trials may include infants from single or multiple births, and trials in orthopaedics may enrol patients with one or more joints affected by disease. The clustering in partially clustered trials should be accounted for when determining the target sample size to avoid being over- or under-powered. However, sample size methods have been developed for only a limited set of partially clustered trial designs, including when clusters have a maximum size of 2. In this research, we present design effects for partially clustered trials with larger cluster sizes, and demonstrate how to use these to determine the target sample size for a variety of partially clustered trial designs.

Methods: Design effects based on generalised estimating equations with either an independence or exchangeable working correlation structure were derived algebraically for continuous and binary outcomes. We considered both cluster and individual randomisation for the clustered observations. The algebraic design effects were validated via simulation.

Results: Design effects depend on the intracluster correlation coefficient (ICC), the proportion of observations that belong to clusters of each cluster size, the method of randomisation, type of outcome, and working correlation structure. The simulation study validated the design effects in nearly all settings, with some over-estimation of the design effects for binary outcomes when individual randomisation was used and there was a high ICC.

Discussion: The presented design effects will be useful for determining the target sample size for partially clustered trials. They depend on parameters that can be feasibly estimated when planning a trial, and will ensure that such trials are appropriately powered. An online application is currently in development to facilitate sample size calculations for triallists.
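
As a back-of-the-envelope illustration only, the classical design effect 1 + (m − 1) × ICC can be averaged over the observation-level cluster-size distribution to inflate a sample size computed under independence; the design effects derived in this work for partially clustered GEE analyses depend on additional factors (randomisation method, outcome type, working correlation) and will generally differ from this heuristic.

```python
# Heuristic only: classical design effect averaged over the cluster-size distribution.
import math

def naive_design_effect(size_props: dict, icc: float) -> float:
    """size_props maps cluster size m to the proportion of observations in
    clusters of that size (independent observations have m = 1)."""
    return sum(p * (1 + (m - 1) * icc) for m, p in size_props.items())

def inflated_n(n_independent: int, size_props: dict, icc: float) -> int:
    """Inflate a sample size computed under an assumption of independence."""
    return math.ceil(n_independent * naive_design_effect(size_props, icc))

# Example: 60% singletons, 30% of observations in pairs, 10% in triplets, ICC = 0.1.
print(naive_design_effect({1: 0.6, 2: 0.3, 3: 0.1}, icc=0.1))  # 1.05
print(inflated_n(300, {1: 0.6, 2: 0.3, 3: 0.1}, icc=0.1))      # 315
```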

Species distributions models for projecting impacts of climate change and implications of management for a data poor coastal species, barred surfperch (Amphistichus argenteus)

Michelle Marraffini

Understanding how coastal species and their habitats respond to future climates is critical to developing adaptive management frameworks that can effectively respond to climate change impacts. Sandy beach and surf zone ecosystems are on the front lines of climate change and already face significant threats from coastal squeeze, greater storm frequency and intensity, and increasing ocean temperatures. Sandy beaches and surf zones make up over 30% of ice-free shorelines worldwide, yet these ecosystems are often understudied, particularly in terms of conservation and management. We address this by examining the habitat distributions of a key species of surf zone fish, barred surfperch (Amphistichus argenteus). Surfperches (Family Embiotocidae) are viviparous, producing well-developed juveniles that do not disperse far from natal sites, making populations vulnerable to impacts of climate change. Barred surfperch feed largely on sandy beach invertebrates, a prey resource that will also experience significant impacts of climate-driven beach erosion, increased frequency of storm events, and temperature effects. This species is targeted by recreational shore anglers in California and plays an important role in the coastal food web, making barred surfperch a valuable indicator species. To describe the current range of barred surfperch we combined range-wide field surveys with publicly available citizen science observations using Bayesian models with integrated nested Laplace approximation. Integrating multiple data sources allows us to increase the geographic range of observations while accounting for different sampling methods with different assumed likelihoods. We then project how this range will shift with climate change and explore the role of marine protected areas in providing scope for this species to adapt to climate change. Cumulatively, our results for the current and future distributions of this iconic surf zone fish can inform the management and conservation of a data poor coastal fishery.

Estimating tumor-registry catchment area residency for the adjustment of solid cancer incidence in atomic bomb survivors

Hanna Lindner

The RERF Life Span Study (LSS) has tracked cancer incidence in over 120,000 survivors throughout their lifetimes since 1958. At the time of study entry, all participants were residents of their respective city’s cancer registry catchment area. However, not all participants remain within the catchment area over the duration of ongoing follow-up. Highly stratified person-year tables used to compute cancer incidence are thus adjusted by probabilities of residing in the catchment area to avoid underestimating cancer incidence. Migration is not explicitly tracked in the LSS, so surrogate markers are used to ascertain subject-specific migration histories in a subset of the LSS. Time-to-event data (in- or out-migration from the catchment area) were used to estimate smooth hazard functions for in- and out-migration by city, sex, attained age, and calendar year. The hazards were then applied iteratively as a function of small time increments to ascertain smooth probability functions by the same covariates.
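
The iterative step described above can be sketched as follows, converting smooth in- and out-migration hazards into a residence probability over small time increments; the hazard functions and time range shown are hypothetical placeholders, not the estimated LSS hazards.

```python
# Iterate small time steps to turn in/out-migration hazards into a residence probability.
import numpy as np

def residence_probability(h_out, h_in, t0=1958.0, t1=2010.0, dt=0.01, p0=1.0):
    """P(resident in catchment area at time t), starting from p0 at study entry."""
    times = np.arange(t0, t1, dt)
    p = np.empty_like(times)
    p[0] = p0
    for i in range(1, len(times)):
        t = times[i - 1]
        out_rate, in_rate = h_out(t), h_in(t)
        # residents may leave the catchment area; non-residents may return
        p[i] = p[i - 1] * (1 - out_rate * dt) + (1 - p[i - 1]) * in_rate * dt
    return times, p

# Example with made-up constant hazards (per year).
times, p = residence_probability(h_out=lambda t: 0.02, h_in=lambda t: 0.01)
```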

Survivors exposed at young ages saw the lowest probabilities of residing in the catchment area over time, reaching a nadir in the mid-1980s. Overall, males had lower probabilities than females, and Nagasaki residents lower probabilities than Hiroshima residents. For Nagasaki males, probabilities were as low as 65% in 1985 for those less than 1 year old at the time of exposure. In contrast, the lowest residence probabilities for Hiroshima females were 78% for the same age-year combination.

A migration adjustment has been applied in all prior LSS cancer incidence studies. However, this analysis marks the first time residence probabilities were estimated as smooth functions of attained age and calendar period. Further, changing characteristics of the catchment areas were considered. Increased precision in residency probabilities allows for a better understanding of the effects of catchment area non-residency on estimates of cancer incidence in the LSS cohort.

Careers Session 3

Chair: Melissa Middleton

Statistical Consulting: Bridging the Gap Between Numbers and Advice

Joanne Potts

To be candid, I fell into statistical consulting back in 2012 by virtue of personal circumstances rather than an inherent ambition to run a business, but here we are, turning 12 years young in December and still going strong! Over the past decade I have found statistical consulting to be a highly rewarding profession. I have had the privilege of collaborating with some remarkable colleagues on a variety of interesting projects. I will take the opportunity in this presentation to highlight the enjoyable aspects of my experience as a statistical consultant, teaching professional development workshops (which I love!), and going on the odd field trip to far flung places like Barrow Island to catch burrowing bettongs and participating in detector dog training in Kosciuszko National Park.

In the spirit of honesty and transparency, I will also share the challenges I’ve encountered as well. By doing so, I hope to provide support and guidance to others who may be facing similar obstacles. These challenges include scope-creep, contending with messy datasets (and clients who aren’t exactly sure what they need), managing feelings of professional isolation, handling overwhelming workloads and addressing demanding and occasionally troublesome clients.