Abstract
Clinical trials frequently lack a single definitive endpoint that completely describes treatment efficacy. When a treatment affects a disease in many ways, several endpoints are needed to describe efficacy. A variety of statistical procedures yield a single p-value when a treatment affects several endpoints. This paper reviews several of them, including Hotelling's T² test, an approximate likelihood ratio test (Tang et al.), the weighted version of O'Brien's test, tests involving the maximum of several test statistics, and a test based on the average of the maximum of several endpoints (Wittes). I propose a risk score test whose rejection boundary corresponds to a contour of constant risk. Calculations and simulation studies compare the tests, with emphasis on the effect of non-standard alternatives and on identifying settings where some tests may lack clinical relevance.