There is a trend toward broiler production systems with higher welfare requirements that use slower growing broiler strains, apply a reduced stocking density, and provide environmental enrichment. Although these separate factors each contribute to increased broiler welfare, there is little information on their combined effect on broiler welfare under commercial conditions, and on the variation in welfare performance of flocks within production systems. The aim of this study was to compare the welfare performance and the between-flock variation in welfare of 3 Dutch commercial broiler production systems differing in welfare requirements: Conventional (C), Dutch Retail Broiler (DRB) and Better Life one star (BLS). We applied a welfare assessment method based on the Welfare Quality broiler assessment protocol, in which we used 5 animal-based welfare measures collected by slaughterhouses and hatcheries (mortality, footpad dermatitis, hock burn, breast irritation, scratches), and 3 resource- or management-based measures (stocking density, early feeding, environmental enrichment). Data were collected for at least 1889 flocks per production system over a 2-year period. To compare the different measures and to generate an overall flock welfare score, we calculated a score on a scale from 0 (bad) to 100 (good) for each measure based on expert opinion. The overall flock score was the sum of the scores of the different welfare measures. The results showed that with increasing welfare requirements, a higher total welfare score was found across production systems (BLS > DRB > C; P < 0.0001). Regarding individual measures, C generally had lower (worse) scores than BLS and DRB (P < 0.05), except for scratches, for which C had the highest (best) score (P < 0.001). Both the welfare measure scores and the total welfare score of flocks showed large variation within systems and overlap between systems, and the overlap was most pronounced when only the animal-based measures were included in the total flock score.
When only the animal-based measures were included, total flock scores ranged from 112.1 to 488.3 for C, 113.0 to 486.9 for DRB, and 151.3 to 490.0 for BLS (on a scale from 0 [bad] to 500 [good]), with median values of 330.8 for C, 370.9 for DRB, and 396.1 for BLS. This indicates that factors such as farm management and day-old chick quality can have a major effect on the welfare performance of a flock and that there is room for welfare improvement in all production systems.
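
The aggregation described above (each measure scored from 0 to 100, flock score equal to the sum of the measure scores) can be illustrated with a minimal Python sketch. The expert-opinion functions that convert raw observations into 0 to 100 scores are not given here, so the sketch starts from per-measure scores that are assumed to be already computed. The function name `total_flock_score`, the dictionary keys, and all numeric values are hypothetical and serve only to show the arithmetic; they are not data from the study.

```python
# Measure names follow the lists given in the text; the identifiers are illustrative.
ANIMAL_BASED = (
    "mortality",
    "footpad_dermatitis",
    "hock_burn",
    "breast_irritation",
    "scratches",
)
RESOURCE_BASED = (
    "stocking_density",
    "early_feeding",
    "environmental_enrichment",
)


def total_flock_score(scores, measures):
    """Sum the 0-100 scores of the selected measures for one flock."""
    total = 0.0
    for name in measures:
        value = scores[name]
        # Each measure score is defined on a scale from 0 (bad) to 100 (good).
        if not 0 <= value <= 100:
            raise ValueError(f"score for {name!r} is outside 0-100: {value}")
        total += value
    return total


# Hypothetical flock: these scores are made up for illustration only.
example_flock = {
    "mortality": 80,
    "footpad_dermatitis": 60,
    "hock_burn": 70,
    "breast_irritation": 90,
    "scratches": 50,
    "stocking_density": 40,
    "early_feeding": 100,
    "environmental_enrichment": 0,
}

# Animal-based total lies on a 0-500 scale (5 measures x 100).
animal_total = total_flock_score(example_flock, ANIMAL_BASED)

# Overall total lies on a 0-800 scale (8 measures x 100).
overall_total = total_flock_score(example_flock, ANIMAL_BASED + RESOURCE_BASED)

print(animal_total, overall_total)
```

With 5 animal-based measures the animal-based total has a maximum of 500, which matches the 0 to 500 scale reported for the flock score ranges; including the 3 resource- or management-based measures raises the maximum to 800.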