We analyzed an ensemble of watershed models that predict flow, nitrogen, and phosphorus discharges. The models differed in scope and complexity and used different input data, but all had been applied to evaluate human impacts on discharges to the Patuxent River or to the Chesapeake Bay. We compared predictions to observations of average annual, annual time series, and monthly discharge leaving three basins. No model consistently matched observed discharges better than the others, and predictions differed by as much as 150% for every basin. Models that agreed best with the observations in one basin were often among the worst models for another material or basin. Combining model predictions into a model average improved overall reliability in matching observations, and the range of predictions helped describe uncertainty. The model average was not the closest to the observed discharge for every material, basin, and time frame, but it had the highest Nash-Sutcliffe performance across all combinations. Consistently poor performance in predicting phosphorus loads suggests that none of the models capture major controls. Differences among model predictions arose from differences in model structures, input data, and the time period considered, and also from errors in the observed discharge. Ensemble watershed modeling helped identify research needs and quantify the uncertainties that should be considered when using the models in management decisions.
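The model-average comparison described above can be illustrated with a minimal sketch. It assumes an unweighted mean across models and the standard Nash-Sutcliffe efficiency (one minus the ratio of squared prediction error to the variance of the observations about their mean); the abstract does not state how the average was weighted. The function names, model names, and discharge values are hypothetical and chosen only so that the two models err in opposite directions; they are not results from the study.

```python
from statistics import mean


def nash_sutcliffe(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 is a perfect match, values below 0 are
    worse than simply predicting the mean of the observations."""
    obs_mean = mean(observed)
    squared_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    obs_variation = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - squared_error / obs_variation


def model_average(predictions):
    """Unweighted mean across models at each time step (an assumption)."""
    return [mean(step) for step in zip(*predictions.values())]


# Hypothetical annual discharges, for illustration only.
observed = [10.0, 14.0, 9.0, 12.0]
predictions = {
    "model_a": [12.0, 17.0, 11.0, 15.0],  # consistently too high
    "model_b": [8.0, 11.0, 8.0, 10.0],    # consistently too low
}

ensemble = model_average(predictions)

# Score each model and the model average against the observations.
scores = {name: nash_sutcliffe(observed, p) for name, p in predictions.items()}
scores["model_average"] = nash_sutcliffe(observed, ensemble)

# Range of predictions at each time step, a simple descriptor of uncertainty.
spread = [max(step) - min(step) for step in zip(*predictions.values())]

for name, score in scores.items():
    print(f"{name}: NSE = {score:.3f}")
print("prediction range per year:", spread)
```

In this constructed case the opposing biases largely cancel, so the average scores higher than either model. That outcome depends on the errors being offsetting and is not guaranteed in general.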