With the growing amount of historical infrastructure data available to engineers, data-driven techniques have been increasingly employed to forecast infrastructure performance. In addition to algorithm selection, data preprocessing strategies for machine learning implementations plays an equally important role in ensuring accuracy and reliability. The present study focuses on pavement infrastructure and identifies four categories of strategies to preprocess data for training machine-learning-based forecasting models. The Long-Term Pavement Performance (LTPP) dataset is employed to benchmark these categories. Employing random forest as the machine learning algorithm, the comparative study examines the impact of data preprocessing strategies, the volume of historical data, and forecast horizon on the accuracy and reliability of performance forecasts. The strengths and limitations of each implementation strategy are summarized. Multiple pavement performance indicators are also analysed to assess the generalizability of the findings. Based on the results, several findings and recommendations are provided for short-to medium-term infrastructure management and decision-making: (i) in data-scarce scenarios, strategies that incorporate both explanatory variables and historical performance data provides better accuracy and reliability, (ii) to achieve accurate forecasts, the volume of historical data should at least span a time duration comparable to the intended forecast horizon, and (iii) for International Roughness Index and transverse crack length, a forecast horizon up to 5 years is generally achievable, but forecasts beyond a three-year horizon are not recommended for longitudinal crack length. These quantitative guidelines ultimately support more effective and reliable application of data-driven techniques in infrastructure performance forecasting.