Human Bias in Oversight Increases Algorithmic Bias

Date: October 23, 2024

Venue: Reshaping Work 2024 Conference: AI at Work

Location: Amsterdam, The Netherlands

Recommended citation: Dores Cruz, T. D., Starke, C., Katzke, T., Kwiatkowska, M., Köbis, N. C., Müller, E., & Shalvi, S. (2024, October 23). “Human Bias in Oversight Increases Algorithmic Bias.” Reshaping Work 2024 Conference: AI at Work, Amsterdam, The Netherlands.

Abstract:

Humans are often used to evaluate Artificial Intelligence products and to make them user-friendly and bias-free. Known as “human-in-the-loop”, people provide feedback, for example, on whether ChatGPT is using toxic language or whether Facebook’s algorithms effectively filter hate speech. Indeed, the inclusion of humans in the loop has helped create safer environments, improve recommendations, and correct bias emerging from training on biased data. While the benefits of having a human-in-the-loop are obvious, if humans are biased, their corrective input may introduce, rather than mitigate, bias in artificial intelligence (algorithms). Indeed, humans often provide feedback and advice that is influenced by their (political or other) motivations or vested interests. A likely setting in which humans might introduce, rather than mitigate, bias is when they are asked to provide input about politically sensitive topics. It is in these politicized domains that the most impactful algorithmic decisions are made. These politicized domains are also where there is the most potential for disagreement about how decisions should be made, and where a human-in-the-loop can have the most impact in swaying algorithms to align with their own political preferences and run counter to opposing preferences.

Consider, for example, the Austrian public employment service (AMS), which used AI to determine the extent to which costly benefits (e.g., training, counselling) are provided to unemployed jobseekers. These recommendations were found to be biased with respect to, among other categories, immigrant versus local candidates. The final decision about which category a candidate is assigned to, and thus the extent to which they receive benefits, is made by a public official, who acts as a human-in-the-loop. The algorithm is further trained on the basis of whether candidates across categories succeeded in finding employment, meaning that the final decisions of the humans in the loop shape the algorithm’s future recommendations.

Previous research indicates that people holding different political views differ starkly in how they believe immigrants versus locals should receive support. People with progressive, left-wing, or Democratic political views tend to be more open to benefits being directed towards immigrants. In contrast, people with conservative, right-wing, or Republican political views typically oppose directing benefits to immigrants and favor prioritizing locals. Therefore, humans-in-the-loop may (knowingly or not) be guided by their political views, calibrating the AI in a politically biased way when decisions concern immigrants and locals.

We conducted a pre-registered experiment in which we introduced a simplified version of the AMS algorithm, adapted to the USA. We tasked 1272 Democrats and 1272 Republicans with accepting or rejecting the recommendations of an algorithm concerning the approval of applications for benefits submitted by local (American) and immigrant (Mexican) job-seekers.

The algorithm provided a recommendation for each applicant based on a resume comprising four attributes scored from 1 to 5; applicants with an average score of 3 or above should be approved. In three between-subjects conditions, the algorithm’s recommendations followed one of three strategies: (1) meritocracy: considered only a candidate’s resume and ignored country of origin; (2) affirmative action: additionally considered country of origin to favor immigrants whose scores were just above or just below the threshold; and (3) America first: additionally considered country of origin to favor locals whose scores were just above or just below the threshold.
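To make the three strategies concrete, here is a minimal sketch in Python. The width of the “just above or just below the threshold” band and the exact favoring rule near it are not specified above, so the MARGIN value, the favoring logic, and all names (e.g., recommend) are illustrative assumptions rather than the algorithm used in the study.

```python
# Minimal sketch of the three recommendation strategies described above.
# MARGIN is an assumed width for the "near-threshold" band.

THRESHOLD = 3.0   # average resume score required for approval
MARGIN = 0.5      # assumed width of the near-threshold band

def recommend(resume_scores, origin, strategy):
    """Return 'approve' or 'reject' for one applicant.

    resume_scores: list of four attribute scores, each from 1 to 5
    origin: 'local' or 'immigrant'
    strategy: 'meritocracy', 'affirmative_action', or 'america_first'
    """
    avg = sum(resume_scores) / len(resume_scores)
    near_threshold = abs(avg - THRESHOLD) <= MARGIN

    # Meritocracy: only the resume matters, country of origin is ignored.
    decision = "approve" if avg >= THRESHOLD else "reject"

    if near_threshold and strategy == "affirmative_action":
        # Assumed favoring rule: near the threshold, immigrants are approved.
        decision = "approve" if origin == "immigrant" else "reject"
    elif near_threshold and strategy == "america_first":
        # Assumed favoring rule: near the threshold, locals are approved.
        decision = "approve" if origin == "local" else "reject"

    return decision

# Example: a near-threshold immigrant applicant under each strategy.
scores = [3, 3, 2, 3]  # average 2.75, just below the threshold
for strategy in ("meritocracy", "affirmative_action", "america_first"):
    print(strategy, recommend(scores, "immigrant", strategy))
```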

Participants observed each applicant’s resume, including the country of origin, knew the policy adopted by the AI (meritocracy vs. affirmative action vs. America first), and were instructed to approve or reject the algorithm’s recommendation, with their decisions feeding into the algorithm’s future recommendations. Their decisions determined whether a candidate received benefits or not (operationalized as bonus payments to token participants representing the demographic characteristics of the candidates).
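A single trial in this paradigm could be sketched roughly as below; all names (Trial, run_trial, ask_participant) are hypothetical placeholders, not the actual experiment software. The point of the sketch is that the participant’s accept/reject decision, not the algorithm’s recommendation, determines the candidate’s final outcome.

```python
# Illustrative sketch of one human-in-the-loop trial (hypothetical names).
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Trial:
    resume_scores: List[int]                    # four attribute scores, 1-5
    origin: str                                 # 'local' or 'immigrant'
    strategy: str                               # condition shown to the participant
    recommendation: str                         # 'approve' or 'reject' from the algorithm
    participant_decision: Optional[str] = None  # 'accept' or 'reject'
    outcome: Optional[str] = None               # final decision that determines the bonus

def run_trial(resume_scores, origin, strategy, recommendation,
              ask_participant: Callable[..., bool]) -> Trial:
    # The participant sees the resume, the country of origin, the policy
    # label, and the algorithm's recommendation, then accepts or rejects it.
    accepted = ask_participant(resume_scores, origin, strategy, recommendation)
    # The participant's decision is what counts: rejecting an 'approve'
    # recommendation denies the benefits, and vice versa.
    outcome = recommendation if accepted else (
        "reject" if recommendation == "approve" else "approve")
    return Trial(resume_scores, origin, strategy, recommendation,
                 "accept" if accepted else "reject", outcome)

# Example: a participant who always accepts the recommendation.
trial = run_trial([3, 3, 2, 3], "immigrant", "affirmative_action",
                  "approve", lambda *args: True)
print(trial.outcome)  # -> 'approve'
```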

Supporting our predictions, we found that recommendations for candidates with very high or very low resume scores were accepted far more often than recommendations for candidates just above or just below the threshold. Focusing on candidates near the threshold, we found that Democrats accepted recommendations to give benefits to immigrants more than Republicans did, while Republicans accepted recommendations to give benefits to locals more than Democrats did. Furthermore, decisions differed across the biased algorithms for Democrats and Republicans. For the affirmative action algorithm, Democrats accepted far more recommendations for benefits to immigrants than Republicans did; yet, like Republicans, Democrats accepted fewer benefits for immigrants than for locals. For the America first algorithm, Republicans accepted more recommendations for benefits to locals than Democrats did; here, Republicans accepted more benefits for locals than for immigrants, while Democrats accepted more benefits for immigrants than for locals.

Taken together, our study builds an experimental paradigm to gain behavioral insights into human-in-the-loop decision making. Using this paradigm, we show that human-in-the-loop decision-making setups might not always work to debias algorithms, because the individual differences of the humans in the loop shape the input they give to biased algorithms. Specifically, we found that Republicans could bias human-in-the-loop decisions to harm immigrants and benefit locals, while Democrats could bias them towards benefitting immigrants. We are currently working to implement the human feedback we collected as training for algorithms to assess the impact of the feedback on future recommendations.
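One way such feedback could be folded back into a model, as a rough sketch and not our actual training pipeline, is to treat each final human decision as a training label and refit a simple classifier (here, scikit-learn’s LogisticRegression over the four attribute scores plus an origin indicator).

```python
# Sketch: refit a simple model on the outcomes produced by the humans in the loop.
# This illustrates the general idea only; the study's pipeline may differ.
from sklearn.linear_model import LogisticRegression

def retrain_on_feedback(trials):
    """trials: iterable of dicts with 'resume_scores', 'origin', and 'outcome'."""
    X, y = [], []
    for t in trials:
        # Features: the four attribute scores plus a binary origin indicator.
        X.append(list(t["resume_scores"]) + [1 if t["origin"] == "immigrant" else 0])
        # Label: the final human-in-the-loop outcome ('approve' -> 1).
        y.append(1 if t["outcome"] == "approve" else 0)
    model = LogisticRegression()
    model.fit(X, y)
    return model
```

If human decisions near the threshold are politically biased, the origin feature in such a retrained model would absorb that bias and carry it into future recommendations, which is the feedback effect we aim to assess.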