Deconstructing Performance Reviews

I’ve been noodling on writing this post for quite a while. Mostly because “performance” is such a big topic to unpack, and partly because I’m still working on my own holistic answer. But it’s time to remind myself that “perfect is the enemy of good” and there’s value in trying to summarize my point of view on this topic even if it’s still a work-in-progress.

A good starting point for this conversation is getting on the same page on how we got to where we are today. The Performance Management Revolution provides a great recap of the evolution of performance reviews and the environmental changes that drove it over the last 80(!) years.

One of the key challenges in the debate around performance reviews is that in its 80 years of history the term has gone through what Martin Fowler refers to as “semantic diffusion”:

Semantic diffusion occurs when you have a word that is coined by a person or group, often with a pretty good definition, but then gets spread through the wider community in a way that weakens that definition. This weakening risks losing the definition entirely — and with it any usefulness to the term.

Today there is no clear definition of the attributes that turn a particular set of conversations into a “performance review”. Many articles use this term without providing their own definition, assuming that we’re all talking about the same thing.

If we were to boil down performance reviews to a shared core definition, it would be something along the lines of: “a program to facilitate periodic feedback conversations”. Hardly anybody objects to the argument that having periodic feedback conversations is a valuable organizational practice. Most of the criticism that calls for “abolishing the performance review” tends to criticize specific program design elements which they consider to be part of the core definition of what a performance review is. 

It is a fair statement that most performance review programs are designed in ways that ignore some human aspects of the interaction, and especially lessons from the last 30 years of research in the fields of psychology, sociology and neuroscience. This in turn leads many of them to have unintended or sub-optimal results. But this also makes it clear, in my mind at least, that the solution here is to integrate those lessons into the program design rather than get rid of the program altogether.

Humanistic performance review principles

With that in mind, I’d like to highlight some of the interim insights that should be taken into account when designing such programs:

  1. Reduce functional overloading — Many programs today suffer from “functional overloading”: we’re trying to do too many things with the same program and end up in a “jack of all trades, master of none” situation, since a program element optimized for one need often causes harm to another. For example, using a performance program to generate documentation of poor performance to minimize legal risk in performance-based terminations will likely limit the effectiveness of the developmental feedback it provides. Deconstructing monolithic performance programs and decoupling the components that serve different organizational needs is a good first step towards addressing that challenge.
  2. More frequent, but not too frequent — we are all prone to “recency bias”. Our memory is far from perfect, and we tend to overweight the importance of things that have happened more recently when formulating our judgment. This suggests that an annual review cycle is probably too long. But the solution is not real-time/continuous feedback either. There is a lower bound on the cycle length, since we need to give the changes we’ve made in the last cycle enough time to impact the outcomes. Otherwise, we’re just introducing thrash. Furthermore, giving good feedback typically requires a period of reflection and thoughtful composition, which we would not be able to do if we were to give it “on-the-fly”.
  3. Minimize subjectivity — while subjectivity cannot be eliminated altogether, it can certainly be reduced: on the receiving end, by accounting for the overconfidence effect; and on the evaluating end, by reducing the idiosyncratic rater effect.
  4. Avoid rating on a “bell curve” — since human performance does not seem to follow a bell curve.
  5. Reduce status threat — a threat to status triggers our fight-or-flight response and diminishes our ability to truly listen and learn. Both rating evaluation and change (or lack thereof) to compensation naturally trigger a threat to status. Separating evaluatory feedback from coaching feedback will increase the efficacy of the latter.
  6. Forward-looking — rather than focus on what happened in the past, the conversation should focus on what should be sustained or changed going forward.
  7. Maximize credibility — the credibility of the person providing feedback affects our motivation to act on it. Three key levers can be addressed structurally in the program design: a) a healthy mix of sustain (“positive”) and change (“negative”) feedback; b) structuring the feedback in a way that separates facts from interpretations; c) making a request for change while taking responsibility for one’s own interpretations.

A couple of harder questions

The design principles listed above will go a long way in helping to design more effective performance review programs. However, while far from trivial or easy, they are not the hardest part of this challenge.

To use Ronald Heifetz’s distinction, I believe they capture many of the “technical” aspects of the challenge, but it’s the “adaptive” ones — the ones that have to do with the values underlying the system — that are the most difficult to address. And those will change from organization to organization.

A couple of harder, more adaptive questions come to mind:

  1. How do we define “performance”? A different definition will lead to different forms of measurement and evaluation: Do we take into account efforts, or just results? Do we account for factors outside of our control that influenced the results? How do we deal with the relationship between individual performance and group performance? What about investments that haven’t yielded results just yet? How do we account for intangibles?
  2. What is the role of power in the evaluation of performance? In High Output Management, Andy Grove offers the following:

The review process also represents the most formal type of institutionalized leadership. It is the only time a manager is mandated to act as judge and jury: we managers are required by the organization that employs us to make a judgment regarding a fellow worker, and then to deliver that judgment to him face-to-face.

“This is what I, as your boss, am instructing you to do. I understand that you do not see it my way. You may be right or I may be right. But I am not only empowered, I am required by the organization for which we both work to give you instructions, and this is what I want you to do…”

Some organizations may agree with this definition. Some may not. Their performance review programs will be fundamentally different as a result…


Just a Dream [Ruiz]

Following Darren’s advice, a few months ago I read Don Miguel Ruiz’s The Four Agreements.

It’s a short book that I’d highly recommend, but it wasn’t an easy read for me. The highly spiritual context in which it is set constantly conflicted with my extremely rational worldview, and required a very deliberate process of parsing out the highly insightful pieces instead of just writing it all off as “spiritual mumbo-jumbo”. The effort was well worth it.

In a nutshell, the four agreements are:

  1. Be impeccable with your word
  2. Don’t take anything personally
  3. Don’t make assumptions
  4. Always do your best

Certainly good ideals to aspire to even though they’d always be just a little out-of-reach. But the thing that stuck with me most from the book was not the agreements themselves, but an underlying metaphor that Ruiz constantly uses. The metaphor of the Dream:

And he came to the conclusion that human perception is merely light perceiving light. He also saw that matter is a mirror — everything is a mirror that reflects light and creates images of that light — and the world of illusion, the Dream, is just like smoke which doesn’t allow us to see what we really are.

He had discovered that he was a mirror for the rest of the people, a mirror in which he could see himself. “Everyone is a mirror,” he said. He saw himself in everyone, but nobody saw him as themselves. And he realized that everyone was dreaming, but without awareness, without knowing what they really are. They couldn’t see him as themselves because there was a wall of fog or smoke between the mirrors. And that wall of fog was made by the interpretation of images of light — the Dream of humans.

The Dream metaphor is a beautiful one-word summary of the fact that we all experience reality in our own subjective way, and our actions are the result of the way we subjectively interpret that reality.

Not taking things personally is still a big area of growth for me. Here’s how the Dream metaphor can help in that context:

Nothing other people do is because of you. It is because of themselves. All people live in their own dream, in their own mind; they are in a completely different world from the one we live in. When we take something personally, we make the assumption that they know what is in our world, and we try to impose our world on their world.

When you take things personally, then you feel offended, and your reaction is to defend your beliefs and create conflicts. You make something big out of something so little, because you have the need to be right and make everybody else wrong. You also try hard to be right by giving them your own opinions. In the same way, whatever you feel and do is just a projection of your own personal dream, a reflection of your own agreements. What you say, what you do, and the opinions you have are according to the agreements you have made — and these opinions have nothing to do with me.

I found the Dream to be a powerful mnemonic that helps me catch myself when I impulsively take things too personally, defuse from that perception, and create the capacity to take more deliberate action instead.


Want to improve recruiting? Start by learning from 100 years of research [Schmidt]

I first came across Frank Schmidt’s work while reading Laszlo Bock’s “Work Rules!” [my book review] a couple of years back.

Bock references a 1998 paper written by Schmidt and Hunter as the scientific backing for Google’s interview practices, specifically the use of “work sample tests” and “structured interviews”.

In 2016 Schmidt wrote an updated paper integrating data from 20 additional years of research and improved analysis methods:

The Validity and Utility of Selection Methods in Personnel Psychology: Practical and Theoretical Implications of 100 Years of Research Findings

The clear “winner” in its ability to predict job performance on a standalone basis, according to Schmidt’s analysis, is “General Mental Ability” (GMA) tests, such as the O*NET Ability Profiler, the Slosson Intelligence Test and the Wonderlic Cognitive Ability Test. These are on average able to predict 65% of a candidate’s job performance. This represents a 14% increase in their predictive ability compared to the ’98 data, unseating the “work-sample test” (’98–54%, ’16–33%). The average only tells part of the story, as more refined analyses suggest a significant difference in predictive ability depending on job type: 74% for professional and managerial jobs, and 39% for unskilled jobs.

Source: Schmidt (2016)

Interestingly, no organization I’ve ever worked for or heard of seems to be using GMA tests. One reason might be that the consistency and precision of the method, coupled with the large sample sizes, make it easier to prove that these tests introduce both gender and racial bias. This seems unfortunate, since none of the other evaluation methods are bias-free; their bias is just harder to measure. Being able to measure bias precisely allows us to correct for it: in the short term post-hoc, and in the long term through better test design.

Next up are employment interviews (58%), where “structured interviews” refer to interviews in which both the questions and the answer-evaluation criteria are consistent across candidates. The MSA and PSQ questions I discussed here are a good example of structured interview questions. The list goes down from there all the way to graphology and age, with little to no predictive power. And while structured and unstructured interviews don’t seem to differ much in predictive power, unstructured interviews are certainly more bias-prone.

Since GMA seems to be the best measure for making hiring decisions, Schmidt looks at all other measures relative to it, asking the following question:

When used in a properly weighted combination with a GMA measure, how much will each of these measures increase predictive validity for job performance over the .65 that can be obtained by using only GMA?

In this case, the focus shifts from looking solely at their standalone predictive ability and instead also taking into account their covariance with GMA (smaller covariance = better).
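The standard way to combine two predictors like this is the two-predictor multiple correlation formula, which rewards predictors that correlate with performance but not with each other. Here's a minimal sketch; the validities of .65 and .58 come from the paper as quoted above, but the intercorrelation value of .32 is a hypothetical number I chose for illustration, not one taken from Schmidt's tables:

```python
from math import sqrt

def combined_validity(r1, r2, r12):
    """Multiple correlation R of two predictors with a criterion,
    given their individual validities (r1, r2) and their
    intercorrelation with each other (r12)."""
    r_squared = (r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2)
    return sqrt(r_squared)

# GMA (.65) plus structured interview (.58); intercorrelation of .32
# is a hypothetical illustration.
print(round(combined_validity(0.65, 0.58, 0.32), 2))  # prints 0.76
```

Note how the incremental lift depends heavily on the intercorrelation: with the same two validities, an intercorrelation of .10 would push the combined validity above .80, while .50 would leave it barely above GMA alone. That is why a weaker standalone predictor (like an integrity test) can be the better complement to GMA.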

The more extensive summary table is shown below but the bottom-line is this:

Overall, the two combinations with the highest multivariate validity and utility for predicting job performance were GMA plus an integrity test (mean validity of .78) and GMA plus a structured interview (mean validity of .76)

Source: Schmidt (2016)

While employment interviews maintain their position at the top of the list, integrity tests such as the Stanton Survey, Reid Report and PSI take the #1 spot. Again, not a tool commonly used today.

So where does all of this leave us? In my opinion, the pendulum in recruiting may have swung too far from the quantitative assessment pole to the qualitative one. It seems like we’d get much better outcomes from our recruiting efforts if GMA and integrity assessments replaced some of our structured interviews, while we work diligently to remove bias from our recruiting efforts, regardless of the assessment methods we use.


The Love/Power/Serenity Mental Model

This one has been in the making for quite some time. The more time I spent with this model the more I was able to add to it, but I think it’s ready for a “version 1”.

I first came across it when reading Mastering Leadership in mid-2016:

Read the book or check out the linked post if you’d like to drill deeper, but in a nutshell, the book lays out a highly compelling personal development roadmap for transitioning from a “reactive” state to a “creative” state.

Reactive behaviors are then grouped together into three main “types”: Love, Power and Serenity, or Heart, Will and Head.

We each tend to have a dominant reactive type, which usually solidifies in late childhood / early adolescence as a coping mechanism/strategy for the life challenges we faced during those years. Since we develop it at such a young age, we often view it as an integral part of our identity, and it becomes the source of both our greatest strengths and our greatest weaknesses. Transitioning from “reactive” to “creative” requires a subject-object shift in our relationship with our reactive type: being able to see it as something distinct from “who we are”, deliberately harnessing its strengths while mitigating its shadow side.

The model seems to be theoretically rooted in the work of a German psychoanalyst named Karen Horney. Specifically, her theory of neurosis in which she classified ten patterns of neurotic needs into three buckets which map fairly well to the ones above: moving towards people (compliance), moving against people (expansion, aggression), and moving away from people (detachment).

As you’ll see below, the same 3-type model seems to emerge in rather unpredictable places. But first, a few words of caution.

  1. Anything based on pre-1950 (and I’m being generous here) psychology should be approached with a highly critical eye. Psychological theories back then were not based on applying the scientific method with the level of rigor that we expect of such theories today. That being said, it does not mean that they do not have strong explanatory power, or that they are definitively false.
  2. Any model that’s trying to sum up complex human behavior using a simple framework is inherently inaccurate and incomplete. But again, it does not mean that it cannot have strong explanatory power.

The table below summarizes all the occurrences of the Love/Power/Serenity model I’ve encountered to date. It’s worth saying a few words on each of those:

  • Metaphor — the key metaphor describing each type.
  • Horney coping strategy — the coping strategy used by each type according to Horney’s theory of neurosis.
  • Making meaning through — the lens by which each type makes sense of the world around them; the perspective by which they look at the world.
  • Key need — the key need that drives each type’s behavior.
  • Key fear — the key fear that drives each type’s behavior.
  • Reaffirming group message — I came across this one while trying to structure internal communications that appeal to all audiences. This is the key message that each type needs to hear in order to buy into a group decision.
  • Trust orientation — the source of interpersonal trust that each type places the most weight on.
  • Bungay/Adair executive — the Bungay/Adair executive skill set framework that I captured a while back in “Decomposing Leadership — the Executive Trinity” maps rather well onto the three types.
  • Reactive tendencies — these are the “Mastering Leadership” reactive tendencies mapped to each type, presented through their gifts/strengths (rather than their weaknesses).
  • Gallup talents — the (in)famous Gallup talents or strengths were initially classified into three main themes that map quite neatly onto the Love/Power/Serenity types: Relating/Striving/Thinking. In later versions, “striving” was split into “striving” and “impacting”.

Structures and Mindsets

Granted, there’s probably some recency bias here, since my 2018 questions are still fresh in my mind, but I couldn’t help but read:

and not see the structures+mindsets pattern in the post.

I’d actually argue that Rau’s macro framing for the piece is rather misleading or incorrect. She classifies every organization that uses autocratic, majority or consensus-based decision making as authoritarian, oppressive and therefore inherently bad/evil. I’m not sure I’m quite there. Maybe it’s because I’m likely a “constitutionalist” (Theory Y+T) and view different people’s needs as somewhat conflicting/not purely harmonious, or because my default example of an authority relationship is Heifetz’s doctor-patient relationship and I don’t view it as oppressive or bad/evil in any way.

The good news is that Rau’s making some really good points that stand in their own right and can be completely decoupled from that framing.

Her post is a great case study demonstrating how an external organizational structure, in this case consent-based decision making, needs to be supported by a mindset shift that’s reflected through human behavior, in this case non-violent communication. My three favorite excerpts from her post really drive the point home:

“Lack of clarity creates open space for frustration and people’s projections, and they tend to fill them by projecting bad intentions onto others… feelings of frustration are a sign that it is time to put all needs on the table for mutual understanding and exploration. And that’s the point: only good communication can break the downward spiral of blame leading to lack of constructive information. Both sides are fully responsible and need to learn to: speak so their needs can be understood even when they are angry or insecure. listen so they understand others’ needs even when those people express their needs very silently or in a very loud way.

The mindset that supports trust is to shift from blame to curiosity. If you don’t understand why someone would choose a certain strategy, i.e. when you notice yourself thinking “Why would they do that?! That’s so stupid!”, you probably don’t have enough information. The only way to get that information is to ask. Just assuming that the other person is probably just trying to meet a need (without even having to know what need it is) already means you will be more open to taking in what is going on for that other person. Your genuine curiosity will be written on your forehead — if it is real. The message will be “ I assume you are a competent human being trying to meet your needs. Help me understand what your need is and then let’s talk about how the strategy is working for you — and me.”

You need to know what is going on for other people. If you don’t ask because you are afraid of witnessing people’s anger and frustration, you choose to ignore their reality. If you listen and have the courage to take in their anger, that’s the first step towards collective healing.

While only tangentially relevant to the topic of this post, the Marshall Rosenberg quote that Rau also mentions in her post is just too good not to share here as well, so I’ll wrap up with it:

“If you want to live in absolute hell, believe that you are responsible for what others feel” — Marshall Rosenberg


Frameworks merge

Going to keep this one short and sweet.

I was having a conversation with a friend about Ikigai the other week, and found myself reflecting on why this particular interpretation of it feels so complete.

The 4-part framework and the distinction between aspects that had to do with you and aspects that had to do with the world/others reminded me a lot of Ken Wilber’s four quadrants model, which I was introduced to three years ago in the context of culture.


Wilber argues that we can look at any aspect of the human experience through four perspectives which are the combination of an individual vs. collective perspective and an interior vs exterior perspective.

Interestingly, when I overlaid the Ikigai construct on top of Wilber’s 4-quadrants model, I got a near-perfect fit.

I believe this may partly be the reason for why this framework is so compelling.


The Likert scale is killing your developmental feedback


The Likert scale is a commonly used question structure in surveys, in which the survey taker is asked to rate their agreement/disagreement with a series of statements on a 5-point scale ranging from “Strongly Disagree” to “Strongly Agree.” For the purpose of this post, I’ll expand the definition a bit to also include:

  • The more “direct” version which asks for a 1-5 rating on a certain attribute/prompt
  • The more “indirect” version which replaces the wording with a “Focus on Least” to “Focus on Most” scale, or any other set of labels.
  • Any variation in the number of ratings: 7 is the second most common variation after 5, but some prefer an even number of ratings to force survey-takers to choose a non-neutral answer.

The Likert scale is also commonly used in feedback forms that are aimed at driving continuous improvement and development at the organizational and personal levels, such as employee engagement surveys and performance/effectiveness reviews.

To understand the challenge of using the Likert scale in that context we need to first make a distinction between three types of feedback, courtesy of the folks at Triad Consulting:

  • Praise (appreciation) – aimed at showing the receiver that you see their good work and connect, motivate and thank them for it.
  • Coaching (developmental) – aimed at helping the receiver expand their knowledge, sharpen a skill or improve a capability.
  • Evaluation (assessment) – rating or ranking against a standard aimed at helping the receiver align their expectations and know where they stand.

One of the fundamental elements of Quantum Mechanics is a principle known as Heisenberg’s Uncertainty Principle. In simple terms, it states that there’s a limit to the precision with which certain pairs of properties of a particle can be known (measured) at the same time. For example, the more precisely you measure a particle’s position, the less precisely you can know its momentum.

It’s been my experience that a similar relationship exists between developmental and assessment feedback. The more accurate the evaluatory feedback you give, the less effective it will be as developmental feedback. And the more effective the developmental feedback you give, the less accurately it can be used to assess someone’s current performance.

With these ideas in mind, we can now make our case: the Likert scale is an evaluatory construct since it produces an absolute rating which is then compared to a standard: historical ratings, peer ratings, or a threshold (above “3.5” is “good”). But if our goal is to learn, if our goal is to help people and organizations improve and develop – we may be hurting ourselves by using it.

So what’s the alternative? Good question. Properly phrased open-ended questions can be used effectively to provide developmental feedback, but they are costly to design and costly to answer. I suspect that some of the motivation for using the Likert scale structure in the first place was the fact that it’s very lightweight. A good alternative might be stack-ranking the attributes that we would have put on a Likert scale in the past, for example from the one we should focus on most to the one we should focus on least. This approach moves us from the absolute to the relative and therefore, while not perfect, avoids some of the common pitfalls of the old solution.
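To make the stack-ranking idea concrete, here's a minimal sketch of how ranked responses from multiple reviewers could be aggregated into a single relative ordering using a Borda count. The attribute names and the scoring scheme are my own illustration, not taken from any specific survey tool:

```python
from collections import defaultdict

# Each reviewer orders the attributes from "focus on most" (first)
# to "focus on least" (last).
rankings = [
    ["communication", "planning", "code review", "mentoring"],
    ["planning", "communication", "mentoring", "code review"],
    ["communication", "mentoring", "planning", "code review"],
]

def aggregate(rankings):
    """Borda count: an attribute in position p of an n-item ranking
    earns n - 1 - p points; higher total means higher focus priority."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, attribute in enumerate(ranking):
            scores[attribute] += n - 1 - position
    return sorted(scores, key=scores.get, reverse=True)

print(aggregate(rankings))
# prints ['communication', 'planning', 'mentoring', 'code review']
```

Note that the output is purely an ordering: there is no absolute rating to compare against a threshold or against last cycle's score, which is exactly the property that keeps it on the developmental rather than evaluatory side of the line.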
