EVELYN is a central place to track experiments, starting at the idea stage. It stands for “Experiment Velocity Engine, Lifting Your Numbers.”
As used on the Experimentation + Testing Deep Dive course at Reforge.
There are 4 tables:
For each team running experiments, we recommend having at least 2 views:
EID: Unique ID, autonumber. Good for identifying experiments across specs and tickets.
Name: Name or short description. Text. Name or brief 5-20 word description of experiment that is easily understandable to outside teams, e.g. “Upsell on Downgrades.”
Description: Thorough description, to avoid opening spec doc for context.
Spec: URL Link to spec for experiment.
Team: Team owning experiment. Select from <list of teams>.
Status: Status in process.
Sizing: Data/info needed to size; the textual description of data we should look at to see if this is a valuable idea, e.g., views of starting experience, current conversion rate, potential users affected.
Blocked: Checkbox to flag if an experiment is blocked. “Notes” should contain reason.
Notes: Text notes about experiment. Notes should be used to provide a quick update as to where the project stands, or if there are complications, e.g., "Experiment blocked due to Admin Console refactor."
Tags: Free-form tags for projects. Field to support categorizations of projects that don’t quite merit their own column, esp. temporary-but-important classifications like “Taurus Blocker."
Suggested By: Collaborator field; person suggesting idea.
For sizing ideas and selecting which to build.
Action to take:
Metric: Metric this experiment will affect. Linked-to-Table: <Metrics>. Metric with name and units, e.g. “ARR ($M)” or “Activation (% abs)”. This determines the units of the Oppt Size and Metric Result columns.
T-Shirt: Score: Computed T-shirt impact score; broadly, the reward-versus-effort tradeoff for this idea.
Formula: {T-Shirt#: Impact} - {T-Shirt#: Eng Cost} - {T-Shirt#: Design Cost} + {T-Shirt#: Adjustment}.
T-Shirt: Impact: T-shirt impact estimate, i.e. impact this experiment could have, broadly. Select: Low, Medium, High (maps to 1,5,10).
T-Shirt: Eng Cost: T-shirt estimate of eng cost. Select: None, Cheap, Medium, Expensive (maps to 1,2,3,4).
T-Shirt: Design Cost: T-shirt estimate of design cost. Select: None, Cheap, Medium, Expensive (maps to 1,2,3,4).
T-Shirt: Adjustment: T-shirt adjustment for other factors. Note: not required for T-Shirt Score to be calculated. Optional select: Very Hard, Hard, Easy, Very Easy (-2, -1, +1, +2). Provides easy way to adjust for a difficult surface or other factors. E.g. an experiment on a busy surface might be flagged as "Hard" to lower its T-Shirt: Score.
(Hidden fields): Note: We have "T-Shirt #: X" fields for each of the T-Shirt fields to make them numeric.
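A minimal sketch of how the T-Shirt Score formula combines the numeric mappings listed above (the function and dictionary names here are illustrative, not EVELYN fields):

```python
# Numeric mappings for the T-Shirt select fields, as listed above.
IMPACT = {"Low": 1, "Medium": 5, "High": 10}
COST = {"None": 1, "Cheap": 2, "Medium": 3, "Expensive": 4}
ADJUSTMENT = {"Very Hard": -2, "Hard": -1, "Easy": 1, "Very Easy": 2}

def t_shirt_score(impact, eng_cost, design_cost, adjustment=None):
    """Impact - Eng Cost - Design Cost + Adjustment (Adjustment is optional)."""
    score = IMPACT[impact] - COST[eng_cost] - COST[design_cost]
    if adjustment is not None:
        score += ADJUSTMENT[adjustment]
    return score

# Example: high-impact, cheap eng, no design work, on a tricky surface.
t_shirt_score("High", "Cheap", "None", "Hard")  # 10 - 2 - 1 - 1 = 6
```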
Action to take:
Oppt Size: Estimated Metric win if successful. Amount we’d move the Metric if this experiment works. Units are on the “Metric” column value.
Confidence: How confident we are that this experiment will win, expressed as a percent chance of success. In general, if you run many 50%-confidence experiments, you would expect around half of them to be successful.
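The calibration idea behind Confidence can be made concrete: the expected number of wins across a set of experiments is the sum of their confidences. A minimal sketch with a hypothetical portfolio:

```python
# Confidences as fractions; expected wins = sum of success probabilities.
confidences = [0.5, 0.5, 0.5, 0.5]  # four 50%-confidence experiments
expected_wins = sum(confidences)
print(expected_wins)  # 2.0 -- about half of them should win
```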
Project: Sub-team objective or project, if any. Flexible field so teams can specify a project or team objective, e.g. Foundational, Basic signup suggestion (on Intent), etc. Non-experiments should have this field set to “Foundational.”
Eng Estimate: Estimated weeks for engineering to build. Fractional weeks of fully-devoted eng time, e.g. “1”, “0.5”, or “4.5”.
Design Estimate: Estimated weeks for design to design. Fractional weeks of fully-devoted design time, e.g. “1”, “0.5”, or “4.5”.
Oppt ROI Estimate: Expected result per week of effort. Formula: {Oppt Size}*Confidence/({Eng Estimate}+{Design Estimate}).
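As a sketch, the ROI formula above computes the expected metric win per week of total effort (the function name and example values are illustrative):

```python
def oppt_roi(oppt_size, confidence, eng_estimate, design_estimate):
    """Oppt Size * Confidence / (Eng Estimate + Design Estimate)."""
    return oppt_size * confidence / (eng_estimate + design_estimate)

# E.g. a 2.0 ($M ARR) opportunity at 50% confidence, 3 eng + 1 design weeks:
oppt_roi(2.0, 0.5, 3, 1)  # 0.25 ($M per week of effort)
```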
Owner: Owner, e.g. person running the experiment, any function.
Designer: Designer for this experiment.
Engineer: Engineer building and/or owning.
Rule: Rule name for this experiment. Unique text string of the rule this experiment is keyed off of.
Surfaces: Surfaces your experiment or feature will affect. [Linked Record with Surfaces]. Either part of the site, feature, flow or url. Please be specific and exhaustive as we will use this to avoid conflicts and do historical analysis. Example values: mobile signup, referral onboarding, workspace upgrade, etc.
Populations: Populations this experiment is targeted at. Linked Record with Populations table, e.g. “Basic, EN.”
Designs: Designs associated with experiment. Image files (any format) for this experiment. Could be screenshots, mocks, wireframes, etc. Must be image file and directly uploaded, not url or InVision, etc.
Start: Date the experiment is planned to start running. Also see: "_Start: Running" at bottom
End: Date the experiment is planned to stop running.
Target GA Quarter: Quarter this is planned for GA, if any. Quarter and year, e.g. “Q2 2018.”
Building + Running
Actions to take:
Sprint: Sprint we plan to tile this against. Teams may use this to note the dash they expect this to land in or want it tiled in. This can be used to filter tasks for tiling, and to measure our team’s eng velocity. We don’t include the year, since most year-old experiments will already be filtered out (their Status will be Complete).
Sprint Commits: Engineering, product, and design goals and grades in a sprint/dash for a specific project. Teams may use this to scope and grade sprint commitments, with the priority first and then the goal, e.g. “P1: Build.” At the end of a sprint/dash, we can add grades in the form of an emoji, for a clean look: “P1: Build :)”
Analyzing / Concluding
Actions to take in this view:
Metric Result: How much we moved the Metric. Expressed as a number with 3 decimal places, e.g. “1.000” or “0.030”. Units are in the Metric column. For example, if the Metric were “Activation (% abs)” and the Metric Result was “5.000”, the experiment resulted in a 5% absolute gain.
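A tiny sketch of the three-decimal convention for Metric Result (the variable name is illustrative):

```python
# Metric Result is recorded with 3 decimal places, in the Metric's units.
metric_result = 5.0  # Activation (% abs): a 5% absolute gain
print(f"{metric_result:.3f}")  # prints "5.000"
```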
Learnings: Learnings we gained from running this experiment. Textual learnings, such as: “Telling users to Buy Now successfully got them to purchase immediately.” Could include link to Result paper doc.
GA’d Date: Date on which we GA’d this experiment. Date specifying when we did (not “will”) GA the experience to a set of users. This could be either setting the Stormcrow rule to 100% new variant or releasing the updated code. Ideally, this GA date is the date we started realizing the full Metric Result.
Cleaned Up: Whether this experiment is cleaned up. Checkbox: True - if cleaned up, else False/blank.
Complete (Wins/Losses)
Supports viewing past experiments, sipping a glass in quiet reflection.
Hidden Fields
Fields in the background, keeping the beat, like a bass guitar.