EVELYN is a central place to track experiments, starting at the idea stage. It stands for “Experiment Velocity Engine, Lifting Your Numbers.”
As used on the Experimentation + Testing Deep Dive course at Reforge.
There are 4 tables:
For each team running experiments, we recommend having at least 2 views:
EID: Unique ID, autonumber. Good for identifying experiments across specs and tickets.
Name: Name or short description. Text. Name or brief 5-20 word description of experiment that is easily understandable to outside teams, e.g. “Upsell on Downgrades.”
Description: Thorough description, to avoid opening spec doc for context.
Spec: URL Link to spec for experiment.
Team: Team owning experiment. Select from <list of teams>.
Status: Status in process.
Sizing: Data/info needed to size; the textual description of data we should look at to see if this is a valuable idea, e.g., views of starting experience, current conversion rate, potential users affected.
Blocked: Checkbox to flag if an experiment is blocked. “Notes” should contain reason.
Notes: Text notes about experiment. Notes should be used to provide a quick update as to where the project stands, or if there are complications, e.g., "Experiment blocked due to Admin Console refactor."
Tags: Free-form tags for projects. Field to support categorizations of projects that don’t quite merit their own column, esp. temporary-but-important classifications like “Taurus Blocker."
Suggested By: Collaborator field; person suggesting idea.
For sizing ideas and selecting which to build.
Action to take:
Metric: Metric this experiment will affect. Linked-to-Table: <Metrics>. Metric with name and units, e.g. “ARR ($M)” or “Activation (% abs)”. This determines the units of the Oppt Size and Metric Result columns.
T-Shirt: Score: Computed T-shirt impact score; broadly, the reward-versus-effort tradeoff for this idea.
Formula: {T-Shirt#: Impact} - {T-Shirt#: Eng Cost} - {T-Shirt#: Design Cost} + {T-Shirt#: Adjustment}.
T-Shirt: Impact: T-shirt impact estimate, i.e. impact this experiment could have, broadly. Select: Low, Medium, High (maps to 1,5,10).
T-Shirt: Eng Cost: T-shirt estimate of eng cost. Select: None, Cheap, Medium, Expensive (maps to 1,2,3,4).
T-Shirt: Design Cost: T-shirt estimate of design cost. Select: None, Cheap, Medium, Expensive (maps to 1,2,3,4).
T-Shirt: Adjustment: T-shirt adjustment for other factors. Note: not required for T-Shirt Score to be calculated. Optional select: Very Hard, Hard, Easy, Very Easy (-2, -1, +1, +2). Provides easy way to adjust for a difficult surface or other factors. E.g. an experiment on a busy surface might be flagged as "Hard" to lower its T-Shirt: Score.
(Hidden fields): Note: We have "T-Shirt #: X" fields for each of the T-Shirt fields to make them numeric.
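A minimal sketch of how the T-Shirt Score formula combines the numeric mappings listed above (the function and dictionary names here are illustrative, not EVELYN fields):

```python
# Numeric mappings for the T-Shirt select fields, as listed above.
IMPACT = {"Low": 1, "Medium": 5, "High": 10}
COST = {"None": 1, "Cheap": 2, "Medium": 3, "Expensive": 4}
ADJUSTMENT = {"Very Hard": -2, "Hard": -1, "Easy": 1, "Very Easy": 2}

def t_shirt_score(impact, eng_cost, design_cost, adjustment=None):
    """Impact - Eng Cost - Design Cost + Adjustment (Adjustment is optional)."""
    score = IMPACT[impact] - COST[eng_cost] - COST[design_cost]
    if adjustment is not None:
        score += ADJUSTMENT[adjustment]
    return score

# Example: high-impact, cheap eng, no design work, on a tricky surface.
t_shirt_score("High", "Cheap", "None", "Hard")  # 10 - 2 - 1 - 1 = 6
```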
Action to take:
Oppt Size: Estimated Metric win if successful. Amount we’d move the Metric if this experiment works. Units are on the “Metric” column value.
Confidence: How confident we are that this experiment will win, expressed as a percent chance of success. In general, if you run many 50%-confidence experiments, you would expect around half of them to be successful.
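The calibration idea behind Confidence can be made concrete: the expected number of wins across a set of experiments is the sum of their confidences. A minimal sketch with a hypothetical portfolio:

```python
# Confidences as fractions; expected wins = sum of success probabilities.
confidences = [0.5, 0.5, 0.5, 0.5]  # four 50%-confidence experiments
expected_wins = sum(confidences)
print(expected_wins)  # 2.0 -- about half of them should win
```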
Project: Sub-team objective or project, if any. Flexible field so teams can specify a project or team objective, e.g. Foundational, Basic signup suggestion (on Intent), etc. Non-experiments should have this field set to “Foundational.”
Eng Estimate: Estimated weeks for engineering to build. Fractional weeks of fully-devoted eng time, e.g. “1”, “0.5”, or “4.5”.
Design Estimate: Estimated weeks for design to design. Fractional weeks of fully-devoted design time, e.g. “1”, “0.5”, or “4.5”.
Oppt ROI Estimate: Expected result per week of effort. Formula: {Oppt Size}*Confidence/({Eng Estimate}+{Design Estimate}).
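As a sketch, the ROI formula above computes the expected metric win per week of total effort (the function name and example values are illustrative):

```python
def oppt_roi(oppt_size, confidence, eng_estimate, design_estimate):
    """Oppt Size * Confidence / (Eng Estimate + Design Estimate)."""
    return oppt_size * confidence / (eng_estimate + design_estimate)

# E.g. a 2.0 ($M ARR) opportunity at 50% confidence, 3 eng + 1 design weeks:
oppt_roi(2.0, 0.5, 3, 1)  # 0.25 ($M per week of effort)
```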
Owner: Owner, e.g. person running the experiment, any function.
Designer: Designer for this experiment.
Engineer: Engineer building and/or owning.
Rule: Rule name for this experiment. Unique text string of the rule this experiment is keyed off of.
Surfaces: Surfaces your experiment or feature will affect. [Linked Record with Surfaces]. Either part of the site, feature, flow or url. Please be specific and exhaustive as we will use this to avoid conflicts and do historical analysis. Example values: mobile signup, referral onboarding, workspace upgrade, etc.
Populations: Populations this experiment is targeted at. Linked Record with Populations table, e.g. “Basic, EN.”
Designs: Designs associated with experiment. Image files (any format) for this experiment. Could be screenshots, mocks, wireframes, etc. Must be image file and directly uploaded, not url or InVision, etc.
Start: Date the experiment is planned to start running. Also see: "_Start: Running" at bottom
End: Date the experiment is planned to stop running.
Target GA Quarter: Quarter this is planned for GA, if any. Quarter and year, e.g. “Q2 2018.”
Building + Running
Actions to take:
Sprint: Sprint we plan to tile this against. Teams may use this to note the dash they expect this to land in or want it tiled in. This can be used to filter tasks for tiling, and to measure our team’s eng velocity. We don’t include the year, since most year-old experiments will already be filtered out (their Status will be Complete).
Sprint Commits: Engineering, product, and design goals and grades in a sprint/dash for a specific project. Teams may use this to scope and grade sprint commitments, with the priority first and then the goal, e.g. “P1: Build.” At the end of a sprint/dash, we can add grades in the form of an emoji, for a clean look: “P1: Build :)”
Analyzing / Concluding
Actions to take in this view:
Metric Result: How much we moved the Metric. Expressed as a number with 3 decimal places, e.g. “1.000” or “0.030”. Units are in the Metric column. For example, if the Metric were “Activation (% abs)” and the Metric Result was “5.000”, the experiment resulted in a 5% absolute gain.
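A tiny sketch of the three-decimal convention for Metric Result (the variable name is illustrative):

```python
# Metric Result is recorded with 3 decimal places, in the Metric's units.
metric_result = 5.0  # Activation (% abs): a 5% absolute gain
print(f"{metric_result:.3f}")  # prints "5.000"
```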
Learnings: Learnings we gained from running this experiment. Textual learnings, such as: “Telling users to Buy Now successfully got them to purchase immediately.” Could include link to Result paper doc.
GA’d Date: Date on which we GA’d this experiment. Date specifying when we did (not “will”) GA the experience to a set of users. This could be either setting the Stormcrow rule to 100% new variant or releasing the updated code. Ideally, this GA date is the date we started realizing the full Metric Result.
Cleaned Up: Whether this experiment is cleaned up. Checkbox: True - if cleaned up, else False/blank.
Complete (Wins/Losses)
Supports viewing past experiments, sipping a glass in quiet reflection.
Hidden Fields
Fields in the background, keeping the beat, like a bass guitar.