CSS Coverage Adventures
Introduction
Every side project starts with an itch. I recently read this article, which reports an extract from a podcast where Kent C. Dodds explains how, at PayPal, a project he was working on had 90% unused CSS. This can be explained by the fact that CSS rules have global scope by default, and for large projects it is not always obvious where the CSS is used, if it is used at all. Developers often prefer leaving unused CSS in place rather than risking breaking something in an unexpected place, and over time this can result in large files containing mostly unused CSS.
This prompted me to start checking the coverage on sites I commonly visit. I discovered that files with less than 25% of their bytes used were fairly common, and as I was looking at coverage reports for Bootstrap stylesheets with ridiculously low usage, I wondered how I could get a coverage report for an entire site rather than just one or a couple of pages.
In the absence of such hypothetical "fancy tooling", I thought maybe a bit of command-line voodoo could do the trick. The author also points out that it can be particularly hard to identify unused CSS across a website, because you need to check:
- every url
- every interaction
- different states in the case of single page applications
- different media queries
Sure, if you are the website administrator and want to know which lines of CSS you can safely delete, then you need an exhaustive method. But if you just want an estimate of how much unused CSS a regular user downloads, then exhaustiveness doesn't really matter as long as you simulate a typical user's navigation on the website. You can miss a couple of interactions and still get interesting information.
I initially tried to download the pages with wget but I realized that some links were rendered dynamically on the client side so I opted to do everything with Puppeteer instead. That way I could download the page and get the coverage report at the same time, saving time and more importantly, reducing traffic to the website.
All the scripts can be found here.
#1 - Getting coverage data
To get coverage data in Chromium-based browsers, you can open the command menu with Ctrl+Shift+P and select the "Show Coverage" item. You can then record the coverage while interacting with the page and, when you are done, download a JSON file with all the data.
By default the entire file is included in the "text" field of the coverage report. You may want to remove this field before writing to disk to save space. This can be easily automated with Puppeteer; here is a simple script to record and save CSS coverage:
#!/usr/bin/node
const puppeteer = require('puppeteer');
const fs = require('fs');

async function get_coverage() {
    const browser = await puppeteer.launch({
        headless: false
    });
    const page = await browser.newPage();
    await page.setCacheEnabled(false);
    await page.setViewport({
        width: 1920,
        height: 1080
    });
    await page.coverage.startCSSCoverage();
    await page.goto("https://lemonde.fr", {
        waitUntil: 'networkidle2',
    });
    //---------------
    // interactions here
    //---------------
    const css_coverage = await page.coverage.stopCSSCoverage();
    // remove resource contents to keep the report small
    css_coverage.forEach(entry => delete entry.text);
    // write synchronously so the file is complete before the browser closes
    fs.writeFileSync("coverage_log", JSON.stringify(css_coverage));
    await page.close();
    await browser.close();
}

get_coverage();
To get a more complete script, you can then add a command-line interface, exception handling, logging, etc.
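For instance, the hard-coded URL and output file could come from the command line. A minimal sketch of that first step (the argument order, defaults, and function name here are my own, not from the original script):

```javascript
// Minimal command-line handling sketch; argument order and defaults are my own.
function parseArgs(argv) {
    const [url, out] = argv;
    return {
        url: url || "https://lemonde.fr",  // page to analyse
        out: out || "coverage_log",        // file to write the report to
    };
}

const opts = parseArgs(process.argv.slice(2));
console.log(`Recording CSS coverage for ${opts.url} into ${opts.out}`);
```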
Now, if you want a coverage report that covers a large number of pages, you can either start recording and then visit all the URLs, or record independent coverage reports and aggregate them later. The former approach rapidly becomes impractical for large websites. The latter offers much more flexibility, as you can easily add or remove URLs from the analysis without having to re-crawl hundreds or thousands of pages.
For a static site or a site that is not very interactive, the process is straightforward:
- Download the landing page of a website (or any other page in the website)
- Record and save the CSS coverage
- Extract all links to other parts of the website
- Visit those pages and repeat the process
- When no new URL is discovered, aggregate the results
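The bookkeeping behind these steps boils down to a visited set and a queue of URLs still to crawl. A rough sketch (the helper and the stand-in link list are mine; the actual Puppeteer download step is elided):

```javascript
// Fold a page's links into the crawl state: anything not yet seen is
// marked visited and queued for download.
function enqueueNewUrls(links, visited, queue) {
    for (const url of links) {
        if (!visited.has(url)) {
            visited.add(url);
            queue.push(url);
        }
    }
}

// Usage sketch: start from the landing page and loop until no new URL shows up.
const visited = new Set(["/"]);
const queue = ["/"];
while (queue.length > 0) {
    const url = queue.shift();
    // ...download `url` with Puppeteer and save its coverage report...
    const links = ["/economie", "/"]; // stand-in for the links extracted from `url`
    enqueueNewUrls(links, visited, queue);
}
```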
#2 - Enumerating all the discoverable URLs
The next step after downloading a page is extracting all the URLs from the HTML. One solution could have been to use Puppeteer to select all the link elements and read the href attribute. Another could be an HTML parsing library like Beautiful Soup, but that seemed unnecessary; after all, HTML is text, and a couple of regexes should work fine.
# put every tag on its own line for easier parsing
sed -r 's/>/>\n/g' "$i" |
# only keep the URLs from the link elements
grep -E 'href' | sed -r "s/.*href=[\"']([^ ]*)[\"'].*/\1/" |
# sort alphabetically and remove duplicate URLs
sort -u |
# exclude links to files, anchors
grep -vE '\.css$|\.js$|\.ico$|\.png$|\.jpg$|\.json$|\.xml$|#' |
# transform absolute URLs into relative URLs
sed -r 's/https:\/\/(www\.)?lemonde\.fr//' |
# exclude all other domains
grep -vE '^https?'
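If you prefer staying in Node, the same extraction can be approximated with a regex over the raw HTML. A sketch with the same hard-coded domain (the function name and exact patterns are mine):

```javascript
// Pull href values out of the HTML, drop assets and anchors, relativize
// same-site URLs and discard everything pointing at other domains.
function extractLinks(html) {
    const hrefs = [...html.matchAll(/href=["']([^"' ]*)["']/g)].map(m => m[1]);
    return [...new Set(hrefs)]
        .filter(u => !/\.(css|js|ico|png|jpe?g|json|xml)$|#/.test(u))
        .map(u => u.replace(/^https:\/\/(www\.)?lemonde\.fr/, ""))
        .filter(u => !/^https?:/.test(u))
        .sort();
}
```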
#3 - Aggregating results
At this point, all we have to do is:
- For one stylesheet (e.g. main.css), find all the .coverage files that contain a corresponding entry.
- Concatenate all ranges of used bytes.
To easily parse JSON files from the command line we can use jq; then it is only a matter of sorting the ranges and collapsing the overlapping ones to get a final coverage report.
# read all coverage files
cat level_*/*.coverage |
# select only the ranges that correspond to the URL "https://lemonde.fr/main.css"
jq -r '.[] |
select(.url=="https://lemonde.fr/main.css") |
.ranges |
.[] |
"\(.start)-\(.end)"' |
# sort numerically on the start, then the end, of each range
sort -t'-' -k1,1n -k2,2n -u |
# merge consecutive overlapping ranges of numbers
awk -F'-' '
NR==1{left=$1;right=$2}
NR>=2{
    if($1 <= right){
        # overlap: extend the current range if needed
        if($2 > right) right=$2
    } else {
        print left" "right
        left=$1
        right=$2
    }
}
END{print left" "right}'
At this point, all you need to do is a bit of arithmetic: add the lengths of all the ranges together and compare the total to the file size in bytes.
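That arithmetic can be sketched in a few lines of JavaScript (the function name is mine; it re-merges the ranges so overlaps are not counted twice):

```javascript
// Merge sorted coverage ranges, sum their lengths, and compare the total
// to the stylesheet's size in bytes to get a used-bytes percentage.
function usedPercentage(ranges, fileSize) {
    const sorted = [...ranges].sort((a, b) => a.start - b.start || a.end - b.end);
    const merged = [];
    for (const r of sorted) {
        const last = merged[merged.length - 1];
        if (last && r.start <= last.end) {
            last.end = Math.max(last.end, r.end); // overlap: extend the range
        } else {
            merged.push({ ...r });
        }
    }
    const used = merged.reduce((sum, r) => sum + (r.end - r.start), 0);
    return (100 * used) / fileSize;
}
```

For example, two overlapping ranges {0, 50} and {25, 100} in a 400-byte file merge into a single 100-byte range, i.e. 25% used.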