Looking to buy a diamond? This how-to guide describes three steps I took to identify great deals on Blue Nile. “Founded in 1999, Blue Nile has grown to become the largest online retailer of certified diamonds and fine jewelry.” The code used in this analysis is available on GitHub. This guide proceeds as follows:

- download data from Blue Nile,
- model price as a function of diamond characteristics, and
- identify diamonds with extra low prices.

I’ve written a Python script to make downloading data from Blue Nile easy. The script has been posted here. To download data on all round diamonds on Blue Nile use the following command:

```
python download.py --shape RD > my-diamonds.csv
```

For more information on the optional arguments the script accepts use:

```
python download.py --help
```

Most of the download script is pretty easy to follow. Blue Nile is using Apache Solr to serve JSON documents describing diamonds on the site. The trickiest part is you can only get information on the first 1,000 diamonds for each query; Blue Nile has limited how far we can page through results. To work around this constraint, the download script pages through results based on price. I only mention this if you want to dig deeper into the download script.

On November 6, 2015, I downloaded data on all 98,886 round diamonds on Blue Nile. Below is a plot of diamond price versus carat weight (both on log scales).

Blue Nile’s buying guide describes how the four C’s (cut, color, clarity, and carat weight) are the most important characteristics when buying a diamond. It seems reasonable to model price as a function of those four characteristics. Having played around with the data bit, a multiplicative model seems like a good choice. I model price as a product of carat weight raised to the power $\beta$ times multipliers for the cut, color, and clarity of the diamond $$ price_i \propto carat_i^\beta \cdot cut_i \cdot color_i \cdot clarity_i. $$ Taking $\log$’s of both sides allows this model to be estimated using a linear regression $$ \log(price_i) = \alpha + \beta \log(carat_i) + \delta_{cut_i} + \delta_{color_i} + \delta_{clarity_i} + \epsilon_i. $$ Focusing on diamonds weighing between 1.00 and 1.99 carats, we can see the relationship between $\log(price_i)$ and $\log(carat_i)$ is remarkably linear, with diamond color shifting the intercept but not the slope of the relationship.

Below is a summary of the fitted linear model. Generally, I put very little weight on R-squared values, but this model explains 91.5% of the observed variance in log price!

```
##
## Call:
## lm(formula = fstring, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58043 -0.09121 0.00130 0.08608 0.62971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.904912 0.004815 1641.74 <2e-16 ***
## log(carat) 1.695256 0.005421 312.69 <2e-16 ***
## cut2 0.067069 0.003694 18.16 <2e-16 ***
## cut3 0.135940 0.003307 41.11 <2e-16 ***
## cut4 0.352969 0.005307 66.51 <2e-16 ***
## color2 0.161501 0.003914 41.26 <2e-16 ***
## color3 0.300675 0.003900 77.10 <2e-16 ***
## color4 0.409028 0.003890 105.16 <2e-16 ***
## color5 0.505657 0.003945 128.16 <2e-16 ***
## color6 0.587193 0.003949 148.71 <2e-16 ***
## color7 0.739886 0.004113 179.90 <2e-16 ***
## clarity2 0.155532 0.003710 41.92 <2e-16 ***
## clarity3 0.275004 0.003591 76.58 <2e-16 ***
## clarity4 0.341406 0.003642 93.74 <2e-16 ***
## clarity5 0.400936 0.003921 102.26 <2e-16 ***
## clarity6 0.498296 0.004064 122.61 <2e-16 ***
## clarity7 0.642261 0.005277 121.70 <2e-16 ***
## clarity8 0.952975 0.021096 45.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1408 on 19997 degrees of freedom
## Multiple R-squared: 0.9153, Adjusted R-squared: 0.9152
## F-statistic: 1.27e+04 on 17 and 19997 DF, p-value: < 2.2e-16
```

Exponentiating the coefficients from the regression model gives estimates of the price multipliers associated with different diamond characteristics. These multipliers can help a shopper decide what type of diamond to consider. The omitted categories (cut = Good, color = J, and clarity = SI2) have implicit coefficients of 0 and price multipliers of 1. Is a G-color diamond worth 1.51 times the price of a J-color diamond with the same cut, clarity, and carat weight?

Having read Blue Nile’s buying guide a few times, they’ve convinced me to care about all four of the four C’s. When purchasing a diamond, the following cut, color, and clarity are my baseline:

$cut_i \ge$ Ideal: Represents roughly the top 3% of diamond quality based on cut. Reflects nearly all light that enters the diamond. An exquisite and rare cut.

$color_i \ge$ H: Near-colorless. Color difficult to detect unless compared side-by-side against diamonds of better grades. An excellent value.

$clarity_i \ge$ VS1: Very Slightly Included: Imperfections are not typically visible to the unaided eye. Less expensive than the VVS1 or VVS2 grades.

Below I plot the diamonds that meet my baseline. Fitting a linear relationship between $\log(price_i)$ and $\log(carat_i)$, I highlight the best 1% of deals, the diamonds where the difference between expected and actual price is greatest $$ \alpha + \beta \log(carat_i) - \log(p_i) = -\epsilon_i. $$

The table below describes the top 10 diamonds found using my criteria. All of these diamonds have an ideal cut and H-color.

Residual | Clarity | Carat | Price |
---|---|---|---|

-0.63 | VS1 | 1.32 | 7,238 |

-0.61 | VS1 | 1.15 | 5,830 |

-0.59 | VS1 | 1.30 | 7,344 |

-0.59 | VS1 | 1.30 | 7,346 |

-0.59 | VVS2 | 1.62 | 10,866 |

-0.59 | VS1 | 1.11 | 5,575 |

-0.58 | VS1 | 1.22 | 6,623 |

-0.58 | VS1 | 1.15 | 5,999 |

-0.58 | VS1 | 1.21 | 6,566 |

-0.57 | VVS2 | 1.32 | 7,691 |

Disclaimer: This is one way to identify deals. A more general solution would allow shoppers to enter preference parameters similar to the regression coefficients found above. Taking preference parameters $\beta$, the best deals would maximize the shopper’s utility $$ u(X_i, p_i) = X_i \beta - p_i. $$