---
title: "Data Formatting and Encoding"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Formatting and Encoding}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.retina = 3,
  comment = "#>"
)
```

```{r, child='../man/rmdchunks/header.Rmd'}
```

# Basic required format

```{r, child='../man/rmdchunks/dataFormat.Rmd'}
```

The {logitr} package contains several example data sets that illustrate this data structure. For example, the `yogurt` contains observations of yogurt purchases by a panel of 100 households [@Jain1994]. Choice is identified by the `choice` column, the observation ID is identified by the `obsID` column, and the columns `price`, `feat`, and `brand` can be used as model covariates:

```{r}
library("logitr")

head(yogurt)
```

This data set also includes an `alt` variable that determines the alternatives included in the choice set of each observation and an `id` variable that determines the individual as the data have a panel structure containing multiple choice observations from each individual.

# Continuous versus discrete variables

Variables are modeled as either continuous or discrete based on their data _type_. Numeric variables are by default estimated with a single "slope" coefficient. For example, consider a data frame that contains a `price` variable with the levels $10, $15, and $20. Adding `price` to the `pars` argument in the main `logitr()` function would result in a single `price` coefficient for the "slope" of the change in price.

In contrast, categorical variables (i.e. `character` or `factor` type variables) are by default estimated with a coefficient for all but the first level, which serves as the reference level. The default reference level is determined alphabetically, but it can also be set by modifying the factor levels for that variable. For example, the default reference level for the `brand` variable is `"dannon"` as it is alphabetically first. To set `"weight"` as the reference level, the factor levels can be modified using the `factor()` function:

```{r}
yogurt2 <- yogurt

brands <- c("weight", "hiland", "yoplait", "dannon")
yogurt2$brand <- factor(yogurt2$brand, levels = brands)
```

# Creating dummy coded variables

If you wish to make dummy-coded variables yourself to use them in a model, I recommend using the `dummy_cols()` function from the [{fastDummies}](https://github.com/jacobkap/fastDummies) package. For example, in the code below, I create dummy-coded columns for the `brand` variable and then use those variables as covariates in a model:

```{r}
yogurt2 <- fastDummies::dummy_cols(yogurt2, "brand")
```

The `yogurt2` data frame now has new dummy-coded columns for brand:

```{r}
head(yogurt2)
```

Now I can use those columns as covariates:

```{r}
mnl_pref_dummies <- logitr(
  data    = yogurt2,
  outcome = 'choice',
  obsID   = 'obsID',
  pars    = c(
    'price', 'feat', 'brand_yoplait', 'brand_dannon', 'brand_weight'
  )
)

summary(mnl_pref_dummies)
```