# FoR 2026 Dataset

This dataset is an anonymized derivative of the respondent-level data from the
SurveyUSA national poll (May 2026; n = 2,122, weighted to U.S. Census targets).

## Files

- `codebook.csv`: one row per column, giving its section, question text,
  response values, skip codes, and notes. You will want to read this first.
- `respondents.csv`: one row per respondent, 122 columns.

## Column naming

Columns are short codes from the original survey instrument (e.g. `D5j`,
`A3_conversations`, `F1a_az`). These codes are stable and convenient for
automation and indexing. The codebook maps them to question wording and
response labels.

Most analyses want the combined column for each question (e.g. `F1a`).
Columns ending in `_az` or `_za` hold responses only for the respondents who
saw the answer choices in alphabetical or reverse-alphabetical order,
respectively. They exist for order-effect analysis and are null for everyone
else. The `qtype` column in the codebook flags them as `RANDOMIZATION`.

An empty cell (no value, with nothing between the surrounding commas) always
means the respondent was not asked that question: either the column is a
randomization subset or the question was gated (e.g. the workplace questions
were presented only to respondents whose earlier employment answers qualified
them). The codebook notes specify which questions are contingent in this way.
There is no item nonresponse: respondents had to answer each question they
were shown before continuing. Most questions offered a "Not Sure" option (and
the trust questions offered an "N/A"). Some questions, such as the
demographics, required a definite choice.
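As a minimal sketch of the empty-cell rule, the snippet below separates respondents who were asked a question from those who were not. It uses only Python's standard `csv` module, which returns an empty string for an empty cell. The inline sample stands in for `respondents.csv`; its codes and weights are invented for illustration, and `split_asked` is a hypothetical helper name.

```python
import csv
import io

# Invented stand-in for respondents.csv. The column names `weight` and `B5`
# come from this README; the response codes and weights are made up.
SAMPLE = """weight,B5
1.2,1
0.8,
1.0,3
"""


def split_asked(rows, column):
    """Split rows into (asked, not_asked); an empty cell means 'not asked'."""
    asked = [r for r in rows if r[column] != ""]
    not_asked = [r for r in rows if r[column] == ""]
    return asked, not_asked


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
asked, not_asked = split_asked(rows, "B5")
print(len(asked), "asked;", len(not_asked), "not asked")
```

Because there is no item nonresponse, every non-empty cell is a real answer, including "Not Sure" and "N/A".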

## Weighting

Every respondent row has a `weight` value. Use it whenever you want an
estimate that is representative of the U.S. population.
Gender, age, race, and education are weighted to a strict 4D cross-product of
Census targets (96 cells), so unweighted counts will not match the figures
published on the website. The weights sum to a fixed total of 2,100.0, which
is slightly below the 2,122 rows. That scale factor cancels out of every
percentage, because it appears in both the numerator and the denominator, so
only the relative magnitude of the weights matters: results would be
unchanged if all weights were doubled.
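The scale-invariance point can be checked with a few lines of arithmetic. This is an illustrative sketch with invented weights, not data from the file; `weighted_share` is a hypothetical helper name.

```python
def weighted_share(weights, selected):
    """Weighted fraction of respondents flagged True in `selected`."""
    total = sum(weights)
    hit = sum(w for w, s in zip(weights, selected) if s)
    return hit / total


# Invented weights and responses, for illustration only.
weights = [0.5, 1.5, 1.0, 1.0]
selected = [True, False, True, False]

base = weighted_share(weights, selected)
doubled = weighted_share([2 * w for w in weights], selected)
print(base, doubled)  # the two shares are identical
```

Rescaling every weight by the same constant multiplies the numerator and the denominator by that constant, so the share does not move.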

## How percentages are computed

Every published percentage is a weighted share:

    pct = (sum of `weight` over respondents who gave the response)
        / (sum of `weight` over the base set of respondents)

The base set of respondents is built in two stages. First, only respondents
who were asked the question are eligible. Anyone screened out by gating is
already absent, appearing as an empty cell rather than a response. Second,
among those who were asked, we drop respondents who selected "N/A" on the D5
trust items (the respondent does not use AI for that task) or who selected
"I Don't Use AI" on the workplace item B5. Those two options mean the question
did not apply, so those respondents should be removed from the denominator. The
codebook's `skip_code` column explicitly marks, for each question, the numeric
response codes that would lead to exclusion. The codebook's `response_values`
column tells you the response text corresponding to each numeric response code.

Every other response stays in the base set of respondents. In particular,
"Not Sure" is reported as its own response category, not treated as missing
data. If you compute a "percentage of substantive responses" (dropping the
"Not Sure" responses), you will produce percentages that are 5 to 10 points
above the figures published on the site.
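The two-stage base described above can be sketched as follows. The rows, the response codes (`1`, `2`, `3` for "Not Sure", `9` for "N/A"), and the helper name `weighted_pct` are assumptions made for illustration; the real codes for each question are listed in the codebook's `response_values` and `skip_code` columns.

```python
def weighted_pct(rows, column, response, skip_codes=frozenset()):
    """Weighted percentage giving `response`, over respondents in the base."""
    base = 0.0
    hit = 0.0
    for r in rows:
        value = r[column]
        # Stage 1: empty cell means not asked. Stage 2: skip codes are dropped.
        if value == "" or value in skip_codes:
            continue
        w = float(r["weight"])
        base += w
        if value == response:
            hit += w
    return 100.0 * hit / base if base else float("nan")


# Invented rows; codes are hypothetical (3 = "Not Sure", 9 = "N/A").
rows = [
    {"weight": "1.0", "D5j": "1"},
    {"weight": "1.0", "D5j": "2"},
    {"weight": "1.0", "D5j": "3"},
    {"weight": "1.0", "D5j": "9"},
    {"weight": "1.0", "D5j": ""},
    {"weight": "1.0", "D5j": "1"},
]

published = weighted_pct(rows, "D5j", "1", skip_codes={"9"})
# Also dropping "Not Sure" shrinks the base and inflates the percentage.
substantive = weighted_pct(rows, "D5j", "1", skip_codes={"9", "3"})
print(round(published, 1), round(substantive, 1))
```

In this toy example the first figure follows the published convention, while the second shows how excluding "Not Sure" pushes the percentage upward.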

Breakdowns by group (support by party, sentiment by age) apply the same
formula within each group: the base is the summed `weight` of group members
who answered the item, under the same skip rules. This is how the by-party
and by-experience figures are computed.
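A group breakdown applies the same rule once per group. The sketch below assumes a hypothetical grouping column named `party` and invented response codes; check the codebook for the actual column names and skip codes.

```python
from collections import defaultdict


def weighted_pct_by_group(rows, group_col, column, response,
                          skip_codes=frozenset()):
    """Weighted percentage giving `response`, computed within each group."""
    hit = defaultdict(float)
    base = defaultdict(float)
    for r in rows:
        value = r[column]
        # Same skip rules as the overall figure: not asked, or a skip code.
        if value == "" or value in skip_codes:
            continue
        w = float(r["weight"])
        base[r[group_col]] += w
        if value == response:
            hit[r[group_col]] += w
    return {g: 100.0 * hit[g] / base[g] for g in base}


# Invented rows; `party` and the codes are hypothetical.
rows = [
    {"party": "A", "weight": "2.0", "F1a": "1"},
    {"party": "A", "weight": "1.0", "F1a": "2"},
    {"party": "A", "weight": "1.0", "F1a": ""},
    {"party": "B", "weight": "1.0", "F1a": "1"},
    {"party": "B", "weight": "3.0", "F1a": "3"},
]

by_party = weighted_pct_by_group(rows, "party", "F1a", "1")
print({g: round(p, 1) for g, p in by_party.items()})
```

Each group's denominator includes only that group's members who answered the item, so group bases can differ from the group's total weight.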

For select-all groups (qtype `SELECT_ALL`: A3, B2, C3, D4, dem_race), we
provide a separate Yes/No column for each option. An option's percentage is
the summed `weight` of its Yes responses divided by the summed `weight` of the
respondents who answered the group. For gated groups (e.g. B2, asked only of
employed people who use devices at work) that base is smaller than all
respondents: count the respondents with any non-null option in the group.
Where a question offered an "all of these" response option (D4), that
selection is already reflected in the individual option columns, so "all of
these" does not require any special handling.
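For a select-all group, the base can be sketched as "any option column non-empty". The option column names (`B2_x`, `B2_y`) and the literal `"Yes"`/`"No"` cell values below are assumptions for illustration; the codebook gives the real column names and response codes.

```python
def select_all_pct(rows, option_cols, option, yes_value="Yes"):
    """Weighted percentage selecting `option`, over those asked the group."""
    base = 0.0
    hit = 0.0
    for r in rows:
        # A respondent with every option empty was not asked the group.
        if all(r[c] == "" for c in option_cols):
            continue
        w = float(r["weight"])
        base += w
        if r[option] == yes_value:
            hit += w
    return 100.0 * hit / base if base else float("nan")


# Invented rows; column names and Yes/No encoding are hypothetical.
rows = [
    {"weight": "1.0", "B2_x": "Yes", "B2_y": "No"},
    {"weight": "2.0", "B2_x": "No", "B2_y": "Yes"},
    {"weight": "1.0", "B2_x": "", "B2_y": ""},
    {"weight": "1.0", "B2_x": "Yes", "B2_y": "Yes"},
]

cols = ["B2_x", "B2_y"]
pct_x = select_all_pct(rows, cols, "B2_x")
pct_y = select_all_pct(rows, cols, "B2_y")
print(pct_x, pct_y)
```

Because respondents can select several options, the percentages across a group's options need not sum to 100.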

## Privacy

To anonymize the data, we removed direct identifiers (record IDs, timestamps,
regional markers such as county code and ZIP code) and the free-text answers
(F1b, F2a, F2b), because open-ended responses can occasionally be identifying.
Respondents provided their age in years, but we binned these responses. Census
region, census division, and parental status are also withheld.
