Interview key change

Our user Jennifer Goede has sent us the following question:

Can you explain to me when the interview key changes? I made a list of mistakes interviewers made, with their names and ED numbers. However, when I look up the interview key, it is now allocated to another interview.

An interview key is an 8-digit number assigned to interviews on the server (for example, 12-34-56-78). Every interview on a server has a unique interview key (even when they are in different surveys). If the interview is created on the server, its interview key will never change.

Yet some interviews are created on tablets. Clearly there is a chance that two tablets will come up with the same number, and hence the same interview key. If that happens, the interview key of the first interview to reach the server remains, while the second one is changed.

This change can clearly occur only once, and at a particular time: at synchronization, when the interview reaches the server.

The chances of this happening are quite high. For a survey with 100,000 interviews you will see a noticeable number of interview key changes. So any program or script dealing with interview keys should be written with the expectation that this will happen, rather than with the hope that it will not.
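To put a number on this: with 8-digit keys there are 10^8 possible values, so by the standard birthday-problem approximation the expected number of colliding pairs among n tablet-created interviews is roughly n(n-1)/(2·10^8). A quick illustrative sketch (Python, not part of Survey Solutions):

```python
import math

KEY_SPACE = 10 ** 8  # 8-digit interview keys: 00-00-00-00 .. 99-99-99-99

def expected_collisions(n: int, space: int = KEY_SPACE) -> float:
    """Expected number of colliding key pairs among n independently drawn keys."""
    return n * (n - 1) / (2 * space)

def prob_at_least_one_collision(n: int, space: int = KEY_SPACE) -> float:
    """Birthday-problem approximation of P(at least one collision)."""
    return 1 - math.exp(-n * (n - 1) / (2 * space))

print(round(expected_collisions(100_000)))   # ~50 expected collisions for 100,000 interviews
print(prob_at_least_one_collision(1_000))    # still small for 1,000 interviews
```

So at 100,000 interviews, around 50 keys get reassigned, which is exactly why scripts must tolerate key changes.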

For working with exported data, match the records across files not on interview__key, but on the interview__id variable, which is stable (it never changes).
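For example, when combining two exported files outside of Stata, a join keyed on interview__id stays correct across synchronization, while one keyed on interview__key may silently mismatch after a key change. A minimal sketch in Python (the variable names follow the export convention; the record values are made up):

```python
# Join two exported record sets on interview__id (stable) rather than
# on interview__key (which may change at synchronization).
main = [
    {"interview__id": "a1b2", "interview__key": "12-34-56-78", "hh_size": 4},
    {"interview__id": "c3d4", "interview__key": "87-65-43-21", "hh_size": 2},
]
members = [
    {"interview__id": "a1b2", "name": "Alice"},
    {"interview__id": "a1b2", "name": "Bob"},
    {"interview__id": "c3d4", "name": "Carol"},
]

# Index the main file by the stable identifier, then attach it to each member row
by_id = {rec["interview__id"]: rec for rec in main}
joined = [{**by_id[m["interview__id"]], **m} for m in members]
print(joined[0]["hh_size"], joined[0]["name"])  # 4 Alice
```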


Thank you for the explanation.

Hi Sergiy,

A follow-up to this question: are the interview keys created completely at random, or is there any bias in the starting numbers? E.g., are certain interviewers given ranges of interview keys, or do earlier-created interviews receive lower starting digits, or anything like that?

Many thanks,

Lachlan

Hello @lachb, it is difficult to judge from the question whether this is something desirable that you want your project to rely on, or something undesirable that you would rather avoid as a risk. So please explain WHAT you want to do with the interview key.

In the above it was already written that “An interview key is an 8-digit number assigned to interviews on the server…”. This definition deliberately shies away from mentioning randomness. In practice this means randomness is not promised: even if the keys are random now, that may change in the future. There is nothing to hide, though, in an open-source project, and you can see for yourself how the interview key is generated in the current version: https://github.com/surveysolutions/surveysolutions/blob/master/src/UI/Interviewer/WB.UI.Interviewer/Services/InterviewerInterviewUniqueKeyGenerator.cs . But I wouldn’t rely on the ‘random nature’ of the interview key for any practical project. Even if it is “completely random”, that doesn’t necessarily mean it is “normally distributed” or “uniformly distributed”, which could be important for the things you want to do with it (which you didn’t describe).

Hi Sergiy,

Thank you for the detailed reply, very helpful. And apologies for the lack of detail.

We are investigating different methods to conduct a listing exercise and then complete the random selection and full interviews entirely in the field. We are aware it would obviously be better to do the selection back at HQ, but for various operational reasons this isn’t possible.

We have combined the listing and full survey into one Designer questionnaire, with the full-questionnaire sections enabled by a supervisor-level question. This is working nicely: the supervisor reviews all of the completed listings for the EA, activates those required for the full survey, and rejects them back to the interviewers to complete.

The question is: what is the best method to select x of the eligible listed cases at random for the full interview? One idea is to use the ‘Interviews’ tab in HQ: supervisors first filter on the required EA, then sort the cases by their interview key and take the first 12 eligible ones for the full survey. Hence the question about any inherent biases in the generation of the interview key that might make this selection non-random.

We have also considered an EA-level form for the supervisors to complete, which would make the selection, but there are concerns about the level of entry error and the missed cases that would likely occur with that approach.

If you have any thoughts, or alternative methods for selection we’d love to hear.

Many thanks,

Lachlan

Hello @lachb ,

I see the following as contradictory:

First you wrote:

and then later:

Will the listing data be sent to the server/HQ from the EA before the selection is made? If yes, do the selection centrally.

Hi Sergiy, sorry, bad choice of terms. Here I meant the physical headquarters, as in the central office, not the application 🙂

As follows from this source file, the value of the interview key is just:

Random.Next(99999999)

formatted to a "##-##-##-##" pattern.

If anything, I would be more concerned about the interviewers’ behavior than about the software’s behavior in this case. Since the interviewers see the interview keys and are aware of the selection strategy, they would realize quite quickly that interview 00-12-94-10 has a higher chance of being selected than interviews 65-00-13-89 or 99-99-99-99. So if there is anything they can manipulate to their advantage, they will do that (e.g. manipulate the household size, eligibility, etc.).

So despite any good statistical properties of the interview key, its use for this purpose is far from ideal.


The rest of the answer may be unnecessary, but just in case. Early versions of Survey Solutions didn’t have the concept of an interview key at all; it was added subsequently in an update, including retrospectively to interviews collected up to that date. The strategies for assigning keys differ somewhat (when it comes to resolving key collisions) depending on the origin of the interview (whether it existed at the time of the upgrade or is a new one). The differences probably start to matter only when the probability of a key collision is non-trivial (in the absence of any context, let’s assume this means something like 1 million interviews or more).

Given that it is very unlikely that you still retain interviews from a version predating 5.20, that the number of interviews on your server is low, and that the chances of a key collision within an EA are virtually non-existent, I wouldn’t be too concerned about the possibility of a bias in the assigned interview keys.

Here are a few reassuring stats on the distribution of the digits in the assigned keys (using 601 interview keys, all generated in a modern version of Survey Solutions):


. chitest digit, count sep(0)

observed frequencies of digit; expected frequencies equal

         Pearson chi2(9) =   7.6905   Pr =  0.566
likelihood-ratio chi2(9) =   7.5756   Pr =  0.577

  +---------------------------------------------------+
  | digit   observed   expected   obs - exp   Pearson |
  |---------------------------------------------------|
  |     0        481    480.800       0.200     0.009 |
  |     1        481    480.800       0.200     0.009 |
  |     2        528    480.800      47.200     2.153 |
  |     3        483    480.800       2.200     0.100 |
  |     4        478    480.800      -2.800    -0.128 |
  |     5        466    480.800     -14.800    -0.675 |
  |     6        497    480.800      16.200     0.739 |
  |     7        458    480.800     -22.800    -1.040 |
  |     8        460    480.800     -20.800    -0.949 |
  |     9        476    480.800      -4.800    -0.219 |
  +---------------------------------------------------+

and the Stata code used to obtain it:

clear all

// Load the diagnostics export and strip the dashes from the interview key
use "interview__diagnostics.dta"
generate keys=subinstr(interview__key,"-","",.)

// Extract each of the 8 digit positions and plot its distribution
forval i=1/8 {
	generate digit`i'=real(substr(keys,`i',1))
	label variable digit`i' "Digit in position `i'"
	histogram digit`i', d name(g`i') freq xlabel(0(1)9) color(stc1)
}

// Stack all positions into one long variable of digits
keep digit*
generate i=_n
reshape long digit, i(i) j(position)
label variable digit "Digit in any position"

histogram digit, d name(gtotal) freq xlabel(0(1)9) color(stc5)

graph combine g1 g2 g3 g4 g5 g6 g7 g8 gtotal, rows(3) scale(0.5)

// Test the digit frequencies against a uniform distribution
chitest digit, count sep(0)
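As a cross-check outside Stata, the Pearson statistic can be recomputed by hand from the observed digit counts (a Python sketch; the counts are copied from the table above, with expected frequency 4808/10 = 480.8):

```python
# Recompute the Pearson chi-square statistic for the digit counts above
observed = [481, 481, 528, 483, 478, 466, 497, 458, 460, 476]
expected = sum(observed) / len(observed)  # 4808 digits / 10 possible values = 480.8

chi2 = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi2, 4))  # 7.6905, matching the Stata chitest output
```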