I am currently stumbling over an issue where I am unsure whether the following post is a bug report, a feature/enhancement request, or whether I am just doing something wrong.
Assume a huge survey/census with hundreds of thousands of interviews. I want to build up an “external database” through frequent small batches of data exports from Survey Solutions, since I need the full census data updated very often for some other external processes.
Instead of requesting the full batch of interviews each time, which would take a lot of time and presumably puts a high load on the server (right?), I only want to export interviews that have changed/been updated since the last data export. To this end, I thought of making use of the “From” value in the body of the request. Per the documentation, this exports only interviews which were changed after the indicated date-time (UTC).
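The incremental export described above can be sketched as follows. This is a minimal sketch, not a confirmed reference: the `ExportType` and `QuestionnaireId` body fields and the `<guid>$<version>` identity format reflect my understanding of the v2 export API, and the questionnaire identity below is a placeholder; only the “From” field is taken directly from the scenario in this post.

```python
import json
from datetime import datetime, timezone

def build_export_request(questionnaire_identity, changed_after_utc, export_type="Tabular"):
    """Assemble a JSON body for POST /api/v2/export, restricting the export
    to interviews changed after changed_after_utc via the "From" filter."""
    return {
        "ExportType": export_type,
        "QuestionnaireId": questionnaire_identity,  # assumed "<guid>$<version>" form
        "From": changed_after_utc.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
    }

body = build_export_request(
    "11111111-1111-1111-1111-111111111111$1",  # placeholder identity
    datetime(2022, 5, 11, 12, 17, tzinfo=timezone.utc),
)
print(json.dumps(body, indent=2))
```

Each run would pass the timestamp of the previous export as `changed_after_utc`, which is exactly where the behavior described below becomes a problem.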
The issue: interviews that were COMPLETED before that date-time but only synced (i.e. updated) after it are not included in the export file.
What steps will reproduce the problem?
- Create an interview in the Interviewer app and complete it at 2022-05-11T12:15:00.000Z (UTC)
- Wait 5 minutes and sync the interview at 2022-05-11T12:20:00.000Z (UTC)
In the list of interviews in the HQ UI, the interview will now be listed as “Updated on: May 11, 2022 12:20”
The status history will show that the last change was made at May 11, 2022 12:15
- Create a POST request to start an export file with
"From": "2022-05-11T12:17:00.000Z", i.e. just in between the completion and the syncing of the interview
- Download the file
What is the expected/desired result?
The interview created at 12:15 is included in the export file. While it was last changed at 12:15, it was received on the server (“updated”) for the first time at 12:20.
What happens instead?
The interview is not included in the file.
- Application: Headquarters
- Version: 22.02.6 (build 32441)
Very interested to hear your thoughts!
Hello @peter_brueck ,
- you are not doing anything wrong. Requesting an export with a from+to filter is a perfectly fine query in the API;
- you are not reporting a bug; the described behavior corresponds to my understanding of what a change (to interview data) is. Taking as a starting point that the interview data is a sequence of edits from the blank questionnaire to the current state, it shouldn’t matter on which machine these edits are undertaken;
- imho, you are writing a feature request, asking for a different criterion for selecting the interviews to be included in the export data.
IMHO you are trying to include the timestamps of technical activities, such as data transfer, among the changes to the content of the interview. Also imho, that would require recording two timestamps for every event that takes place: one for when the event actually occurred, and another to denote when the server was notified about this event happening. Keep in mind that while we tend to think of ‘completion’ or ‘answering’ as instantaneous events, we can hardly say that about synchronization: it may take 5 days to synchronize an interview, and different events of the same interview may arrive at different times due to partial synchronization. To decide whether a certain interview is to be included or not would require keeping a trace, for each event, of when the server received it. Things get more complicated if we think about multiple layers, such as the presence of a supervisor tablet (when did the event itself occur? when was the supervisor tablet notified about it? when was the server notified about it?). This is regarding the feature request.
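The two-timestamps-per-event idea can be illustrated with a toy sketch (this is purely illustrative; the `InterviewEvent` type and both function names are invented for this example and are not part of Survey Solutions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InterviewEvent:
    occurred_utc: str                    # when the edit happened on the tablet
    received_utc: Optional[str] = None   # when the server learned of it (None until synced)

# The scenario from the original post: completed at 12:15, synced at 12:20.
completion = InterviewEvent(occurred_utc="2022-05-11T12:15:00Z")
completion.received_utc = "2022-05-11T12:20:00Z"

def changed_after(event: InterviewEvent, cutoff: str) -> bool:
    # The current "From" filter compares against the event's own time...
    return event.occurred_utc > cutoff   # ISO-8601 strings compare correctly

def received_after(event: InterviewEvent, cutoff: str) -> bool:
    # ...while the requested behavior would compare the server arrival time.
    return event.received_utc is not None and event.received_utc > cutoff

cutoff = "2022-05-11T12:17:00Z"
print(changed_after(completion, cutoff))   # False: last change was at 12:15
print(received_after(completion, cutoff))  # True: it reached the server at 12:20
```

The gap between the two booleans is exactly the interview that falls out of the incremental export.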
Regarding the actual task at hand, most certainly more than one of my colleagues has dealt with the same task before and can provide more specific recommendations on how to achieve this. I think it is advisable to maintain discipline and align the synchronization and reporting frequencies. For example, if you plan to build daily reports during your 3-month-long survey, tell your interviewers to synchronize hourly. Building daily reports when your interviewers are synchronizing weekly is setting yourself up for trouble.
I’d also advise actually getting some measurements done to understand the performance of the server. It may produce the output faster than you think. But then there is also the question of how quickly you can put that data into the external DB for the external processes to consume.
If it is any consolation, we had to solve about the same problem when building the reports in HQ. If an interview was collected on Monday, but sent to the server on Wednesday, to which cell of the quantity report does it belong? Monday or Wednesday? If you answered Monday, your reports can change retroactively: you provided a report about Monday to your boss on Tuesday, but then on Wednesday he will see that the Monday data has changed. Weird, right? If you answered Wednesday, then what does this report actually report? The load of work performed on a particular day? Or the throughput of networks, the quality of signal, and the synchronization frequency, which are the factors affecting how quickly the completed interviews reach the server?
Thanks Sergiy for your elaborate response!!
I hereby would like to propose a new feature:
Adding another filter to the request body of the
POST /api/v2/export query,
where this filter sets the start date of the timeframe of exported interviews based on when the last update to an interview was received by the server (the “Updated on” value). From a naive end-user perspective, it appears that the HQ application already has this value stored somewhere, as it is displayed in the “Interviews” list. If I see it correctly, the use-case scenario I describe above would then be covered by this filter.
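To make the proposal concrete, the request body might gain one extra field. The field name `UpdatedFrom` below is entirely hypothetical, chosen only for this sketch; it does not exist in the current API:

```python
# "UpdatedFrom" is a HYPOTHETICAL field name for the proposed filter. It would
# be keyed on the server-side "Updated on" value, while the existing "From"
# field is keyed on the time of the last change made to the interview itself.
proposed_body = {
    "ExportType": "Tabular",
    "QuestionnaireId": "<guid>$<version>",      # placeholder
    "From": "2022-05-11T12:17:00.000Z",         # existing filter: last change
    "UpdatedFrom": "2022-05-11T12:17:00.000Z",  # proposed: server receipt time
}
print(sorted(proposed_body))
```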
Having said that, you are right with
For now I will actually just stick to the regular full batch export and discuss this closely with the IT team w.r.t. measurement and performance. Maybe it isn’t such a big pain; I just wanted to “annoy” the server as little as possible. Will try to report back with experience & insights!
If any other users have experience with large-scale surveys (e.g. 200k+ interviews of 1.5 hours each, collected within days) and full survey exports, I’m interested to hear your thoughts & recommendations.
Could you please elaborate on this?
When I spoke about date-time of last “Update” I was referring to the value displayed to HQ Users on the Interview Tab, column ‘Updated On’, see below:
This value in my scenario above is the date-time of the first sync to server.
So, assuming this value is accessible in an easy way, the feature request could be a low-hanging fruit? Not knowing any of the nightmare behind changing these things, of course.
To build on @peter_brueck 's post, it would be good to have accessible the dates an interview:
- Was first completed by an interviewer. This can be computed from interview__actions, but would be nice to have pre-computed.
- Was last completed by an interviewer. This can also be computed from interview__actions.
- Was first posted to the server. The time updated will change, I believe, each time an interview is acted upon by any user (e.g., Supervisor rejects → update, Interviewer syncs again → update, Supervisor approves → update, Headquarters approves → update). If one wants to see how recently interviewers have synced interviews to the server, it’s useful to have this type of data. To my knowledge, this data can only really be found by parsing interview action logs to recover the dates of sync attempts.
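Computing the first and last completion dates from interview__actions can be sketched as below. The sample data is a toy stand-in, and the column names (`interview__id`, `action`, `date`, `time`) are assumptions that may differ between Survey Solutions versions, so treat this as a pattern rather than a drop-in script:

```python
import csv
from io import StringIO

# Toy stand-in for the interview__actions.tab export file.
SAMPLE = """interview__id\taction\tdate\ttime
aaa\tCompleted\t2022-05-10\t09:00:00
aaa\tRejectedBySupervisor\t2022-05-11\t08:00:00
aaa\tCompleted\t2022-05-11\t12:15:00
"""

def completion_dates(actions_file):
    """Return {interview__id: (first_completed, last_completed)} timestamps."""
    first, last = {}, {}
    for row in csv.DictReader(actions_file, delimiter="\t"):
        if row["action"] != "Completed":
            continue
        ts = f'{row["date"]}T{row["time"]}'
        iid = row["interview__id"]
        first[iid] = min(first.get(iid, ts), ts)
        last[iid] = max(last.get(iid, ts), ts)
    return {iid: (first[iid], last[iid]) for iid in first}

print(completion_dates(StringIO(SAMPLE)))
```

For a real export, the same function would be pointed at the interview__actions file from the downloaded archive.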
I just saw this post now.
In the context of a census of many hundreds of thousands of interviews I have a similar situation.
In my daily workflow I disregard the date last updated and instead rely on the status of the interview.
Only Completed interviews are processed (in your case, exported). The workflow then changes the status of completed interviews to (automated) approved or rejected. This way I don’t miss out on interviews coming in late. Could this be a way to handle your situation?
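This status-driven loop can be simulated in a few lines. The sketch below is a toy in-memory version under the assumption that, in practice, the status change would be an approve (or reject) call against the Headquarters API rather than a dictionary update:

```python
def process_batch(interviews):
    """One pass of the status-driven workflow: pick up every Completed
    interview for export, then mark it approved so the next pass skips it.
    Interviews that sync late simply appear as Completed in a later pass."""
    exported = []
    for itv in interviews:
        if itv["status"] == "Completed":
            exported.append(itv["id"])
            # In reality this would be an approve/reject API call.
            itv["status"] = "ApprovedByHeadquarters"
    return exported

queue = [
    {"id": "a", "status": "Completed"},
    {"id": "b", "status": "InterviewerAssigned"},  # still on the tablet
]
print(process_batch(queue))        # first pass picks up "a"
queue[1]["status"] = "Completed"   # "b" syncs late
print(process_batch(queue))        # a later pass still catches "b"
```

Because each pass keys off the status rather than a timestamp, a late-arriving interview can never be skipped.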
Thanks for your insights Klaus. Your scenario makes sense, though it would only be a workaround/additional query for me, since, as of now at least, we do not plan to make use of the interview statuses (e.g. approving/rejecting, whether automated or manual).
But if pilot/IT-team flags any issues re full data export, I might think about this again! Thanks!