We occasionally encounter cases where a search cannot be performed because a particular repo has failed. In that situation, the search UI does not allow any searches while that repo is selected.
What does "repo failed" mean?
The "readiness" status of the service responsible for searching a particular repo (indexsearcher) is tracked by the central search service (merger) in an "alive" field for each repo. If the "alive" status of a repo is false in the config file of the merger service, the failed repo issue appears when searching in that repo. This can happen if the indexsearcher service for that repo is not running as expected, or if the config file of the merger service has not been updated to reflect the status of the indexsearcher service.
Mitigation:
Whenever the repo failed error appears for a particular repo, first check the logs of the indexsearcher service for that repo.
tail -f -n 50 /opt/immune/var/log/service/indexsearcher_<repo_name>/current
# Replace <repo_name> with the actual repo name. For example, for a repo named "Windows", the command will be:
tail -f -n 50 /opt/immune/var/log/service/indexsearcher_Windows/current
The above command outputs the last 50 lines of the log for the indexsearcher service of that repo and, because of the -f option, keeps printing new lines as they are written.
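When the log is busy, it can help to filter the recent lines for likely problems instead of reading them all. The following is a minimal sketch: the helper name recent_errors and the keyword list (error, exception, traceback, critical) are assumptions for illustration, not strings the indexsearcher is documented to log, so adjust them to what the log actually contains.

```shell
#!/bin/sh
# Hypothetical helper: print error-like lines from the last N lines of a log.
# The keyword list is an assumption; it is not taken from LogPoint documentation.
recent_errors() {
    log_file="$1"
    lines="${2:-50}"   # default to the last 50 lines, as in the tail command above
    tail -n "$lines" "$log_file" | grep -iE 'error|exception|traceback|critical'
}

# Example call, using the log path pattern from this article:
# recent_errors /opt/immune/var/log/service/indexsearcher_Windows/current 200
```

The function returns a non-zero exit status when no matching line is found, which makes it usable in a conditional.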
You can also check whether the indexsearcher is replying to the alive probe by running tcpdump on the port configured in query.source.socket.
grep "query.source.socket" /opt/immune/etc/config/indexsearcher_<repo_name>/config.json
tcpdump -i any port <query.source.socket port> -Aq
If the indexsearcher is alive, you should see {"isalive":true} and {"alive":true} messages.
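The two steps of the probe check (look up the port, then capture on it) can be chained. This is only a sketch under a stated assumption: that the query.source.socket value ends in a port number, for example something like "tcp://127.0.0.1:5562". That value format and the helper name socket_port are illustrative and not confirmed by this article, so verify the grep output by eye before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: print the last number found on the query.source.socket
# line of a config file. Assumes the value ends in the port number.
socket_port() {
    config_file="$1"
    grep -F 'query.source.socket' "$config_file" | grep -oE '[0-9]+' | tail -n 1
}

# Example call (needs root for tcpdump; path pattern taken from this article):
# port=$(socket_port /opt/immune/etc/config/indexsearcher_Windows/config.json)
# tcpdump -i any -Aq port "$port"
```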
If there are no errors in the output of the tail command and "alive":true messages are seen in the tcpdump output, but the failed repo error still appears when searching, check the alive status in the merger config.
grep -B1 alive /opt/immune/etc/config/merger/config.json
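To narrow the output of the grep above to repos that the merger currently considers down, the pattern can be restricted to false values. This is a sketch that assumes the config is formatted as JSON with entries written like "alive": false, and that the preceding line identifies the repo (which is what the -B1 in the command above suggests). The helper name dead_repos is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: show each "alive": false entry together with the line
# before it. The exact layout of the merger config is an assumption.
dead_repos() {
    config_file="$1"
    grep -B1 -E '"alive"[[:space:]]*:[[:space:]]*false' "$config_file"
}

# Example call, using the merger config path from this article:
# dead_repos /opt/immune/etc/config/merger/config.json
```

No output (and a non-zero exit status) means no entry with a false alive value was found in that file.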
Potential Scenarios
- Indexsearcher service is recently restarted
If the indexsearcher service for a repo has just been restarted, then for large repos it takes a few minutes to scan the metadata of the stored indexes before searches can be served. During that period, the repo failed error is observed.
- LogPoint machine is recently rebooted
If a LogPoint machine has recently been rebooted, the indexsearcher services take time to initialize. During those few minutes, the repo failed error can be seen for some repos.
- Issue in the indexsearcher service
If there is an error in the indexsearcher service, the repo failed issue does not resolve on its own.
In such scenarios, please review the logs of the indexsearcher service as mentioned above. It is recommended to create a support ticket for further investigation and resolution of the problem. It is helpful to include the service log of the indexsearcher service for that repo in the ticket. The log file is located at /opt/immune/var/log/service/indexsearcher_<repo_name>/current
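One way to attach the log to a ticket is to compress it first. The following is a minimal sketch using standard tar options; the helper name bundle_service_log and the output file name are hypothetical, and the support process itself may ask for a different format.

```shell
#!/bin/sh
# Hypothetical helper: pack a single service log file into a gzip-compressed
# tar archive so it can be attached to a support ticket.
bundle_service_log() {
    log_file="$1"   # e.g. the "current" file of the indexsearcher service
    archive="$2"    # where to write the .tar.gz archive
    # -C changes into the log directory so the archive holds only the file name.
    tar -czf "$archive" -C "$(dirname "$log_file")" "$(basename "$log_file")"
}

# Example call, using the log path pattern from this article:
# bundle_service_log /opt/immune/var/log/service/indexsearcher_Windows/current /tmp/indexsearcher_Windows_log.tar.gz
```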