EPSO Data Science | Field-related MCQ resources

This study guide is for candidates preparing for the EPSO ICT competition, profile 4: Administrators (AD7) in the field of data science: EPSO/AD/429/26 - 4.

It accompanies EU Training’s field-related Data Science practice questions and brings together the main topics and free learning resources used when creating the question set.

The 2026 EPSO ICT AD7 competition has four separate fields. This guide focuses only on Field 4: Data Science.

The field-related multiple-choice test is taken in your language 2. It contains 30 questions, lasts 40 minutes and has a pass mark of 15/30. Because this test is used for ranking, your goal should not be to scrape past the threshold. You need to prepare for applied data-science questions under time pressure.

At AD7 level, Data Science is not just statistics, Python or machine learning. The profile is aimed at experienced professionals who understand data architecture, data engineering, modelling, machine learning, semantic technologies, MLOps, visualisation, governance and expert communication.

The practice questions are designed to test applied judgement. You may see conceptual distinctions, small computations, architecture choices, production-design questions, modelling pitfalls and governance scenarios.

A strong candidate should be able to distinguish:

ETL from ELT
accuracy from calibration
a data lake from a data warehouse
taxonomy from ontology
batch processing from streaming
a dashboard from a governed metric definition
model accuracy in a notebook from production readiness
a bigger model from the real analytical fix

Use this guide as a study route. Start with the official EPSO competition notice, then work through the topic areas below and focus most on the areas where your practice results are weakest.

The EPSO ICT AD7 Notice of Competition

Always start with the official EPSO Notice of Competition. It is the only legally binding source for the competition rules, test structure, eligibility requirements and field duties.

For Data Science, the Notice describes a broad technical and analytical role. It covers designing data architectures, pipelines and large-scale data processing systems; applying statistical modelling, machine learning and advanced analytics; developing data integration solutions, APIs, visualisation tools and automated insights; and implementing MLOps, data governance and high-performance computing solutions.

NoC link: https://eur-lex.europa.eu/eli/C/2026/2425/oj

EU Training's practice questions follow that scope. They are not limited to model training. They also cover ingestion, distributed processing, semantic modelling, statistical reasoning, API design, production model governance, dashboards, self-service analytics and communication with non-specialist stakeholders.

A frequent trap in this profile is choosing a technically impressive answer that does not fix the problem in the question. A larger cluster will not fix data leakage. A neural network will not fix selection bias. A better chart will not fix an inconsistent metric definition.

As you practise, keep asking:

What problem is the question really testing?
Is this a data, model, pipeline, architecture, governance or communication issue?
Does the answer fix the cause, or only improve the surrounding process?
Is the result valid for the decision being made?
Is the metric defined consistently?
Is the model ready for production, or only working in development?

The aim is not to memorise tool names. It is to recognise the data-science judgement EPSO may test under time pressure.

Main topic areas

The EU Training Data Science practice set contains 200 questions across eight topic areas. These areas reflect the breadth of the EPSO profile, so the questions are not limited to one tool, one programming language or one type of model.

Use this section as a study map. For each topic, focus on the practical distinction being tested, not just the terminology.

Data engineering and distributed data ecosystems

This area covers ETL and ELT, change data capture, streaming, data lakes, warehouses, lakehouses, distributed joins, skew, idempotency, orchestration, small files, data contracts and quality checks.

Think of pipelines as systems that change state. Ask whether a retry is safe, whether a job is full-refresh or incremental, whether data is late or duplicated, and whether the bottleneck is computation, shuffle, file overhead or governance.
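
The retry question can be made concrete with a minimal sketch. It assumes a hypothetical batch of records with `id`, `updated_at` and `amount` fields, and uses in-memory Python structures as stand-ins for a target table. A keyed upsert gives the same result however many times the batch is replayed, while a blind append double-counts on retry.

```python
# Hypothetical records: the field names are illustrative, not from any real schema.
batch = [
    {"id": 1, "updated_at": 10, "amount": 100},
    {"id": 2, "updated_at": 11, "amount": 50},
]


def upsert_batch(target: dict, rows: list) -> dict:
    """Idempotent load: keep one row per id, preferring the newest version."""
    for row in rows:
        current = target.get(row["id"])
        # A late-arriving older version must not overwrite a newer one.
        if current is None or row["updated_at"] >= current["updated_at"]:
            target[row["id"]] = row
    return target


def append_batch(target: list, rows: list) -> list:
    """Non-idempotent load: every replay adds the rows again."""
    target.extend(rows)
    return target


table = {}
upsert_batch(table, batch)
upsert_batch(table, batch)  # simulated retry after a timeout
upsert_total = sum(row["amount"] for row in table.values())

log = []
append_batch(log, batch)
append_batch(log, batch)  # the same retry, now double-counted
append_total = sum(row["amount"] for row in log)

print(upsert_total, append_total)
```

The upsert total stays at 150 after the retry, while the append total doubles to 300. The fix for the second case is a key or deduplication step, not more compute.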

Advanced statistical modelling and quantitative analysis

This area covers overfitting, leakage, train-test splits, calibration, p-values, confidence and prediction intervals, bias, variance, sampling, outliers, forecasting and regression diagnostics.

Focus on what is actually being estimated. High accuracy may hide class imbalance. A small p-value does not imply a large effect size. A randomly shuffled split is wrong for forecasting, because the model is then trained on observations that come after the ones it is tested on. A confidence interval for a mean is not a prediction interval for a new observation.
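
Two of these distinctions can be checked with a few lines of arithmetic. The sketch below uses made-up numbers: a test set with 5% positives scored by a model that always predicts the majority class, and an assumed sample of size 100 with standard deviation 10. The intervals use a normal approximation for simplicity; a small sample would call for a t critical value instead.

```python
import math
from statistics import NormalDist

# Hypothetical imbalanced test set: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A useless "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")

# Assumed summary statistics for the interval comparison.
n = 100
sample_sd = 10.0
z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval

# Confidence interval for the mean: shrinks as n grows.
ci_half_width = z * sample_sd / math.sqrt(n)
# Prediction interval for one new observation: never narrower than z * sd.
pi_half_width = z * sample_sd * math.sqrt(1 + 1 / n)
print(f"CI half-width={ci_half_width:.2f}, PI half-width={pi_half_width:.2f}")
```

The model scores 95% accuracy while finding none of the positives. The prediction interval is roughly ten times wider than the confidence interval, because it has to cover the spread of individual observations and not only the uncertainty in the mean.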

Data architecture and semantic technologies

This area covers RDF, SPARQL, SKOS, ontologies, taxonomies, URIs, DCAT, knowledge graphs, controlled vocabularies, metadata, linked data and semantic interoperability.

The key is understanding the structure behind the terms. RDF expresses every statement as a subject-predicate-object triple. A taxonomy gives hierarchy. An ontology can express richer relationships and constraints. Shared identifiers and vocabularies make data easier to compare across systems, organisations and countries.
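
The triple structure, and the idea of matching a pattern against it, can be sketched in plain Python. This is only an illustration: the `ex:` identifiers are hypothetical, the `match` helper is not a real SPARQL engine, and `None` stands in for a query variable.

```python
# A tiny graph as a set of (subject, predicate, object) triples.
# The "ex:" names are made up; "skos:broader" and "skos:prefLabel" are SKOS terms.
triples = {
    ("ex:Sparrow", "skos:broader", "ex:Bird"),
    ("ex:Bird", "skos:broader", "ex:Animal"),
    ("ex:Sparrow", "skos:prefLabel", "Sparrow"),
}


def match(pattern, graph):
    """Return all triples that fit the pattern; None matches anything."""
    s, p, o = pattern
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Roughly what a SPARQL pattern like "?child skos:broader ?parent" asks for.
broader_links = match((None, "skos:broader", None), triples)
for child, _, parent in broader_links:
    print(child, "->", parent)
```

The two `skos:broader` links form a simple hierarchy, which is what a taxonomy provides. An ontology would add further statements about classes, properties and constraints on top of the same triple model.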

HPC, GPU analytics, parallel processing and optimisation

This area covers Amdahl’s law, memory-bound versus compute-bound workloads, GPU fit, strong scaling, load imbalance, barriers, checkpointing, message passing, roofline analysis and communication overhead.

Start with the bottleneck. A GPU helps regular data-parallel work, not every workload. More processors help only until serial work, communication or I/O dominates. The right optimisation depends on what is actually limiting performance.
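
Amdahl's law makes the limit from serial work explicit: with a parallel fraction p and n processors, the speedup is 1 / ((1 - p) + p / n). A short sketch, using an assumed parallel fraction of 0.9 purely as an example:

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Theoretical speedup under Amdahl's law for a fixed problem size."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Assumed workload: 90% of the runtime can be parallelised.
for n in (1, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

With 10% serial work, the speedup can never reach 10, however many processors are added. The formula ignores communication and I/O costs, which in practice lower the ceiling further.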

Advanced data integration and API development

This area covers REST design, idempotency keys, pagination, contract testing, events, queues, data virtualisation, change events, API versioning, retries and integration reliability.

Think about what happens when something fails. A payment retry must not double-charge. A duplicate event must not double-count. A consumer contract should catch breaking changes before release. Pagination keeps one oversized response from exhausting memory or timing out the service.
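
The payment case is usually handled with an idempotency key: the client sends the same key on every retry, and the server replays the stored response without repeating the side effect. A minimal in-memory sketch follows; the `PaymentService` class and its fields are hypothetical, and a real service would persist the keys and handle concurrent requests.

```python
class PaymentService:
    """Toy service that charges at most once per idempotency key."""

    def __init__(self):
        self._responses = {}  # idempotency key -> stored response
        self.charges = []     # side effects actually performed

    def charge(self, idempotency_key: str, amount: int) -> dict:
        # A retry with a known key replays the original response.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges.append(amount)
        response = {"status": "charged", "amount": amount,
                    "charge_no": len(self.charges)}
        self._responses[idempotency_key] = response
        return response


service = PaymentService()
first = service.charge("key-123", 50)
retry = service.charge("key-123", 50)  # client retried after a timeout
print(first == retry, len(service.charges))
```

The retry returns the same response and only one charge is recorded. The same idea applies to event consumers, where a processed-event identifier plays the role of the key.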

MLOps, automated ML and production model operations

This area covers model registries, feature stores, drift, monitoring, training-serving skew, versioning, rollback, approval, explainability, human oversight and responsible deployment.

A model that works in a notebook is not automatically production-ready. You need data lineage, training reproducibility, model versioning, monitoring, alerting, rollback options and a plan for drift and retraining.
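
Drift monitoring can be as simple as comparing the distribution of a feature at training time with its distribution in production. One common measure is the population stability index (PSI). The sketch below assumes the feature has already been binned into proportions, the numbers are invented, and the 0.2 alert threshold is a widely used rule of thumb, not a formal standard.

```python
import math


def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two lists of bin proportions of equal length."""
    if len(expected) != len(actual):
        raise ValueError("both distributions need the same number of bins")
    total = 0.0
    for e, a in zip(expected, actual):
        # Guard against empty bins, which would make the log undefined.
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


# Hypothetical bin proportions for one feature.
training_bins = [0.25, 0.25, 0.25, 0.25]
production_bins = [0.10, 0.20, 0.30, 0.40]

psi = population_stability_index(training_bins, production_bins)
ALERT_THRESHOLD = 0.2  # assumed convention; tune per use case
print(f"PSI={psi:.3f}, alert={psi > ALERT_THRESHOLD}")
```

A check like this detects that inputs have shifted. It does not say whether model quality has dropped, which is why monitoring plans usually pair it with performance metrics once labels arrive.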

Analytics, visualisation and self-service insight delivery

This area covers dashboards, KPIs, chart choices, uncertainty communication, metric definitions, aggregation traps, self-service analytics and the risk of misleading decision-makers.

A chart is not neutral. Ask what decision the visual is meant to support, whether the denominator is clear, whether the metric is defined consistently, and whether uncertainty or missing data must be shown.
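
The denominator question is easy to show with numbers. In this sketch, two hypothetical regions report how many cases were processed on time. Averaging the two regional rates and pooling the underlying counts give very different headline figures.

```python
# Hypothetical counts: (cases processed on time, total cases).
regions = {
    "Region A": (9, 10),
    "Region B": (500, 1000),
}

# Average of rates: every region counts equally, whatever its size.
rates = [on_time / total for on_time, total in regions.values()]
mean_of_rates = sum(rates) / len(rates)

# Pooled rate: every case counts equally.
total_on_time = sum(on_time for on_time, _ in regions.values())
total_cases = sum(total for _, total in regions.values())
pooled_rate = total_on_time / total_cases

print(f"mean of regional rates: {mean_of_rates:.1%}")
print(f"pooled rate:            {pooled_rate:.1%}")
```

The mean of regional rates is 70%, while the pooled rate is about 50%. Neither is wrong in itself, but a dashboard needs one governed definition so that two reports do not show different values under the same KPI name.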

Data governance, stewardship and expert communication

This area covers data classification, stewardship, lineage, ownership, access, quality accountability, open data, personal data, controlled vocabularies and communication to senior audiences.

Governance is not bureaucracy for its own sake. It makes data usable, trusted and safe. Strong answers often name an owner, define a standard, expose lineage, classify sensitivity or explain uncertainty clearly to non-specialists.

How to read the questions

EU Training's Data Science questions are deliberately cross-disciplinary. You may need to recognise an engineering problem inside a modelling scenario, a governance weakness inside a dashboard question, or a communication issue inside a technically correct analysis.

The correct answer is usually the one that fixes the real analytical, architectural or governance cause. A technically valid answer can still be wrong if it solves a different problem.

When reviewing missed questions, write down the decisive distinction in one line. For example:

ETL versus ELT
accuracy versus calibration
confidence interval versus prediction interval
taxonomy versus ontology
batch versus stream
data lake versus warehouse
leakage versus overfitting
correlation versus causation
model performance versus production readiness
better visualisation versus better metric definition

For mathematical questions, keep the calculation simple and then spend time on interpretation. EPSO-style specialist questions often care as much about the conclusion as about the formula.

Free learning resources for Data Science

The resources below are free to read or study. They were selected because they support the kinds of questions covered in EU Training’s Data Science practice questions.

You do not need to read everything from start to finish. Read the Notice of Competition thoroughly first, especially the qualifications and duties required for your profile, then pick from the other resources according to your weaker areas.

Official competition source

EU Careers ICT AD7 Notice of Competition

Start with the official source for the competition, which sets out the duties required for your profile. Also check the EPSO profile page regularly for important updates.

NoC link: https://eur-lex.europa.eu/eli/C/2026/2425/oj

EPSO profile page link: https://eu-careers.europa.eu/en/job-opportunities/data-science

Statistics and machine learning foundations

OpenIntro Statistics

A free statistics textbook with clear explanations of distributions, inference, regression and uncertainty. Useful for refreshing statistical reasoning.

Link: https://www.openintro.org/book/os/

Google Machine Learning Crash Course

A free practical introduction to machine learning, evaluation, overfitting, validation, embeddings and model interpretation.

Link: https://developers.google.com/machine-learning/crash-course

scikit-learn User Guide

Useful for supervised learning, model selection, cross-validation, preprocessing, metrics and pipelines.

Link: https://scikit-learn.org/stable/user_guide.html

Reproducible and responsible data science

The Turing Way

A free handbook on reproducible, ethical and collaborative data science. Useful for governance, documentation, reproducibility and responsible practice.

Link: https://book.the-turing-way.org/

Distributed processing and data engineering

Apache Spark documentation

Official documentation on distributed processing, SQL, streaming and large-scale data workloads.

Link: https://spark.apache.org/docs/latest/

MLOps and production model operations

MLflow documentation

Useful for experiment tracking, model registry, deployment and model lifecycle management.

Link: https://mlflow.org/docs/latest/index.html

Google MLOps guide

A practical article on continuous delivery and automation pipelines for machine learning. Useful for production model operations.

Link: https://docs.cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

Semantic technologies and linked data

W3C RDF 1.1 Primer

A free primer on RDF triples and linked data. Useful for semantic-web questions.

Link: https://www.w3.org/TR/rdf11-primer/

W3C SPARQL 1.1 Query Language

The W3C specification for querying RDF graphs. Use it for the basic idea of triple-pattern matching and variable bindings.

Link: https://www.w3.org/TR/sparql11-query/

W3C SKOS Primer

Useful for controlled vocabularies, broader and narrower concepts, concept schemes and mappings.

Link: https://www.w3.org/TR/skos-primer/

W3C DCAT 3

A vocabulary for describing datasets and data catalogues on the web.

Link: https://www.w3.org/TR/vocab-dcat-3/

DCAT-AP 3.0

The DCAT application profile for data portals in Europe. Useful for public-sector metadata and data-catalogue interoperability.

Link: https://semiceu.github.io/DCAT-AP/releases/3.0.0/

Open data and public-sector data literacy

data.europa.eu Academy

Free learning material and webinars on open data, data portals, reuse and data literacy in the European public-sector context.

Link: https://data.europa.eu/en/academy

Quick recap

The points below recap the most useful practice habits from this guide. Use them after each question block to make sure you are reviewing actively, not just checking your score.

Work through one topic area at a time.
After each question block, write down the distinction that decided the answer.
For plausible wrong answers, note why they were tempting but incorrect.
For calculations, write the formula, do the maths, then explain what the result means.
For action-based questions, ask what actually fixes the problem in the question.
Check whether the answer addresses the data, model, pipeline, architecture, governance or communication issue.
Focus on the data-science principle behind the answer, not just the correct option.