Data-driven policy making aims to make use of new data sources and new techniques for processing these data, and to realize new policies by involving citizens and other relevant stakeholders as data providers.
Policy making would typically like to rely on open (and free) data. In one case, the needed data are simply not collected at the level of granularity needed or targeted. In another case, the data are actually collected, but since they are an asset for those who collect and hold them, they are either not shared or shared at a (high) cost.
Clearly this is related to the notion of evidence-based policy making, which considers the inclusion of systematic research, program management experience and political judgement in the policy making process to be relevant (Head 2018). The concept of evidence-based policy making implies that the logic of intervention, impact and accountability is accepted and considered a key part of the policy process.
However, data-driven policy making stresses the importance of bringing big data and open data sources into policy making, as well as of co-creating policy by involving citizens to increase legitimacy (Bijlsma et al. 2011) and decrease citizens' distrust in government (Davies 2017). In this respect, data availability is of great importance, but data relevance even more so: large open data sources may exist (e.g. Copernicus, weather data, sensor data), yet policy making may require entirely different assets.
Policy making is conceptualized as a policy cycle, consisting of several different phases, such as agenda setting, policy design and decision making, policy implementation, monitoring and evaluation.
It should also be taken into account that the evaluation phase can be considered a continuous and horizontal activity that applies to all other stages of the policy cycle; in this sense we can speak of an E-Policy-Cycle (Höchtl et al. 2015).
What follows is a presentation of the use of big data in the different phases of the policy cycle, as well as the challenges of the policy making activities in which big data can be exploited.
Phase 1 – Agenda setting
The challenge addressed is to detect (or even predict) problems before they become too costly to face. Clearly the definition of what constitutes a "problem" to be solved has a political element, not just an analytical one (Vydra & Klievink, 2019).
One traditional problem of policy making is that data, and therefore statistics, become available only a long time after the problems have emerged, which increases the cost of solving them.
Alternative metrics and datasets can be used to identify early warning signs at an earlier stage and at lower cost, helping to better understand causal links. In this respect, there should be a clear effort to use the data held by the real Big Data owners and merchants: Google, Facebook, Amazon and Apple. These big corporations should pay their taxes in data, not just in cash. Regarding alternative metrics, the research departments at Microsoft and Google have made significant advances in this area (see Stephens-Davidowitz, available at http://sethsd.com); a sketch of how such a search-based proxy could be retrieved is given below.
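As an illustration only, the following sketch retrieves a search-interest time series that could serve as an early-warning proxy for an official indicator. It relies on pytrends, an unofficial Google Trends wrapper, and the keyword, region and timeframe are assumptions made up for the example, not prescriptions from this roadmap.

```python
# Illustrative sketch: pull a search-interest time series as a cheap, near
# real-time proxy for an official indicator (keyword/region are assumptions).
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(["unemployment benefits"], timeframe="today 5-y", geo="IT")

interest = pytrends.interest_over_time()      # weekly index, scaled 0-100
print(interest["unemployment benefits"].tail())
```

Such series remain proxies: they would still need to be validated against official statistics before informing agenda setting.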
Moreover, according to Höchtl et al. (2016, p. 159), governments can identify emergent topics early and create relevant agenda points by collecting data from social networks with high degrees of participation and by identifying citizens' policy preferences. Clearly, using data from social networks requires a large amount of data cleaning and quality checking; in that regard, dedicated discussion spaces (e.g. Opinion Space in the past) ensure better quality.
Overall, a mediation is needed between traditional objectives and boundaries on the one hand and information from big data and participative democracy on the other. Relevant for this problem are optimization methods and techniques such as linear and non-linear programming, as well as approaches for combining big data and economic planning.
Phase 2 – Policy design and decision making
Big data and data analytics solutions can be used to provide evidence for the ex ante impact assessment of policy options, by helping to predict the possible outcomes of the different options.
Big data can also help in analysing the current situation using back-casting techniques. If the data are available, researchers can analyse the already existing data and figure out where a failed policy went wrong or to what a successful one owes its success.
Furthermore, if policies could in principle be compared, counterfactual models could do the trick. However, when it comes to political decisions it is not easy to find systems expected to show exactly the same behaviour.
In this regard, Giest (2017) argues that the increased use of big data is shaping policy instruments, as "The vast amount of administrative data collected at various governmental levels and in different domains, such as tax systems, social programs, health records and the like, can — with their digitization — be used for decision-making in areas of education, economics, health and social policy".
Clearly there is strong potential to use public data for policy making, but that does not come for free. There is an obvious wealth of public sector data, but it needs to be structured and "opened" for other uses, while also considering data protection, in light of new regulatory frameworks such as the GDPR.
Phase 3 – Policy Implementation
Big data and data analytics can help identify the key stakeholders to involve in policy making or to be targeted by policies.
One way in which big data can influence the implementation stage of the policy process is the real-time production of data.
Clearly this implies that data are available and usable. Frameworks and platforms are needed for this, and there is the challenge of the local rooting of public sector bodies and their operations, which is not homogeneous across the EU or the world. Privacy and security issues must also be taken into account.
The execution of new policies immediately produces new data, which can be used to evaluate the effectiveness of policies and to improve future implementation.
Testing a new policy in real time can provide insights into whether it has the desired effect or requires modification (e.g. Yom-Tov et al. 2018). However, one has to account for an adjustment period: the effects observed immediately after a policy comes into effect might not be representative of its long-term consequences.
Furthermore, big data can be used for behavioural insights.
Phase 4 – Policy Evaluation
Big data and data analytics approaches can help detect the impact of policies at an early stage (Höchtl et al., 2016, p. 149), before formal evaluation exercises are carried out, or detect problems related to implementation, such as corruption in public spending.
In that regard, formal/structured evaluation mechanisms are complementary to (big) data analytics approaches.
Most importantly, big data can be used for the continuous evaluation of policies, to inform the policy analysis process, while also empowering and engaging citizens and stakeholders in the process (Schintler and Kulkarni 2014, p. 343). A minimal example of such a continuous, before-and-after evaluation is sketched below.
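Purely as an illustration of the kind of continuous evaluation mentioned above, the following sketch fits a simple interrupted time-series (segmented regression) model to a synthetic weekly indicator; the variable names, dates and numbers are assumptions, not part of the roadmap.

```python
# Illustrative sketch: interrupted time-series check of a policy effect on a
# synthetic weekly indicator (all names and numbers are hypothetical).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_weeks, policy_week = 104, 52
t = np.arange(n_weeks)
post = (t >= policy_week).astype(int)          # 1 after the policy starts

# Synthetic outcome: mild downward trend plus a level drop after the policy.
y = 50 - 0.05 * t - 4.0 * post + rng.normal(0, 2, n_weeks)

# Segmented regression: baseline trend, level shift, and slope change.
X = sm.add_constant(pd.DataFrame({
    "t": t,
    "post": post,
    "post_t": post * (t - policy_week),
}))
fit = sm.OLS(y, X).fit()
print(fit.params)   # the 'post' coefficient estimates the immediate level change
```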
Do you agree with these definitions? Do you want to add anything (please add comments in the definitions above)?
How can citizens participate in each of the phases above? How can co-creation for data-driven policy making be realized (please add comments in the definitions above)?
Which (technical, organizational, legal) requirements need to be met to enable the use of big data in each phase (please add comments in the definitions above)?
What are the obstacles and bottlenecks for the use of Big Data in each phase (please add comments in the definitions above)?
Key challenges of data-based policy making in which big data can be useful:
Anticipate the detection of problems before they become intractable;
Generate fruitful involvement of citizens in the policy making activity;
Make sense of thousands of opinions from citizens;
Uncover causal relationships behind policy problems;
Identify cheaper and real-time proxies for official statistics;
Identify key stakeholders to be involved in or targeted by specific policies;
Anticipate or monitor in real time the impact of policies.
Which big data methodologies can be used to cope with any of the above challenges (please add comments in the lines above)?
Development of new evaluation frameworks and tools for the assessment of the impact of policies. Such evaluation frameworks should build on a set of evaluation criteria and indicators adapted to the specific domains.
Development of new procedures and tools for the establishment of a management system integrating both financial and non-financial performance information, linked with quality data, impact measurement and other performance indicators.
Development of new tools, methodologies and regulatory frameworks to boost the participation of citizens in policy making by means of crowdsourcing and co-creation of policies, with a view to defining stances and being able to differentiate complaints from critiques.
Development of new regulations, tools and technical frameworks that ensure the absence of bias, transparency in the policy making process and the cybersecurity of IT systems in the public administration.
Development and deployment of frameworks and tools that allow the secure sharing of information and data within the public administration, as well as the interoperability of systems and databases. These frameworks include the standardization of organizational processes.
Development of specific interoperable cloud infrastructures and (re-usable and integrating) models for the management and analysis of huge volumes of data.
Development of new regulations, tools and technical frameworks that ensure respect of citizens' privacy and data ownership/security, especially in cases where personal information needs to be migrated across public administration agencies.
Development and establishment of a single reliable, secure and economically sustainable technical and IT infrastructure which would work as a backbone for all the public services developed and implemented in the public sector.
Development of information management systems and procedures for the collection, storage, sharing, standardization and classification of information pertaining to the public sector.
Development of analytical tools to understand the combined contribution of technological convergence, for instance how technologies such as AI, blockchain and IoT may be combined to offer super-additive solutions for evidence-based policy making.
Development of new analytical tools to support problem setting: the ability to fully understand the policy issue being tackled, in its entirety and in its key fundamental processes.
Do you agree with such gaps/research needs?
Do you have any other gap/research need to add?
Can you propose any solution to such gap/research need?
We define six main research clusters related to the use of Big Data in policy making. Four of them are purely technological and build on the Big Data Cycle, while two are of a more legal and organizational nature. The research clusters are the following:
Cluster 1 - Privacy, Transparency and Trust
Even more than with traditional IT architectures, Big Data requires systems for determining and maintaining data ownership, data definitions, and data flows. In fact, Big Data offers unprecedented opportunities to monitor processes that were previously invisible.
In addition, the detail and volume of the data stored raise the stakes on issues such as data privacy and data sovereignty. The output of this research cluster includes a legal framework to ensure ownership, security and privacy of the data generated by users while using systems in the public administration.
A second facet of this research cluster is transparency in the policy making process and the availability of information and data from the public administration, which is also related to the ability to collect sufficient data; this is not a given, especially when dealing with local public administrations. Concerning transparency in the policy making process, computer algorithms are widely employed throughout our economy and society to make decisions that have far-reaching impacts, including applications in education, access to credit, healthcare, and employment; their transparency is therefore of utmost importance.
On the other side, the ubiquity of algorithms in everyday life is an important reason to focus on addressing challenges associated with the design and technical aspects of algorithms and on preventing bias from the outset.
A crucial element, which has been gaining importance over the last decade, is the practice of co-creating public services and public policies with citizens and companies, which would make public services more tailored to the needs of citizens and would open the black box of the inner workings of public administration.
In the context of big data, co-creation activities take the form of citizen-science-like activities, such as data creation on the side of citizens, and of the co-creation of services in which disruptive technologies such as big data are adopted.
In that regard, following Wood-Bodley (2018), harnessing the rich and valuable insights and experience of people in non-policy roles is essential to building fit-for-purpose solutions.
An interesting research avenue that is gaining importance is the co-creation of the algorithms used in policy making, especially through serious games and simulations. Finally, the openness and availability of government data for re-use make it possible to check and scrutinize the policy making activity (e.g. the UK-oriented initiative My2050).
Cluster 2 – Public Governance Framework for Data Driven Policy Making Structures
The governance concept has gained considerable traction over the last couple of years. But what is the governance concept actually about, and how can it be applied for the present purpose? Generally, the notion of governance stands for shaping and designing areas of life by setting and managing rules in order to guide policy making and policy implementation (Lucke and Reinermann, 2002).
Core dimensions of governance are efficiency, transparency, participation and accountability (United Nations, 2007). In line with the definition of electronic governance, evidence-based and data-informed policy making in the information age applies technology in order to efficiently transform governments, their interactions with citizens and their relationships with citizens, businesses and other stakeholders, creating impact on society (Estevez and Janowski, 2013).
More concretely, digital technologies are applied to the processing of information and to decision making; the so-called smart governance approach is applicable here (Pereira et al., 2018). In this frame, governance has to focus on how to leverage data for more effective, efficient, rational, participative and transparent policy making. Although the governance discussion is not new, it remains a complex challenge in the era of digital transformation.
Cluster 3 - Data acquisition, cleaning and representativeness
Data to be used for policy making stem from a variety of sources: government administrative data, official statistics, user-generated web content (blogs, wikis, discussion forums, posts, chats, tweets, podcasts, pins, digital images, video, audio files, advertisements, etc.), search engine data, data gathered by connected people and devices (e.g. wearable technology, mobile devices, Internet of Things), tracking data (including GPS/geolocation data, traffic and other transport sensor data), and data collected through citizen science activities.
This leads to a huge amount of usable data of increasing size and resolution, spanning time series and, in most cases, not collected by means of direct elicitation from people. Concerning data quality, a common issue is the balance between random and systematic errors. Random errors in measurements are caused by unknown and unpredictable changes in the measurement process. In that regard, the unification of data so that it is editable and available for policy making is of extreme importance: cancelling noise, for instance, is challenging.
These changes may occur in the measuring instruments or in the environmental conditions. Random errors normally tend to be distributed according to a normal or Gaussian distribution; one consequence is that increasing the size of the data helps to reduce them. This is not the case for systematic errors, which are not random and therefore affect measurements in one specific direction. Such errors originate from the way the data are created, and very large datasets might even blind researchers to them; the simple simulation below illustrates the difference.
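The following toy simulation, with entirely synthetic numbers, illustrates why more data shrinks random error but leaves a systematic bias untouched.

```python
# Illustrative sketch: why more data reduces random error but not systematic
# error. All numbers are synthetic and chosen only for the example.
import numpy as np

rng = np.random.default_rng(42)
true_value = 10.0
bias = 0.5          # systematic error: every measurement is shifted by +0.5
noise_sd = 2.0      # random error: Gaussian noise around the (biased) value

for n in (100, 10_000, 1_000_000):
    measurements = true_value + bias + rng.normal(0, noise_sd, n)
    estimate = measurements.mean()
    # The random part averages out as n grows, but the bias never does.
    print(f"n={n:>9,d}  mean={estimate:6.3f}  error vs truth={estimate - true_value:+.3f}")
```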
Besides the potential presence of systematic errors, there are two more methodological aspects of big data that require careful evaluation: the issue of representativeness and the construct validity problem.
For this reason, any known limitations of the data accuracy, sources and bias should be readily available, along with recommendations about the kinds of decision making the data can and cannot support. The ideal would be a cleansing mechanism that reduces the inaccuracy of the data as far as possible, especially where it can be predicted beforehand.
Cluster 4 - Data storage, clustering, integration and fusion
This research cluster deals with information extraction from unstructured, multimodal, heterogeneous, complex or dynamic data. Heterogeneous and incomplete data must be structured in a homogeneous way prior to analysis, as most computer systems work better if multiple items are stored with an identical size and structure. At the same time, efficient representation, access and analysis of semi-structured data are necessary, because a less structured design is more useful for certain analyses and purposes.
Specifically, the large majority of big data, from the most common sources such as social media and search engine data to transactions at self-checkouts in hotels or supermarkets, are generated for other, specific purposes. They are not designed by a researcher who elicits their collection with a theoretical framework of reference and an analytical strategy already in mind. Data from social media in particular can be very challenging to clean and demand a lot of effort; what is more, they may be biased.
In this regard, repurposing data requires a good understanding of the context in which the repurposed data were generated in the first place, striking a balance between identifying the weaknesses of the repurposed data and identifying their strengths.
In synthesis, combining big data stemming from different sources, and extracting meaning from them when repurposed for another goal, requires teams that combine two types of expertise: data scientists, who can combine different datasets and apply novel statistical techniques, and domain experts, who know the history of how the data were collected and can help with the interpretation.
Clearly, a prerequisite for clustering, integration and fusion is the availability of tools and methodologies to successfully store and process big data.
Cluster 5 - Modelling and analysis with big data
Despite the recent dramatic boost of inference methods, they still crucially rely on the exploitation of prior knowledge, and how such systems could handle unanticipated knowledge remains a great challenge.
In addition, even with the currently available architectures (feed-forward and recurrent networks, topological maps, etc.) it is difficult to go much beyond a black-box approach, and the reasons for the extraordinary effectiveness of these tools are far from elucidated. Given this context, it is important to take steps towards deeper insight into the emergence of the new and its regularities.
This implies conceiving better modelling schemes, possibly data-driven, to better grasp the complexity of the challenges in front of us, aiming to gather better data rather than merely big data, and wisely blending modelling schemes. But we should also go one step further in developing tools that allow policy makers to have meaningful representations of the present situation, along with accurate simulation engines to generate and evaluate future scenarios.
Hence the need for tools allowing a realistic forecast of how a change in the current conditions will affect and modify the future scenario: in short, scenario simulators and decision support tools. In this framework it is highly important to launch new research directions aimed at developing effective infrastructures that merge the science of data with the development of highly predictive models, in order to come up with engaging and meaningful visualizations and friendly scenario simulation engines.
With regard to the development of new models, there are basically two main approaches: data modelling and simulation modelling. Data modelling is a method in which a model represents correlation relationships between one set of data and another. Simulation modelling, on the other hand, is a more classical, but more powerful, method in which a model represents causal relationships between a set of controlled inputs and the corresponding outputs. The contrast is sketched below.
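As a purely illustrative contrast, the sketch below fits a correlation (data) model to synthetic observations and, next to it, encodes a toy causal mechanism as a simulation model; the policy variable, parameters and behavioural response are invented for the example.

```python
# Minimal, illustrative contrast between data modelling and simulation
# modelling (synthetic data, hypothetical variable names).
import numpy as np

rng = np.random.default_rng(1)

# --- Data modelling: learn a correlation from observed (x, y) pairs ---------
congestion_charge = rng.uniform(0, 10, 200)                   # EUR per entry
observed_traffic = 100 - 3.2 * congestion_charge + rng.normal(0, 5, 200)
slope, intercept = np.polyfit(congestion_charge, observed_traffic, 1)
print(f"data model: traffic ~ {intercept:.1f} + {slope:.2f} * charge")

# --- Simulation modelling: encode an assumed causal mechanism ---------------
def simulate_traffic(charge: float, days: int = 30, base: float = 100.0) -> float:
    """Toy causal model: each day some drivers switch to public transport."""
    traffic = base
    for _ in range(days):
        switching = 0.02 * charge * traffic    # assumed behavioural response
        traffic -= switching
    return traffic

print(f"simulation model: traffic after 30 days at charge=5 -> {simulate_traffic(5):.1f}")
```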
Cluster 6 - Data visualization
Making sense of and extracting meaning from data can be achieved by placing them in a visual context: patterns, trends and correlations that might go undetected in text-based data can be exposed and recognized more easily with data visualization software.
This is clearly important in a policy making context, in particular when considering the problem setting phase of the policy cycle and the visualization of the results of big data modelling and analysis. Specifically, new techniques allow the automatic visualization of data in real time. Furthermore, visual analytics allows human perception and computing power to be combined in order to make visualization more interactive.
How can big data visualization and visual analytics help policy makers? First, they can generate high involvement of citizens in policy making. One of the main applications of visualization is in making sense of large datasets and identifying key variables and causal relationships in a non-technical way. Similarly, it enables non-technical users to make sense of data and interact with them.
Further, good visualization is also important in "selling" the data-driven policy making approach. Policy makers need to be convinced that data-driven policy making is sound and that its conclusions can be effectively communicated to the other stakeholders of the policy process. External stakeholders also need to be convinced to trust, or at least consider, data-driven policy making.
There should be a clear and explicit distinction between the audiences for policy visualisations: e.g. experts, decision makers and the general public. Experts analyse data, are very familiar with the problem domain and will generate draft policies or conclusions leading to policies. Decision makers may not be technical users and may not have the time to delve deep into a problem; they will listen to experts and must be able to understand the issues, make informed decisions and explain why. The public needs to understand the basics of the issue and the resulting policy in a clear manner.
A second element is that visualization helps in understanding the impact of policies: it is instrumental in making the evaluation of policy impact more effective. Finally, it helps to identify problems at an early stage, detect the "unknown unknowns" and anticipate crises: visual analytics is widely used in the business intelligence community because it exploits the human capacity to detect unexpected patterns and connections between data. A minimal visualization sketch along these lines is given below.
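By way of illustration only, the following sketch plots a synthetic policy indicator with the intervention date marked, the kind of simple, non-technical view a decision maker might need; the indicator, dates and values are invented for the example.

```python
# Illustrative sketch (synthetic data): a policy indicator over time with the
# policy start date marked, aimed at a non-technical audience.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
months = np.arange(48)
indicator = 80 - 0.2 * months - 6 * (months >= 24) + rng.normal(0, 2, 48)

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(months, indicator, marker="o", linewidth=1)
ax.axvline(24, linestyle="--", label="policy start")
ax.set_xlabel("months")
ax.set_ylabel("indicator (e.g. NO2 concentration)")
ax.set_title("Policy indicator before and after intervention (synthetic)")
ax.legend()
plt.tight_layout()
plt.show()
```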
Do you agree with this set of research clusters?
Do you want to add any other?
Do you want to merge any cluster?
Do you think that they cover the entire big data chain and/or policy cycle?
For each research cluster we defined an initial set of research challenges.
Cluster 1 – Privacy, Transparency, Ethics and Trust
Big Data nudging
Algorithmic bias and transparency
Open Government Data
Manipulation of statements and misinformation
Cluster 2 – Public Governance Framework for Data Driven Policy Making Structures
Forming of societal and political will
Stakeholder/Data-producer-oriented Governance approaches
Governance administrative levels and jurisdictional silos
Education and personnel development in data sciences
Cluster 3 – Data acquisition, cleaning and representativeness
Real time big data collection and production
Quality assessment, data cleaning and formatting
Representativeness of data collected
Cluster 4 – Data storage, clustering, integration and fusion
Big Data storage and processing
Identification of patterns, trends and relevant observables
Extraction of relevant information and feature extraction
Cluster 5 – Modelling and analysis with big data
Identification of suitable modelling schemes inferred from existing data
Collaborative model simulations and scenarios generation
Integration and re-use of modelling schemes
Cluster 6 – Data visualization
Automated visualization of dynamic data in real time
Interactive data visualization
What follows is a brief explanation of the research challenges.
Research Challenge 1.1 - Big Data nudging
Following Misuraca (2018), nudging has long been recognized as a powerful tool to achieve policy goals by inducing changes in citizens' behaviour, while at the same time presenting risks in terms of respect of individual freedom. Nudging can help governments, for instance, reduce carbon emissions by changing how citizens commute, using data from public and private sources. But it is not clear to what extent governments can use these methods without infringing citizens' freedom of choice, and it is possible to imagine a wide array of malevolent applications by governments with a more pliable definition of human rights.
The recent case of Cambridge Analytica acts as a powerful reminder of the threats deriving from the combination of big data with behavioural science. Both the benefits and the risks are multiplied by the combination of nudging with big data analytics, which becomes a mode of design-based regulation built on algorithmic decision-guidance techniques. When nudging can exploit thousands of data points on any individual, based on data held by governments but also coming from private sources, the effectiveness of such measures, for good and for bad, is exponentially higher.
Unlike static nudges, Big Data analytic nudges (also called hypernudging) are extremely powerful due to their continuously updated, dynamic and pervasive nature, working through algorithmic analysis of data streams from multiple sources and offering predictive insights into the habits, preferences and interests of targeted individuals.
In this respect, as pointed out by Yeung (2016), by "highlighting correlations between data items that would not otherwise be observable, these techniques are being used to shape the informational choice context in which individual decision-making occurs, with the aim of channelling attention and decision-making in directions preferred by the 'choice architect'". These techniques therefore constitute a 'soft' form of design-based control, and the definition of the scope, limitations and safeguards (both technological and not) needed to ensure that fundamental policy goals are achieved while basic human rights are respected remains uncharted territory.
Relevance and applications in policy making
Behavioural change is today a fundamental policy tool across all policy priorities. The great challenges of our time, from climate change to increased inequality to healthy living, can only be addressed by the concerted effort of all stakeholders.
But in the present context of declining trust in public institutions and recent awareness of the risks of big data for individual freedoms, any intervention towards greater usage of personal data should be treated with enormous care, and appropriate safeguards should be developed.
Notwithstanding the big role of the GDPR, the trust factor is not yet well understood. While there are a number of studies on trust, and several trust models exist that explain trust relations and enable empirical research on the level of trust, this research does not yet include the study of trust in big data applications and the impact this may have on human behaviour.
In this regard, there is a need to assess the power and legitimacy of hypernudging to feed real-time policy modelling, to inform changes in institutional settings and governance mechanisms, to understand how to address key societal challenges by exploiting the potential of digital technologies and their impact on institutions and on individual and collective behaviours, as well as to anticipate emerging risks and new threats deriving from digital transformation and from changes in governance and society.
Technologies, tools and methodologies
This research challenge stems from the combination of machine learning algorithms and behavioural science. Machine learning algorithms can be modelled to find patterns in very large datasets. These algorithms consolidate information and adapt to become increasingly sophisticated and accurate, allowing them to learn automatically without being explicitly programmed.
At the same time, potential safeguards involve transparency tools to ensure adequate consent by the citizens involved in such initiatives, as well as mechanisms to evaluate algorithms for potential downsides.
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Research Challenge 1.2 - Algorithmic bias and transparency
Many decisions are today automated and performed by algorithms. Predictive algorithms have been used for many years in public services, whether for predicting risks of hospital admission or recidivism in criminal justice. Newer ones could predict exam results or job outcomes, or help regulators predict patterns of infraction. It is useful to be able to make violence risk assessments when a call comes into the police, or to make risk assessments of buildings. Health is already being transformed by much better detection of illness, for example in blood or eye tests.
Algorithms are designed by humans and increasingly learn by observing human behaviour through data; therefore they tend to adopt the biases of their developers and of society as a whole. As such, algorithmic decision making can reinforce the prejudice and the bias of the data it is fed with, ultimately compromising basic human rights such as fair process. Bias is typically not written in the code, but developed through machine learning based on data.
For this reason, bias is particularly difficult to detect, and this can be done only through ex post auditing and simulation rather than ex ante analysis of the code. There is a need for common practices and tools for controlling data quality, bias and transparency in algorithms. Furthermore, as required by the GDPR, there is a need for ways to explain machine decisions in a human-readable format.
Furthermore, the risk of manipulation of data should be considered as well, which may lead to ethical misconduct. A minimal example of such an ex post audit is sketched below.
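As a purely illustrative ex post audit of logged decisions, the sketch below compares positive-decision rates across two demographic groups (a demographic-parity check); the column names and records are invented for the example.

```python
# Hypothetical ex post audit sketch: compare an algorithm's positive-decision
# rates across demographic groups on logged real-life cases.
import pandas as pd

decisions = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B", "B", "A", "B", "A"],
    "approved": [1,    1,   0,   0,   0,   1,   0,   1,   0,   1],
})

# Demographic parity check: positive rate per group and the gap between them.
rates = decisions.groupby("group")["approved"].mean()
print(rates)
print(f"parity gap: {rates.max() - rates.min():.2f}")  # large gaps warrant deeper auditing
```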
Relevance and applications in policy making
Algorithms are increasingly used to take policy decisions that are potentially life-changing, and therefore they must be transparent and accountable. The GDPR sets out a clear framework for consent and transparency. Transparency is required for both data and algorithms, but bias is difficult to detect in the algorithm itself, and ultimately it is only through the assessment of real-life cases that discrimination can be detected.
Technologies, tools and methodologies
The main relevant methodologies are algorithm co-creation, regulatory technologies, auditability of algorithms, online experiments, data management processing algorithms and data quality governance approaches.
Regarding governance, the ACM U.S. Public Policy Council (USACM) released a statement and a list of seven principles aimed at addressing the potential harmful bias of algorithmic solutions: awareness, access and redress, accountability, explanation, data provenance, auditability, and validation and testing.
Further, Geoff Mulgan from NESTA has developed a set of guidelines according to which governments can better keep up with fast-changing industries. Similarly, Eddie Copeland from NESTA has developed a "Code of Standards for Public Sector Algorithmic Decision Making".
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Research Challenge 1.3 - Open Government Data
Open Data are defined as data that are accessible at minimal or no cost, without limitations as to user identity or intent. This means that the data should be available online in a digital, machine-readable format. Specifically, the notion of Open Government Data concerns all the information that governmental bodies produce, collect or pay for. This could include geographical data, statistics, meteorological data, data from publicly funded research projects, and traffic and health data.
In this respect, the definition of Open Public Data is applicable when the data can be readily and easily consulted and re-used by anyone with access to a computer. In the European Commission's view, 'readily accessible' means much more than the mere absence of a restriction of access to the public.
Data openness has resulted in some applications in the commercial field, but by far the most relevant applications are created in the context of government data repositories.
With regard to linked data in particular, most research is being undertaken in other application domains, such as medicine. Government is starting to play a leading role in the move towards a web of data. However, current research in the field of open and linked data for government is limited. This is all the more true if we take into account Big Data fed by automatically collected databases.
An important aspect is the risk of personal data being included in open government data, or of personal data being retrieved from the combination of open data sets.
Relevance and applications in policy making
Clearly, opening government data can help in displaying the full economic and social impact of information, and in creating services based on all the information available. Other core benefits for the policy making process include the promotion of transparency concerning the destination and use of public expenditure, improvement in the quality of policy making (which becomes more evidence-based), increased collaboration across government bodies as well as between government and citizens, greater citizen awareness of specific issues and of government policies, and the promotion of accountability of public officials.
Nevertheless, transparency does not directly imply accountability. "A government can be an open government, in the sense of being transparent, even if it does not embrace new technology. And a government can provide open data on politically neutral topics even as it remains deeply opaque and unaccountable." (Robinson & Yu, 2012).
Technologies, tools and methodologies
An interesting topic of research is the integration of open government data, participatory sensing and sentiment analysis, as well as the visualization of real-time, high-quality, reusable open government data. Other avenues of research include the provision of quality, cost-effective, reliable preservation of and access to the data, as well as the protection of property rights, privacy and security of sensitive data.
Inspiring cases include the Open Government Initiative carried out by the Obama Administration to promote government transparency on a global scale, and Data.gov, a platform which increases the ability of the public to easily find, download and use datasets generated and held by the US Federal Government. Building on Data.gov, the US and India have developed an open-source version called the Open Government Platform (OGPL), which can be downloaded and evaluated by any national government or state or local entity as a path toward making their data open and transparent. A minimal sketch of programmatic access to such a catalogue is given below.
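For illustration, the sketch below queries an open-data catalogue programmatically, assuming the portal exposes the standard CKAN Action API (as catalog.data.gov does at the time of writing); the search term is an arbitrary example and field names may differ across portals.

```python
# Minimal sketch of programmatic access to an open-data catalogue via the CKAN
# Action API; endpoint paths and result fields may vary between portals.
import requests

resp = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "air quality", "rows": 5},
    timeout=30,
)
resp.raise_for_status()
results = resp.json()["result"]["results"]

for dataset in results:
    # Each dataset lists downloadable resources (CSV, JSON, APIs, ...).
    formats = {r.get("format", "?") for r in dataset.get("resources", [])}
    print(dataset["title"], "-", ", ".join(sorted(formats)))
```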
Research Challenge 1.4 – Manipulation of statements and misinformation
Clearly, the transparency of policy making and overall trust can be negatively affected by fake news, disinformation and misinformation in general. In a general sense, disinformation can be defined as false information that is purposely spread to deceive people, while misinformation deals with false or misleading information (Lazer et al., 2018), including the bias that is inherent in news produced by humans with human biases. Lazer et al. (2018, p. 1094) define the most recent phenomenon, fake news, as 'fabricated information that mimics news media content in form but not in organizational process or intent'.
This is hardly a modern issue: what changes in the era of big data is the velocity at which fake news and false information spread through social media.
Another example related to big data technologies, and one that will become even more crucial in the future, is that of deepfakes (a portmanteau of "deep learning" and "fake"): an artificial-intelligence-based human image synthesis technique used to combine and superimpose existing images and videos onto source images or videos.
Relevance and applications in policy making
Fake news and misinformation lead to the erosion of trust in public institutions and traditional media sources, and in turn favour the electoral success of populist or anti-establishment parties. In fact, as discussed in Allcott and Gentzkow (2017) and Guess et al. (2018), Trump voters were more likely to be exposed to, and to believe, misinformation. In the Italian context, Il Sole 24 Ore found that the consumption of fake news appears to be linked with populism, and the content of the overwhelming majority of pieces of misinformation also displays an obvious anti-establishment bias, as found in Giglietto et al. (2018).
In the 2016 US presidential election, news articles were created and spread that favoured or attacked one of the two main candidates, Hillary Clinton and Donald Trump, in order to steer public opinion towards one candidate or the other.
Furthermore, the outcome of the Brexit referendum is another example of how fake news steered public opinion towards beliefs that are hardly founded on evidence, e.g. the claim that the UK was sending £350m a week to the EU and that this money could be used to fund the NHS instead.
Technologies, tools and methodologies
In the short term, raising awareness regarding fake news can be an important first step. For instance, the capability to judge whether a source is reliable, or to triangulate different data sources, is crucial in this regard. Furthermore, educating people about the capabilities of AI algorithms would be a good measure to prevent bad uses of applications like FakeApp from having widespread impact.
Regarding technologies to counter fake news, NLP can help to classify text into fake and legitimate instances. In fact, NLP can be used for deception detection in text, and fake news articles can be considered deceptive text (Chen et al., 2015; Feng et al., 2012; Pérez-Rosas and Mihalcea, 2015). More recently, deep learning has taken over where large-scale training data is available. For text classification, feature-based models, recurrent neural network (RNN) models, convolutional neural network (CNN) models and attention models have been competing (Le and Mikolov, 2014; Zhang et al., 2015; Yang et al., 2016; Conneau et al., 2017; Medvedeva et al., 2017).
Clearly, all leading machine learning techniques for text classification, including feature-based and neural network models, are heavily data-driven, and therefore require quality training data based on a sufficiently diverse and carefully labelled set of legitimate and fake news articles. A minimal feature-based sketch is given below.
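Purely as an illustration of the feature-based approach, the sketch below trains a TF-IDF plus logistic regression classifier on a tiny invented corpus; a real system would require a large, carefully labelled and diverse dataset.

```python
# Minimal, illustrative feature-based text classification sketch for fake news
# detection (TF-IDF + logistic regression). The inline corpus is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Official statistics show unemployment fell by 0.3% last quarter",
    "Ministry publishes annual report on air quality in major cities",
    "Miracle cure hidden by doctors: this fruit eliminates all diseases",
    "Secret plan revealed: government to abolish all taxes next week",
]
labels = ["legitimate", "legitimate", "fake", "fake"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["Shocking secret remedy doctors don't want you to know"]))
```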
Regarding deepfakes, another possibility is to make use of blockchain technologies, in which every record is replicated on multiple computers and tied to a pair of public and private encryption keys. In this way, the person or institution holding the private key, rather than the computers storing the data, is the true owner of the data. Furthermore, blockchains are far less exposed to the security threats that can affect centralized data stores. As an example, individuals could make use of a blockchain to digitally sign and confirm the authenticity of a video or audio file: the more digital signatures a document accumulates, the higher the likelihood that it is authentic. The signing step is sketched below.
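As an illustration of the signing step only, the sketch below hashes a media file and signs the digest with an Ed25519 key pair using the cryptography library; anchoring the signature on a blockchain and collecting multiple endorsements are outside its scope, and the file name is hypothetical.

```python
# Hypothetical sketch: hash a media file and sign the digest with Ed25519.
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def file_digest(path: str) -> bytes:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()

private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

digest = file_digest("interview.mp4")      # illustrative file name
signature = private_key.sign(digest)

# Verification raises InvalidSignature if the file or signature was altered.
public_key.verify(signature, digest)
print("signature verified for digest", digest.hex()[:16], "...")
```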
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Do you want to add any other research challenge?
Research Challenge 2.1 - Forming and monitoring of societal and political will
Many efforts have been undertaken by European governments to establish data platforms, and the current development of the open data movement certainly contributes to data-driven decisions in the public sector. But is the status quo sufficient, and what is needed to leverage data for advanced data-based decision support in the public sector? Legislative and political objectives are often neither clear nor discussed in advance. As a result, a huge amount of data is certainly available, but not the right datasets to assess specific political problems. In that sense, governance structures and frameworks, such as outcome- and target-oriented approaches, are needed in order to make the right data available and, furthermore, to interpret these data bearing in mind societal and legislative goals (Schmeling et al.).
Relevance and applications in policy making
Objectives in the public sector can be multifarious, since they are aimed at the common good and not primarily at profit maximisation. Therefore, shared targets have the potential to carry common policies and legislative intentions into public organisations on both a horizontal and a vertical level (James and Nakamura, 2015).
Technologies, tools and methodologies
Research is needed to investigate how political and societal will can be operationalized, in order to be able to design monitoring systems and performance measurement systems based not simply on financial information but rather on outcome- and performance-oriented indicators.
An interesting case is the TNO Policy Lab for the co-creation of data-driven policy making. The Policy Lab is a methodology for conducting controlled experiments with new data sources and new technologies for creating data-driven policies: policy makers experiment with new policies in a safe environment and then scale up. The Policy Lab approach has three pillars: (1) the use of new data sources, such as sensor data, and technological developments for policy development; (2) a multidisciplinary approach, including data science, legal expertise, domain knowledge, etc.; and (3) involving citizens and other stakeholders ('co-creation') and carefully weighing different values.
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Research Challenge 2.2 - Stakeholder/Data-producer-oriented Governance approaches
To enhance evidence-based decisions in policy making, data must be gathered from different sources and stakeholders, including company data, citizens' data, third-sector data and public administrations' data. Every stakeholder group requires different approaches to providing and exchanging data. These approaches must consider political, administrative, legal, societal, management and ICT-related conditions.
As a plurality of independent stakeholder groups is involved in the fragmented process of data collection, the governance mode cannot be based on a hierarchical structure. Thus, the network governance approach relies instead on negotiation-based interactions, which are well placed to aggregate the information, knowledge and assessments that can help qualify political decisions (Sørensen and Torfing, 2007).
The public administration is, by its origin, an important advisor of the political system and should not be underestimated in this context, since the administration owns meaningful data which should be considered thoroughly in political decision making. In addition, the roles and responsibilities of public administrations as data providers must be discussed and clarified.
If specific company data, such as traffic data from navigation device providers or social media data from social network providers, are necessary to assess political questions, guidance and governance models for purchasing or exchanging these data are needed.
For all the aforementioned cases, IT standards and IT architecture frameworks are required for processing data stored in the different infrastructures constituting so-called data spaces (Cuno et al., 2019).
In this regard, an important role is played by massive interconnection, i.e. a massive number of objects/things/sensors/devices connected through the information and communications infrastructure to provide value-added services, in particular in the context of smart city initiatives. The unprecedented availability of data raises obvious concerns for data protection, but also stretches the applicability of traditional safeguards such as informed consent and anonymization (see Kokkinakos et al. 2016).
Data gathered through sensors and other IoT devices are typically collected transparently, without the user noticing, and therefore limit the possibility for informed consent such as the all-too-familiar "accept" button on websites. Secondly, the sheer amount of data makes anonymization and pseudonymization more difficult, as most personal data can easily be de-anonymized. Advanced techniques such as multiparty computation and homomorphic encryption remain too resource-intensive for large-scale deployment.
We need robust, modular, scalable anonymization algorithms that guarantee anonymity by adapting to the input (additional datasets) and to the output (purpose of use), adopting a risk-based approach. Additionally, it is important to ensure adequate forms of consent management across organizations and symmetric transparency, allowing citizens to see how their data are being used, by whom and for what purpose.
Clearly, the options are sometimes limited, as in the case of geo-positioning, which is needed in order to use the services provided; in this case the user essentially pays with their data to use the services. A minimal anonymity check of the kind mentioned above is sketched below.
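As a toy illustration of the anonymization concern, the sketch below runs a basic k-anonymity check on quasi-identifiers before a dataset would be shared; the column names, records and threshold are invented for the example.

```python
# Illustrative sketch (hypothetical data): a basic k-anonymity check on
# quasi-identifiers before a dataset is shared across administrations.
import pandas as pd

records = pd.DataFrame({
    "age_band":  ["30-39", "30-39", "30-39", "40-49", "40-49", "40-49"],
    "postcode":  ["101**", "101**", "101**", "102**", "102**", "103**"],
    "diagnosis": ["A", "B", "A", "C", "A", "B"],   # sensitive attribute
})

quasi_identifiers = ["age_band", "postcode"]
k = 3

group_sizes = records.groupby(quasi_identifiers).size()
violations = group_sizes[group_sizes < k]

if violations.empty:
    print(f"dataset satisfies {k}-anonymity on {quasi_identifiers}")
else:
    # These groups are too small and would need generalization or suppression.
    print("groups violating k-anonymity:\n", violations)
```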
Relevance and applications in policy making
Big data offer the potential for public administrations to obtain valuable insights from the large amounts of data collected through various sources, and the IoT allows the integration of sensors, radio-frequency identification and Bluetooth in the real-world environment using highly networked services.
The trend towards personalized services only increases the strategic importance of personal data, but simultaneously highlights the urgency of identifying workable solutions.
On the other hand, when it comes to the once-only principle, bureaucracy and intra-organisational interoperability are far more critical.
Technologies, tools and methodologies
Several tools are being developed in this area today. Blockchain can provide authentication for machine-to-machine transactions: the "blockchain of things". More specifically, inadequate data security and trust in the current IoT are seriously limiting its adoption. Blockchain, a distributed and tamper-resistant ledger, maintains consistent records of data at different locations and has the potential to address data security concerns in IoT networks (Reyna et al. 2018).
Anonymization algorithms and secure multiparty mining algorithms over distributed datasets make it possible to guarantee anonymity even when additional datasets are analysed, and to partition data mining over different parties (Selva Rathna and Karthikeyan 2015). A toy illustration of a tamper-evident, hash-chained ledger is given below.
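The following toy sketch (not a production blockchain, and not drawn from Reyna et al.) shows the core idea of a hash-chained, tamper-evident log for IoT readings: altering a past record breaks verification.

```python
# Toy sketch: a hash-chained, tamper-evident log for IoT readings.
import hashlib
import json

def block_hash(block: dict) -> str:
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

chain = []
prev = "0" * 64
for reading in [{"sensor": "air-01", "pm25": 12.3}, {"sensor": "air-01", "pm25": 14.1}]:
    block = {"data": reading, "prev_hash": prev}
    prev = block_hash(block)
    chain.append(block)

def verify(chain: list) -> bool:
    """Recompute each hash and compare it with the next block's pointer."""
    prev = "0" * 64
    for block in chain:
        if block["prev_hash"] != prev:
            return False
        prev = block_hash(block)
    return True

print(verify(chain))           # True
chain[0]["data"]["pm25"] = 99  # tamper with an old reading
print(verify(chain))           # False: the chain no longer verifies
```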
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Research Challenge 2.3 - Governance administrative levels and jurisdictional silos
Decisions in the political environment often face trans-boundary problems across different administrative levels and different jurisdictions. Thus, the data collection needed to understand these problems and to investigate possible solutions encounters manifold barriers and constraints, which have to be overcome through modern governance approaches and models.
Like the aforementioned stakeholder network of data providers, a data network has to be coordinated at the meta-level, and the respective rules and access rights have to be established and ICT-enabled through data connectors or controlled harvesting methods.
This is becoming increasingly urgent, as government holds massive and rapidly growing amounts of data that are dramatically underexploited. The achievement of the once-only principle, as well as the opportunities of big data, only adds to the urgency.
Interoperability of government data, the issue of data centralization versus federation, and data protection remain challenges to be dealt with. New solutions are needed that balance the need for data integration with the safeguards on data protection, the demand for data centralization with the need to respect each administration's autonomy, and the requirement for ex ante homogenization with more pragmatic, on-demand approaches based on the "data lake" paradigm. All this needs to take place at the European level, to ensure the achievement of the goals of the Tallinn declaration.
Appropriate, modular data access and interoperability are further complicated by the need to include private data sources as providers and users of government data, at the appropriate level of granularity. Last but not least, this needs to work with full transparency and the full consent of citizens, ideally enabling them to track in real time who is accessing their personal data and for what purposes.
Relevance and applications in policy making
Data integration has long been a priority for public administration, but with the new European Interoperability Framework and the objective of the once-only principle it has become an unavoidable one. Data integration and integrity are the basic building blocks for ensuring sufficient data quality for decision makers, both when dealing with strategic policy decisions and when dealing with day-to-day decisions in case management.
Technologies, tools and methodologies
New interfaces are needed within which the single administrations can communicate and share data and APIs in a free and open way, allowing the creation of new and previously unthinkable services and data applications realized on the basis of citizens' needs.
As an example, the Data & Analytics Framework (DAF) by the Italian Digital Team aims to develop and simplify the interoperability of public data between public administrations, standardize and promote the dissemination of open data, and optimize data analysis processes and generate knowledge.
Another interesting example is X-Road, an infrastructure which allows Estonia's various public- and private-sector e-service information systems to link up. The infrastructure is currently also implemented in Finland, Kyrgyzstan, Namibia, the Faroe Islands, Iceland, and Ukraine. A minimal sketch of what such an open, API-based data-sharing interface could look like is given below.
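Purely as an illustration, and not modelled on DAF or X-Road, the sketch below exposes a small dataset through an open HTTP API using FastAPI; the endpoint, city names and readings are invented.

```python
# Minimal, hypothetical sketch of an administration exposing a dataset through
# an open API for re-use by other agencies; all names and data are illustrative.
from fastapi import FastAPI, HTTPException

app = FastAPI(title="Municipal open data API (sketch)")

# In a real deployment this would be backed by the administration's databases.
AIR_QUALITY = {
    "milan": [{"date": "2024-01-01", "pm25": 21.4}],
    "turin": [{"date": "2024-01-01", "pm25": 18.9}],
}

@app.get("/v1/air-quality/{city}")
def air_quality(city: str):
    """Return air-quality readings for a city."""
    readings = AIR_QUALITY.get(city.lower())
    if readings is None:
        raise HTTPException(status_code=404, detail="city not found")
    return {"city": city.lower(), "readings": readings}

# Run locally with: uvicorn this_module:app --reload
```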
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Research Challenge 2.4 - Education and personnel development in data sciences
Governance plays also an important role on all questions of education and personnel development in order to ensure that the right capabilities are available in terms of data literacy, data management and interpretation. The need to develop these skills has to be managed and governed as a basis to design HR strategies, trainings and employee developments. Relevance and applications in policy makingsentence permalink
Governance in personnel development promotes effective and efficient fulfillment of public duties like evidence based policymaking.sentence permalink
This is all the more true when taking into account the use of Big Data in policy making, as clearly the skills and competence of civil servants are very important for the implementation of reforms and take up of data strategies and solutions.sentence permalink
Technologies, tools and methodologies
This research challenge focuses on standards that make the assessment criteria of education policies transparent, incentives that motivate specific types of behaviour, information in the form of clear definitions of outputs and outcomes, and accountability mechanisms to verify that the agreed outputs and outcomes are delivered (Lewis and Pettersson, 2009).
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Do you want to add any other research challenge?
Research Challenge 3.1 - Real time big data collection and production
The rapid development of the Internet and web technologies allows ordinary users to generate vast amounts of data about their daily lives. In the Internet of Things (IoT), the number of connected devices has grown exponentially, and each of these produces real-time or near real-time streaming data about our physical world. In the IoT paradigm, an enormous number of networked sensors are embedded into various devices and machines in the real world. Such sensors, deployed in different fields, may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistics data. Mobile equipment, transportation facilities, public facilities, and home appliances can all be data acquisition equipment in the IoT.
Furthermore, social media analytics deals with collecting data from social media websites such as Facebook, Twitter, YouTube and WhatsApp, and from blogs. Social media analytics can be categorized under big data because the data generated by social websites comes in huge volumes, so efficient tools and algorithms are required for analysing it. Data collected include user-generated content (tweets, posts, photos, videos), digital footprints (IP addresses, preferences, cookies), mobility data (GPS data), biometric information (fingerprints, fitness tracker data), and consumption behaviour (credit cards, supermarket fidelity cards).
Relevance and applications in policy making
The collection of such amounts of data in real time can support the up-to-date evaluation of policies, the monitoring of the effects of policy implementation, the collection of data that can be used for agenda setting (for instance traffic data), the analysis of the sentiment and behaviour of citizens, and the monitoring and evaluation of government social media communication and engagement.
Technologies, tools and methodologies
For collecting data from devices, the obvious choice is Internet of Things technologies. Regarding social media, many collection and analytics tools are readily available; these tools do not stop at data collection but also help in analysing the data collected. Examples of tools and technologies are online sentiment analysis and data mining, APIs, data crawling, and data scraping.
What is interesting about such tools is the emergence of automated technologies that can collect, clean, store and analyse large volumes of data at high velocity. Indeed, in some instances, social media has the potential to generate population-level data in near real time.
Methodologies used to produce analyses from social media data include regression modelling, GIS, correlation and ANOVA, network analysis, semantic analysis, pseudo-experiments, and ethnographic observation. A related application is the openlaws project (https://info.openlaws.com/eu-project/), dealing with Big Open Legal Data.
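As an illustration of the collection-plus-analysis pipeline described above, the following minimal Python sketch fetches citizen posts from a hypothetical JSON endpoint and applies a toy lexicon-based sentiment score; the URL, field names and word lists are assumptions, and a real deployment would rely on the platform's official API and a proper sentiment model.

```python
# A minimal sketch of social media collection plus sentiment scoring.
# The endpoint and field names are hypothetical placeholders.
import requests

POSITIVE = {"good", "great", "support", "improve"}
NEGATIVE = {"bad", "delay", "unsafe", "pollution"}

def fetch_posts(url):
    # Assumes the (hypothetical) endpoint returns JSON like [{"text": "..."}, ...]
    return [item["text"] for item in requests.get(url, timeout=10).json()]

def sentiment_score(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

posts = fetch_posts("https://example.org/api/posts?topic=traffic")  # placeholder URL
scores = [sentiment_score(p) for p in posts]
print("average sentiment:", sum(scores) / max(len(scores), 1))
```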
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Research Challenge 3.2 - Quality assessment, data cleaning and formatting
Big Data quality assessment is an important phase integrated within data pre-processing. It is the phase in which the data is prepared according to the user or application requirements. When the data is well defined with a schema, or in a tabular format, its quality evaluation becomes easier, as the data description helps to map the attributes to quality dimensions and to set the quality requirements as a baseline for assessing the quality metrics.
After the assessment of data quality, it is time for data cleaning. This is the process of correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly with databases, the term refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting this dirty or coarse data.
This research challenge also deals with formatting: once sets of data have been downloaded, it is not at all obvious that their format will be suitable for further analysis and integration into existing platforms. Another important factor is metadata, which is important for the transparency and completeness of information.
Relevance and applications in policy making
Apart from systematic errors in data collection, it is important to assess the extent to which the data are of sufficient quality, and to amend them where necessary, because policy decisions have to be founded on quality data and therefore have to be reliable.
More data does not necessarily mean good or better data, and much of the data available lacks the quality required for its safe use in many applications, especially when dealing with data coming from social networks and the Internet of Things.
Technologies, tools and methodologies
Regarding data quality, it is necessary to use existing frameworks and develop new ones covering big data quality dimensions, quality characteristics, and quality indexes. Regarding data cleaning, the need to overcome this hurdle is driving the development of technologies that can automate data cleansing processes and help accelerate analytics.
Considering frameworks for quality assessment, the UNECE Big Data Quality Task Team released in 2014 the Framework for the Quality of Big Data within the scope of the UNECE/HLG project "The Role of Big Data in the Modernisation of Statistical Production" (UNECE 2014). The framework provides a structured view of quality at three phases of the business process: Input (acquisition and analysis of the data); Throughput (transformation, manipulation and analysis of the data); and Output (the reporting of quality for statistical outputs derived from big data sources).
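A minimal sketch of what input-phase quality checks and basic cleaning might look like in practice is given below, using pandas; the column names and plausibility thresholds are illustrative assumptions, not part of the UNECE framework itself.

```python
# A minimal sketch: simple quality indicators plus basic cleaning on tabular data.
# Columns (station_id, timestamp, pm10) and thresholds are assumptions.
import pandas as pd

df = pd.read_csv("sensor_readings.csv")  # assumed input file

# Quality indicators
completeness = 1 - df["pm10"].isna().mean()       # share of non-missing values
validity = df["pm10"].between(0, 1000).mean()     # share of values in a plausible range
print(f"completeness={completeness:.2%}, validity={validity:.2%}")

# Cleaning: drop duplicates, remove implausible values, standardise timestamps
df = df.drop_duplicates()
df = df[df["pm10"].between(0, 1000)]
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df = df.dropna(subset=["timestamp"])
```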
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Research Challenge 3.3 - Representativeness of data collected
A key concern with many Big Data sources is the selectivity (or, conversely, the representativeness) of the dataset. A dataset that is highly unrepresentative may nonetheless be usable for some purposes but inadequate for others. Related to this issue is whether it is possible to calibrate the dataset or perform external validity checks using reference datasets. Selectivity indicators developed for survey data can usually be used to measure how the information available in the Big Data source differs from the information for the in-scope population.
For example, we can compare how in-scope units included in the Big Data differ from in-scope units missing from it. To assess the difference, it is useful to consider covariates, i.e. variables containing information that allows the "profile" of the units to be determined (for example, geographic location, size, age, etc.) in order to create domains of interest. It is within these domains that comparisons should be made for the "outcome" or study variables of interest (for example, energy consumption, hours worked, etc.). Note that the covariates chosen to create the domains should be related to the study variables being compared.
Regarding social media, research has identified a set of challenges with implications for the validity and reliability of the data collected. First, users of social media are not representative of populations (Ruths & Pfeffer, 2014). As such, biases will exist and it may be difficult to generalise findings to the broader population. Furthermore, social media data is seldom created for research purposes, and finally it is difficult to infer how reflective a user's online behaviour is of their offline behaviour without information on them from other sources (Social Media Research Group 2016).
Relevance and applications in policy making
Clearly big data representativeness is crucial to policy making, especially when studying certain characteristics of the population and when analysing its sentiment. It is of course also important when targeting certain subgroups.
In this regard, large datasets may not represent the underlying population of interest, and the sheer size of a dataset clearly does not imply that population parameters can be estimated without bias.
Technologies, tools and methodologies
An appropriate sampling design has to be applied in order to ensure the representativeness of the data and to limit the original bias where present. Probability sampling methodologies include simple random sampling, stratified sampling, cluster sampling, multistage sampling, and systematic sampling. An interesting research area is survey data integration, which aims to combine information from two independent surveys of the same target population.
Kim et al. (2016) propose a new method of survey data integration using fractional imputation, and Park et al. (2017) use a measurement error model to combine information from two independent surveys. Further, Kim and Wang (2018) propose two methods for reducing the selection bias associated with big data samples. Finally, Tufekci (2014) provides a set of practical steps aimed at mitigating the issue of representativeness, including: targeting non-social dependent variables, establishing baseline panels to study people's behaviour, and using multidisciplinary teams and multi-method/multi-platform analysis.
Big Data can also be combined with 'traditional' datasets to improve representativeness (Vaitla 2014).
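As a small illustration of how an unrepresentative sample can be reweighted against known population totals, the sketch below applies simple post-stratification in Python; the age-group shares and sample are invented for illustration, and this is only one of the corrective methods cited above.

```python
# A minimal post-stratification sketch: reweight an unrepresentative sample so that
# age-group shares match known population shares before estimating an outcome.
# All numbers are illustrative assumptions.
import pandas as pd

sample = pd.DataFrame({
    "age_group": ["18-34", "18-34", "35-54", "55+"],
    "supports_policy": [1, 0, 1, 0],
})

population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}  # from official statistics

sample_share = sample["age_group"].value_counts(normalize=True)
sample["weight"] = sample["age_group"].map(lambda g: population_share[g] / sample_share[g])

unweighted = sample["supports_policy"].mean()
weighted = (sample["supports_policy"] * sample["weight"]).sum() / sample["weight"].sum()
print(f"unweighted={unweighted:.2f}, post-stratified={weighted:.2f}")
```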
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Do you want to add any other research challenge?
Research Challenge 4.1 – Big Data Storage
Obviously a prerequisite for the clustering, integration and fusion of big data is the presence of efficient mechanisms for data storage and processing. Big data storage technologies are a key enabler for advanced analytics that have the potential to transform society and the way key decisions are made, including policy decisions. One of the first things organizations have to manage when dealing with big data is where and how the data will be stored once it is acquired. The traditional methods of structured data storage and retrieval include relational databases and data warehouses.
Relevance and applications in policy making
Clearly the data acquired by the public administration, to be subsequently used for analytics, modelling and visualization, need to be stored efficiently and safely. In this regard, it is important to understand the encryption and migration needs, the privacy requirements, as well as the procedures for backup and disaster recovery.
Furthermore, big data storage and processing technologies are able to produce information that can enhance different public services.
Technologies, tools and methodologies
This research topic has been developing rapidly in recent years, delivering new types of massive data storage and processing products, e.g. NoSQL databases. Building on the advances of cloud computing, the technology market is very developed in this area (for an overview, see Sharma, 2016). Crowdsourcing also plays an important role, and in the light of climate change and environmental issues, energy-efficient data storage methods are also a crucial research priority (Strohbach et al. 2016). Furthermore, to automate complex tasks and make them scalable, hybrid human-algorithmic data curation approaches have to be further developed (Freitas and Curry 2016).
More specifically, the most important technologies are: distributed file systems such as the Hadoop Distributed File System (HDFS), NoSQL and NewSQL databases, and big data querying platforms.
Interesting tools, on the other hand, are Cassandra, HBase (George, 2011), MongoDB, CouchDB, Voldemort, DynamoDB, and Redis.
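As a minimal illustration of the document-store approach, the sketch below writes and queries heterogeneous, schema-less records with MongoDB via pymongo; the connection string, database name and field names are assumptions chosen for the example.

```python
# A minimal sketch of storing and querying semi-structured records in a NoSQL
# document store (MongoDB via pymongo); names and values are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["policy_data"]["sensor_readings"]

# Insert heterogeneous, schema-less records as they arrive
collection.insert_many([
    {"station": "A1", "pm10": 41.2, "ts": "2024-03-01T10:00:00Z"},
    {"station": "B7", "no2": 18.5, "ts": "2024-03-01T10:00:00Z", "note": "new sensor"},
])

# Query only the records relevant for a given analysis
for doc in collection.find({"pm10": {"$gt": 40}}):
    print(doc["station"], doc["pm10"])
```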
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Research Challenge 4.2 - Identification of patterns, trends and relevant observables
This research challenge deals with technologies and methodologies allowing businesses and policy makers to identify patterns and trends, in both structured and unstructured data, that may not have been previously visible.
Relevance and applications in policy making
Clearly, the possibility to extract patterns and trends from data can give the policy maker an early view of emerging issues, which are then used to develop the policy agenda. An interesting application is anomaly detection, which is most commonly used in fraud detection. For example, anomaly detection can identify suspicious activity in a database and trigger a response; there is usually some level of machine learning involved in this case.
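A minimal sketch of such an anomaly-detection step is shown below, using scikit-learn's IsolationForest on synthetic transaction-like data; the data and thresholds are illustrative only.

```python
# A minimal anomaly-detection sketch: flag unusual records (e.g. suspicious
# transactions) with an Isolation Forest. The data is synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=100, scale=10, size=(500, 2))   # typical transactions
outliers = rng.normal(loc=300, scale=5, size=(5, 2))    # a few anomalous ones
X = np.vstack([normal, outliers])

model = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = model.predict(X)            # -1 marks anomalies, 1 marks normal points
print("flagged records:", int((flags == -1).sum()))
```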
Technologies, tools and methodologies
One of the most used Big Data methodologies for the identification of patterns and trends is data mining: a combination of database management, statistics and machine learning methods useful for extracting patterns from large datasets.
Some examples include mining human resources data in order to assess employee characteristics, or consumer bundle analysis to model the behaviour of customers.
It also has to be taken into account that most Big Data is unstructured and contains a huge quantity of text. In this regard, text mining is another technique that can be adopted to identify trends and patterns.
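To make the text-mining idea concrete, the sketch below extracts recurring themes from a handful of invented citizen comments using TF-IDF and non-negative matrix factorisation in scikit-learn; it is a toy illustration rather than a production pipeline.

```python
# A minimal text-mining sketch: surface recurring themes in free-text comments
# with TF-IDF plus non-negative matrix factorisation. The comments are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

comments = [
    "bus delays every morning on line four",
    "more bike lanes needed in the city centre",
    "air quality near the ring road is getting worse",
    "bus line four overcrowded, please add services",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(comments)

nmf = NMF(n_components=2, random_state=0).fit(X)
terms = tfidf.get_feature_names_out()
for i, topic in enumerate(nmf.components_):
    top = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"theme {i}: {', '.join(top)}")
```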
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Research Challenge 4.3 - Extraction of relevant information and feature extraction
This challenge concerns summarizing data and extracting meaning in order to provide a near real-time analysis of the data. Some analyses require the data to be structured in a homogeneous way beforehand, as algorithms, unlike humans, are not able to grasp nuance. Furthermore, most computer systems work better if multiple items are stored with an identical size and structure.
Nevertheless, efficient representation, access and analysis of semi-structured data is also necessary, because a less structured design is more useful for certain analyses and purposes. Even after cleaning and error correction in the database, some errors and incompleteness will remain, challenging the precision of the analysis.
Relevance and applications in policy making
While information and feature extraction may appear far from the policy process, it is a fundamental requirement for ensuring the veracity of the information obtained and for reducing the effort in the following phases, ensuring the widest reuse of the data for purposes different from the one for which it was originally gathered. The data have to be adapted according to the use and analysis they are destined for, and this step is moreover needed as data preparation for visualization.
Technologies, tools and methodologies
Relevant methodologies include Bayesian techniques for meaning extraction; extraction and integration of knowledge from massive, complex, multi-modal, or dynamic data; data mining; scalable machine learning; and principal component analysis. Tools include NoSQL databases, Hadoop, deep learning, RapidMiner, KNIME, R, Python, and sensor data processing (fog and edge computing).
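As a small example of feature extraction, the sketch below applies principal component analysis with scikit-learn to reduce a synthetic set of correlated indicators to a few components; the data and dimensions are invented for illustration.

```python
# A minimal feature-extraction sketch: principal component analysis reduces many
# indicators to a few components before further modelling. Data is synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 12))             # e.g. 12 socio-economic indicators per region

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(X_scaled)

X_reduced = pca.transform(X_scaled)        # 3 extracted features per region
print("explained variance:", pca.explained_variance_ratio_.round(2))
```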
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Do you want to add any other research challenge?
Research Challenge 5.1 - Identification of suitable modelling schemes inferred from existing data
The traditional way of modelling starts with a hypothesis about how a system acts; data is then collected to represent the stimulus. Traditionally, the amount of data collected was small, since it rarely existed already and had to be generated with surveys, or perhaps imputed through analogies. Finally, statistical methods establish enough causality to arrive at a sufficiently truthful representation of the system.
Deductive models are therefore forward-running: they end up representing a system not observed before. On the other hand, with the current huge availability of data, it is possible to identify and create new modelling schemes that build on existing data.
These are inductive models, which start by observing a system already in place, one that is putting out data as a by-product of its operation. In this respect, the real challenge is to be able to identify and validate, from existing data, models that are valid and suitable to cope with complexity and unanticipated knowledge.
Model validation is composed of two main phases. The first phase is conceptual model validation, i.e. determining that the theories and assumptions underlying the conceptual model are correct. The second phase is computerised model verification, which ensures that the computer programming and implementation of the conceptual model are correct.
Relevance and applications in policy making
There are several aspects related to the identification and validation of modelling schemes that are important in policy making. The first deals with the reliability of models: policy makers use simulation results to develop effective policies that have an important impact on citizens, public administration and other stakeholders. Identification and validation are fundamental to guarantee that the output of analysis for policy makers is reliable.
Another aspect is the acceleration of the policy modelling process: policy models must be developed in a timely manner and at minimum cost in order to support policy makers efficiently and effectively. Model identification and validation are both costly and time-consuming, and if automated and accelerated they can lead to a general acceleration of the policy modelling process.
Technologies, tools and methodologies
In current practice, the most frequently used approach is a decision by the development team based on the results of the various tests and evaluations conducted as part of the model development process. Another approach is to engage users in the choice and validation process. At any rate, conducting model validation concurrently with the development of the simulation model enables the model development team to receive input earlier, at each stage of model development.
Therefore, ICT tools for speeding up, automating and integrating the model validation process into the policy model development process are necessary to guarantee the validity of models with an effective use of resources. It finally has to be noted that model validation is not a discrete step in the simulation process: it needs to be applied continuously, from the formulation of the problem to the implementation of the study findings, as a completely validated and verified model does not exist.
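A minimal sketch of the kind of automated check that can run continuously alongside model development is shown below: a simple model is fitted on one part of the data and its predictions are compared with held-out observations; the data, model and error metric are illustrative assumptions, not a prescribed validation procedure.

```python
# A minimal validation sketch: fit a simple model on earlier observations, compare
# predictions with a held-out period, and report an error metric for the validation
# decision. Data and model are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 2))                      # e.g. two policy-relevant drivers
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=120)

X_train, X_test = X[:100], X[100:]                 # hold out the most recent observations
y_train, y_test = y[:100], y[100:]

model = LinearRegression().fit(X_train, y_train)
error = mean_absolute_error(y_test, model.predict(X_test))
print(f"hold-out mean absolute error: {error:.2f}")  # fed back into the validation decision
```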
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Research Challenge 5.2 - Collaborative model simulations and scenarios generation
This methodology encompasses the participation of all stakeholders in the policy-making process through the implementation of online, easy-to-use tools for all skill levels. Decision-making processes have to be supported with meaningful representations of the present situation along with accurate simulation engines to generate and evaluate future scenarios.
Instrumental to all this is the possibility to gather and analyse huge amounts of relevant data and to visualize them in a meaningful way, also for an audience without technical or scientific expertise. Citizens should also be able to probe and collect data in real time to feed the simulation engines, and/or to contribute by means of an online platform.
Understanding the present through data is often not enough, and the impact of specific decisions and solutions can be correctly assessed only when projected into the future. Hence the need for tools allowing a realistic forecast of how a change in the current conditions will affect and modify future scenarios: in short, scenario simulators and decision support tools.
In this framework it is highly important to launch new research directions aimed at developing effective infrastructures that merge the science of data with the development of highly predictive models, so as to come up with engaging and meaningful visualizations and friendly scenario simulation engines.
The weakest form of involvement is feedback to the session facilitator, similar to the conventional way of modelling. Stronger forms are proposals for changes or (partial) model proposals. In this particular approach the modelling process should be supported by a combination of narrative scenarios, modelling rules, and e-Participation tools (all integrated via an ICT e-Governance platform), so that the policy model for a given domain can be created iteratively through the cooperation of several stakeholder groups (decision makers, analysts, companies, civil society, and the general public).
Relevance and applications in policy making
Clearly the collaboration of several individuals in simulation and scenario generation allows policies and their impact to be better understood by non-specialists and even by citizens, ensuring higher acceptance and take-up. Furthermore, as citizens have the possibility to intervene in the elaboration of policies, user centricity is achieved.
On the other hand, modelling co-creation has other advantages as well: typically no single person understands all requirements, and understanding tends to be distributed across a number of individuals; a group is better capable of pointing out shortcomings than an individual; and individuals who participate during analysis and design are more likely to cooperate during implementation.
Technologies, tools and methodologies
CityChrone++ is one of the instantiations of a larger platform dubbed the what-if machine (whatif.caslparis.com), aimed at providing users with tools to assess the status of our urban and inter-urban spaces and to conceive new solutions and new scenarios. The platform integrates flexible data analysis tools with a simple scenario simulation platform in the area of urban accessibility, with a focus on human mobility. In this framework, it will be important to pair the platform with effective modelling schemes, key for the generation and the assessment of new scenarios.
United Nations Global Policy Model (GPM): a tool for the investigation of policy scenarios for the world economy. The model is intended to trace historical developments and potential future impacts of trends, shocks, policy initiatives and responses over short, medium and long-term timescales, with a view to providing new insights into problems of policy design and coordination. Recently, the model has been applied to the assessment of possible policy scenarios and their implications for the world economy in a post-Brexit setting.
The European Central Bank New Area-Wide Model (NAWM): a dynamic stochastic general equilibrium model reproducing the dynamic effects of changes in monetary policy interest rates observed in identified vector autoregression models (VARs). Its building blocks are: agents (e.g. households and firms), real and nominal frictions (e.g. habit formation, adjustment costs), financial frictions (domestic and external risk premium), and a rest-of-the-world block (SVAR). It is estimated on time series for 18 key macro variables employing Bayesian inference methods. The model is regularly used for counterfactual policy analysis.
TELL ME Model: a prototype agent-based model, developed within the scope of the European-funded TELL ME project, intended to be used by health communicators to understand the potential effects of different communication plans under various influenza epidemic scenarios. The model is built on two main building blocks: a behaviour model that simulates the way in which people respond to communication and decide whether to vaccinate or adopt other protective behaviour, and an epidemic model that simulates the spread of influenza.
Households and Practices in Energy use Scenarios (HOPES): an agent-based model aimed at exploring the dynamics of energy use in households. The model has two types of agents: households and practices. Elements (meanings, materials and skills) are entities in the model. The model concept is that households choose different elements to perform practices depending on the socio-technical settings unique to each household. The model is used to test different policy and innovation scenarios to explore the impacts of the performance of practices on energy use.
Global epidemic and mobility model (GLEAM): a big data and high performance computing model combining real-world data on populations and human mobility with elaborate stochastic models of disease transmission, in order to model the spread of an influenza-like disease around the globe and to test intervention strategies that could minimize the impact of potentially devastating epidemics. An interesting application case is the quantification of the risk of local Zika virus transmission in the continental US during the 2015-2016 ZIKV epidemic.
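To illustrate, in a very reduced form, the kind of scenario comparison that epidemic models such as TELL ME or GLEAM support, the sketch below runs a toy discrete-time SIR model under a baseline and an intervention scenario; all parameters are invented, and the real models are far richer (agents, mobility networks, stochasticity).

```python
# A minimal SIR scenario-comparison sketch (not GLEAM or TELL ME): compare the epidemic
# peak with and without an intervention that lowers the transmission rate.
def sir(beta, gamma=0.1, days=120, n=1_000_000, i0=100):
    s, i, r = n - i0, i0, 0
    peak = i
    for _ in range(days):
        new_inf = beta * s * i / n      # new infections this day
        new_rec = gamma * i             # new recoveries this day
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        peak = max(peak, i)
    return peak, r

for label, beta in [("baseline", 0.30), ("with intervention", 0.18)]:
    peak, total = sir(beta)
    print(f"{label}: peak infected ~ {peak:,.0f}, total recovered ~ {total:,.0f}")
```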
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Research Challenge 5.3 - Integration and re-use of modelling schemes
This research challenge seeks to find ways to model a system by using already existing models or by composing more comprehensive models from smaller building blocks, either by reusing existing objects/models or by building them from scratch. Therefore, the most important issue is the definition/identification of proper (or most apt) modelling standards, procedures and methodologies, by using existing ones or by defining new ones.
Further to that, the present sub-challenge calls for establishing the formal mechanisms by which models might be integrated in order to build bigger models or simply to exchange data and valuable information between models. Finally, the issues of model interoperability and the availability of interoperable modelling environments should be tackled, together with the need for feedback-rich models that are transparent and easy for the public and decision makers to understand.
Relevance and applications in policy making
In systems analysis, it is common to deal with the complexity of an entire system by considering it to consist of interrelated sub-systems. This leads naturally to considering models as consisting of sub-models. Such a (conceptual) model can be implemented as a computer model that consists of a number of connected component models (or modules). Component-oriented designs represent a natural choice for building scalable, robust, large-scale applications, and for maximizing the ease of maintenance in a variety of domains.
An implementation based on component models has at least two major advantages. First, new models can be constructed by coupling existing component models of known and guaranteed quality with new component models, which has the potential to increase the speed of development. Secondly, the forecasting capabilities of several different component models can be compared, rather than comparing whole simulation systems as the only option. Further, common and frequently used functionalities, such as numerical integration services, visualization and statistical ex-post analysis tools, can be implemented as generic tools, developed once and for all, and easily shared by model developers.
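The sketch below illustrates the component-coupling idea in miniature: two toy component models share the same small interface and are chained by a generic runner, so either can be swapped or compared independently; the model content and names are purely illustrative.

```python
# A minimal component-coupling sketch: independently developed component models expose
# the same step() interface and are composed by a generic runner. Content is illustrative.
class TrafficModel:
    def step(self, state):
        # toy component: congestion evolves with demand
        state["congestion"] = 0.8 * state.get("congestion", 0.0) + 0.2 * state["demand"]
        return state

class EmissionsModel:
    def step(self, state):
        # toy component: emissions derived from the congestion computed upstream
        state["emissions"] = 1.5 * state["congestion"]
        return state

def run_coupled(components, state, steps=10):
    for _ in range(steps):
        for component in components:
            state = component.step(state)
    return state

print(run_coupled([TrafficModel(), EmissionsModel()], {"demand": 1.0}))
```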
Technologies, tools and methodologies
The CEF BDTI building block provides virtual environments built on a mix of mature open source and off-the-shelf tools and technologies. The building block can be used to experiment with big data sources and models, test concepts and develop pilot projects on big data in a virtual environment. Each of these environments is based on a template that supports one or more use cases. These templates can be deployed, launched and managed as separate software environments.
Specifically, the Big Data Test Infrastructure will provide a set of data and analytics services, from infrastructure and tools to stakeholder onboarding services, allowing European public organisations to experiment with Big Data technologies and move towards data-driven decision making. Applications of the BDTI include descriptive analysis, social media analysis, time-series analysis, predictive analysis, network analysis, and text analysis.
In practice, the BDTI allows public organizations to experiment with big data sources, methods and tools; launch pilot projects on big data and data analytics through a selection of software tools; acquire support and access best practices and methodologies on big data; and share data sources across policy domains and organisations.
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Do you want to add any other research challenge?
Research Challenge 6.1 - Automated visualization of dynamic data in real time
Due to continuing advances in sensor technology and the increasing availability of digital infrastructure for the acquisition, transfer, and storage of big data sets, large amounts of data become available even in real time.
Since most analysis and visualization methods focus on static data sets, adding a dynamic component to the data source results in major challenges for both automated and visual analysis methods. Besides typical technical challenges such as unpredictable data volumes, unexpected data features and unforeseen extreme values, a major challenge is the capability of analysis methods to work incrementally.
Furthermore, the scalability of visualization in the face of big data availability is a permanent challenge, since visualization requires additional performance with respect to traditional analytics in order to allow real-time interaction and reduce latency.
Finally, visualization is largely a demand- and design-driven research area. In this sense, one of the main challenges is to ensure multidisciplinary collaboration between engineering, statistics, computer science and graphic design.
Relevance and applications in policy making
Visualization of dynamic data in real time allows policy makers to react in a timely manner to the issues they face. An example is given by movement data (e.g., road, naval, or air traffic), enabling analysis in several application fields (e.g., landscape planning and design, urban development, and infrastructure planning).
In this regard, it helps to identify problems at an early stage, detect the "unknown unknowns" and anticipate crises: visual analytics of real-time data are, for instance, largely used in the intelligence community because they help exploit the human capacity to detect unexpected patterns and connections between data.
Technologies, tools and methodologies
Methodologies for bringing out meaningful patterns include data mining, machine learning, and statistical methods. Tools for the management and automated analysis of data streams include: CViz Cluster visualisation, IBM ILOG visualisation, Survey Visualizer, Infoscope, Sentinel Visualizer, Grapheur2.0, InstantAtlas, Miner3D, VisuMap, Drillet, Eaagle, GraphInsight, Gsharp, Tableau, Sisense, and SAS Visual Analytics. Apart from acquiring and storing the data, great emphasis must be given to the analytics and DSS algorithms that will be used.
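One concrete ingredient of incremental analysis for real-time dashboards is sketched below: Welford's algorithm updates the mean and variance of a stream one observation at a time, so the visualisation layer never has to reprocess the full history; the simulated stream stands in for real sensor data.

```python
# A minimal sketch of incremental (streaming) statistics for real-time visualisation:
# Welford's algorithm maintains mean and variance without storing past observations.
import random

class RunningStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for _ in range(10_000):                 # stand-in for an incoming sensor stream
    stats.update(random.gauss(50, 5))
print(f"mean={stats.mean:.2f}, variance={stats.variance:.2f}")
```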
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Research Challenge 6.2 - Interactive data visualization
With the advent of Big Data, simulations and models grow in size and complexity, and the process of analysing and visualising the resulting large amounts of data therefore becomes an increasingly difficult task. Traditionally, visualisations were performed as post-processing steps after an analysis or simulation had been completed. As simulations increased in size, this task became increasingly difficult, often requiring significant computation, high-performance machines, high-capacity storage, and high-bandwidth networks.
In this regard, there is a need for emerging technologies that address this problem by "closing the loop" and providing a mechanism for integrating modelling, simulation, data analysis and visualisation. This integration allows a researcher to interactively perform data analysis while avoiding many of the pitfalls associated with the traditional batch/post-processing cycle. It also plays a crucial role in making the analysis process more extensive and, at the same time, comprehensible.
Relevance and applications in policy making
Policy makers should be able to independently visualize the results of analysis. In this respect, one of the main benefits of interactive data visualization is to generate high involvement of citizens in policy making.
One of the main applications of visualization is making sense of large datasets and identifying key variables and causal relationships in a non-technical way; similarly, it enables non-technical users to make sense of data and interact with them. It also helps in understanding the impact of policies: interactive visualization is instrumental in making the evaluation of policy impact more effective.
Technologies, tools and methodologies
Visualisation tools are still largely designed for analysts and are not accessible to non-experts. Intuitive interfaces and devices are needed to interact with data results through clear visualisations and meaningful representations. User acceptability is a challenge in this sense, requiring clear comparisons with previous systems to assess adequacy.
Furthermore, a good visual analytics system has to combine the advantages of automatic analysis with interactive techniques to explore data. Behind this desired technical feature lies the deeper aim of integrating the analytic capability of a computer with the abilities of the human analyst. In this regard, an interesting case is given by the project BigDataOcean.
An interesting approach would be to look into two, or even three, tiers of visualisation tools for different types of users: experts and analysts; decision makers (who are usually not technical experts but must understand the results, make informed decisions and communicate their rationale); and the general public. Visualisation for the general public will support buy-in for the resulting policies as well as for the practice of data-driven policy making in general.
Tools available on the market include the imMens system, the BigVis package for R, Nanocubes, MapD, D3.js, AnyChart, and ScalaR, which all use various database techniques to provide fast queries for interactive exploration.
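As a minimal, non-expert-friendly example in Python (one alternative to the JavaScript tools listed above), the sketch below builds an interactive chart with Plotly Express; the bundled gapminder dataset is used purely as a stand-in for policy data.

```python
# A minimal interactive-visualisation sketch with Plotly Express: zoom, pan and hover
# tooltips come for free, which suits non-technical users. The dataset is a stand-in.
import plotly.express as px

df = px.data.gapminder().query("year == 2007")
fig = px.scatter(
    df, x="gdpPercap", y="lifeExp", size="pop", color="continent",
    hover_name="country", log_x=True,
    title="Life expectancy vs. GDP per capita (2007)",
)
fig.show()  # opens an interactive chart in the browser or notebook
```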
Do you agree with the research challenge (please comment above in line)?
Can you suggest any application case, tool, methodology (please comment above in line)?
Do you want to add any other research challenge?