Data-Driven Best Practice Identification
A scientific knowledge extraction methodology for identifying, synthesizing, and curating best practices at the intersection of Artificial Intelligence and Circular Economy from comprehensive scientific databases.
Project Context
This platform implements Work Package 2 of the AI-InnoScEnCe project, which comprises two interconnected tasks: Data-driven Best-Practice Identification (Task 2.1) and Best-Practice Curation (Task 2.2). The objective is to systematically analyze research at the intersection of artificial intelligence and circular economy, extracting and curating actionable best practices from scientific literature to support higher education institutions (HEIs) and individual researchers across Europe.
Research Objectives
- Comprehensive coverage through scientific database integration (OpenAlex, Lens.org)
- Systematic filtering of the global knowledge base for AI applications in circular economy
- Knowledge synthesis through semantic analysis and topic-based clustering
- Interactive map visualization enabling HEIs to locate best practices by research focus and proximity
- Multimodal content curation for diverse learning preferences and accessibility
Work Package 2: Tasks & Deliverables
Work Package 2 establishes a data-driven approach to identifying and curating best practices in AI for Circular Economy, delivered through a digital sharing platform accessible to all participating HEIs.
Task 2.1: Data-driven Identification
M1-M6 (March 2025 - September 2025)
Leveraging comprehensive scientific databases such as OpenAlex and SciSciNet, the project filters the global knowledge base in natural sciences and engineering to identify research that effectively uses AI for circular economy purposes.
Task 2.2: Best-Practice Curation
M4-M9 (July 2025 - December 2025)
Identified best practices are curated in accessible and engaging formats. So that researchers can fully benefit from them, they are made available in multiple formats, including audio/video summaries, blog posts, and newsletters.
Best-Research-Practice Sharing Platform
A digital best-research-practice sharing platform is established and deployed for each participating HEI project partner. The platform provides unified access to identified best practices, enabling researchers to discover relevant research based on their specific focus areas and geographic proximity.
Project Deliverables
Work Package 2 produces two major deliverables that together provide a comprehensive best-practice resource for all participating HEI project partners.
Best-Research Practices Identified
M9 (December 2025) • Lead: TUHH
A documented repository containing the reproducible data and analysis pipeline, together with an interactive map and a set of best-practice research papers relevant to each participating HEI project partner.
- Version-controlled Python codebase with documented dependencies and configuration
- Geographic visualization enabling location-based discovery of relevant practices
- Best practices filtered and categorized for each partner institution's research focus
Best-Research Practices Curated
M9 (December 2025) • Lead: USC
A technical guide for producing best research practices in different consumable formats, enabling multimodal content creation for diverse learning preferences.
- Technical guide for producing multimedia practice overviews
- Templates and workflows for accessible written content
- Periodic digest formats for ongoing practice dissemination
Methodology
The methodology employs a topic-driven knowledge synthesis approach that combines semantic document embeddings, predefined topic taxonomies, and language model-powered text generation to extract structured best practices from scientific literature.
Core Approach
Unlike unsupervised clustering methods that may produce semantically incoherent groupings, this approach uses a predefined circular economy taxonomy to ensure interpretable and domain-relevant categorization. The methodology consists of four key innovations:
- Predefined topic taxonomy: Documents are matched to 16 expert-defined circular economy topics using semantic keyword analysis, ensuring consistent categorization aligned with established CE frameworks
- Anchor-context synthesis model: Within each topic, high-impact documents are selected as "anchors" and combined with semantically similar supporting documents to provide context for synthesis
- Diversity-aware sampling: Anchor selection uses embedding-based diversity sampling to ensure coverage of different approaches within each topic
- Multi-dimensional impact scoring: Each best practice is assessed across four dimensions (citation impact, innovation impact, commercial viability, replication potential)
Processing Pipeline
Figure 1: High-level overview of the best practice extraction pipeline
Data Sources & Processing
The platform leverages comprehensive scientific databases to filter the global knowledge base in natural sciences and engineering, identifying research that effectively applies AI methodologies to circular economy challenges. This multi-source approach ensures broad coverage of both academic publications and technological innovations.
Global Knowledge Base Coverage
The integrated database approach provides access to over 250 million scholarly works, enabling systematic identification of AI+CE research across disciplines. Documents are filtered using semantic matching against AI methodology keywords and circular economy topic taxonomies.
OpenAlex
Primary source for scholarly metadata and author affiliations, providing comprehensive coverage equivalent to SciSciNet for bibliometric analysis.
- 250M+ scholarly works with rich metadata
- Author institutional affiliations and locations
- Geographic coordinates for research mapping
- Citation networks and collaboration data
- Open access status and funding information
Lens.org
Unified platform for scholarly articles and patent documents with cross-references between research and innovation.
- Scholarly articles with full abstracts and citations
- Patent documents with claims and jurisdictions
- Cross-references between papers and patents
- Patent family and legal status information
- Applicant and inventor metadata
Knowledge Base Filtering
The filtering process systematically narrows the global corpus to identify research at the AI-CE intersection:
- AI methodology detection: Documents are scanned for AI/ML method keywords across six primary categories (machine learning, computer vision, NLP, optimization, prediction/forecasting, robotics)
- CE topic matching: Semantic matching against 16 predefined circular economy topics ensures domain relevance
- Quality filtering: Citation metrics, publication venue, and completeness criteria ensure high-quality source documents
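The three filtering steps above can be sketched as a single predicate. This is a minimal illustration, not the project's implementation: the keyword lists, the `passes_filter` name, the record fields, and the citation floor are all assumptions, and plain substring matching stands in for the semantic matching the pipeline is described as using.

```python
# Hypothetical keyword lists; the project's real lists are not given in this document.
AI_KEYWORDS = {
    "machine learning": ["machine learning", "random forest", "neural network"],
    "computer vision": ["object detection", "image segmentation"],
    "optimization": ["genetic algorithm", "linear programming"],
}
CE_KEYWORDS = ["recycling", "remanufacturing", "waste sorting", "circular economy"]


def find_ai_categories(text: str) -> list[str]:
    """Return the AI categories whose keywords occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [cat for cat, kws in AI_KEYWORDS.items() if any(kw in lowered for kw in kws)]


def passes_filter(doc: dict, min_citations: int = 1) -> bool:
    """Keep a document only if it is complete, mentions AI and CE terms,
    and meets an (assumed) citation floor."""
    if not doc.get("title") or not doc.get("abstract"):
        return False  # completeness criterion: title and abstract must be present
    text = f"{doc['title']} {doc['abstract']}"
    has_ce_term = any(kw in text.lower() for kw in CE_KEYWORDS)
    has_ai_term = bool(find_ai_categories(text))
    return has_ai_term and has_ce_term and doc.get("citations", 0) >= min_citations
```

A document with a missing abstract is rejected before any keyword check, which keeps incomplete records out of the later embedding step.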
Semantic Embedding
Documents are transformed into 768-dimensional semantic representations using SPECTER2, a transformer-based embedding model trained specifically on scientific documents. SPECTER2 captures semantic meaning beyond keyword matching, enabling:
- Computation of semantic similarity between documents
- Identification of supporting documents for context enrichment
- Diversity-aware sampling to ensure representative anchor selection
Embeddings are generated from the concatenation of document titles and abstracts, with results cached for computational efficiency across pipeline runs.
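The caching behaviour described above might look like the following sketch. The embedding model itself is deliberately left out: `embed_fn` is a hypothetical callable standing in for SPECTER2, and the function names, the space-joined input format, and the in-memory dictionary cache are assumptions for illustration only.

```python
import hashlib
from typing import Callable


def embedding_input(title: str, abstract: str) -> str:
    """Concatenate title and abstract into the text that is embedded."""
    return f"{title.strip()} {abstract.strip()}".strip()


def cache_key(text: str) -> str:
    """Stable key derived from the input text, so unchanged documents hit the cache."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def get_embedding(doc: dict, embed_fn: Callable[[str], list], cache: dict) -> list:
    """Return the cached embedding for a document, computing it only on a cache miss."""
    text = embedding_input(doc["title"], doc.get("abstract", ""))
    key = cache_key(text)
    if key not in cache:
        cache[key] = embed_fn(text)  # the expensive model call happens only here
    return cache[key]
```

Keying the cache on a hash of the input text rather than on a document ID means an edited abstract is re-embedded automatically on the next run.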
Circular Economy Topic Taxonomy
The platform uses a predefined taxonomy of 16 circular economy topics, each defined with associated keywords for semantic matching. This taxonomy-driven approach ensures consistent, interpretable categorization aligned with established circular economy frameworks and enables meaningful comparison across research areas.
Topic Matching Algorithm
Documents are assigned to topics using a weighted keyword matching algorithm that considers both direct keyword matches (higher weight) and semantic overlap with topic descriptions (lower weight). Each document is assigned to the topic with the highest matching score.
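As a rough sketch of this scoring scheme, the example below weights direct keyword hits more heavily than overlap with the topic description. The two topics, their keywords, and both weights are hypothetical (the real taxonomy has 16 topics), and simple word overlap is used here as a stand-in for the semantic overlap the algorithm is described as computing.

```python
import re

# Hypothetical two-topic excerpt; names, keywords, and descriptions are illustrative.
TOPICS = {
    "Plastic Recycling": {
        "keywords": ["plastic recycling", "polymer sorting"],
        "description": "Recovery and reprocessing of plastic waste into secondary materials",
    },
    "Remanufacturing": {
        "keywords": ["remanufacturing", "refurbishment"],
        "description": "Restoring used products and components to like-new condition",
    },
}
KEYWORD_WEIGHT = 2.0      # assumed weight for a direct keyword match
DESCRIPTION_WEIGHT = 0.5  # assumed weight per word shared with the topic description


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def topic_score(text: str, topic: dict) -> float:
    """Weighted sum of keyword hits and description word overlap."""
    lowered = text.lower()
    keyword_hits = sum(1 for kw in topic["keywords"] if kw in lowered)
    overlap = len(tokenize(text) & tokenize(topic["description"]))
    return KEYWORD_WEIGHT * keyword_hits + DESCRIPTION_WEIGHT * overlap


def assign_topic(text: str) -> str:
    """Assign the document to the topic with the highest score."""
    return max(TOPICS, key=lambda name: topic_score(text, TOPICS[name]))
```

In a fuller version, common words such as "of" and "and" would be excluded from the overlap count so they do not contribute to the score.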
Best Practice Synthesis
The synthesis process distills knowledge from topic-grouped documents into structured best practices using an anchor-context model with language model-powered text generation.
Synthesis Process
High-Impact Filtering
Within each topic cluster, documents are ranked by citation count and other impact metrics. The top 30% are retained as candidates for anchor selection.
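A minimal sketch of this cut-off follows, assuming ranking by citation count alone (the other impact metrics are not specified here) and rounding the 30% share up so that small topics keep at least one candidate. The function name and record fields are hypothetical.

```python
import math


def select_high_impact(docs: list[dict], top_percent: int = 30) -> list[dict]:
    """Return the most-cited share of a topic's documents, highest first."""
    if not docs:
        return []
    ranked = sorted(docs, key=lambda d: d["citations"], reverse=True)
    # Round up so that a topic with only a few documents still yields a candidate.
    keep = max(1, math.ceil(len(ranked) * top_percent / 100))
    return ranked[:keep]
```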
Diversity Sampling
From high-impact candidates, "anchor" documents are selected using embedding-based diversity sampling to ensure coverage of different approaches within the topic. Up to 5 anchors are selected per topic.
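The exact sampling algorithm is not specified here. One common way to realize embedding-based diversity sampling is greedy farthest-point (max-min) selection, sketched below under that assumption: starting from the first candidate, it repeatedly adds the candidate whose nearest already-selected anchor is farthest away in cosine distance. The function names are illustrative, and embeddings are assumed to be non-zero vectors.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def select_diverse_anchors(embeddings: list[list[float]], max_anchors: int = 5) -> list[int]:
    """Greedy max-min selection. Returns indices into `embeddings`.

    Index 0 seeds the selection, so candidates should be ordered by impact.
    """
    if not embeddings:
        return []
    selected = [0]
    while len(selected) < min(max_anchors, len(embeddings)):
        best_idx, best_dist = -1, -1.0
        for i, emb in enumerate(embeddings):
            if i in selected:
                continue
            # Distance from this candidate to its closest already-selected anchor.
            nearest = min(cosine_distance(emb, embeddings[j]) for j in selected)
            if nearest > best_dist:
                best_idx, best_dist = i, nearest
        selected.append(best_idx)
    return selected
```

With this scheme, a near-duplicate of an existing anchor is picked last, which is the behaviour that spreads anchors across different approaches within a topic.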
Context Collection
For each anchor, supporting documents are identified by semantic similarity, requiring a cosine similarity of at least 0.70 between the document and anchor embeddings. Up to 5 supporting documents provide additional context for synthesis.
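The selection rule can be sketched as follows, assuming embeddings are plain lists of floats and candidates are keyed by a document ID. The function names and data layout are hypothetical; only the 0.70 threshold and the limit of 5 come from the description above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def collect_context(anchor: list[float], candidates: dict[str, list[float]],
                    threshold: float = 0.70, max_docs: int = 5) -> list[str]:
    """Return IDs of the most similar candidates at or above the threshold."""
    scored = [(cosine_similarity(anchor, emb), doc_id) for doc_id, emb in candidates.items()]
    eligible = sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
    return [doc_id for _, doc_id in eligible[:max_docs]]
```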
Knowledge Synthesis
A language model generates structured titles and descriptions from the anchor and supporting documents. Titles follow the format: "[AI Method] for [Specific CE Application]". Descriptions emphasize the circular economy problem, AI/ML solution, and CE outcomes.
Synthesis Quality
Generated content is designed to follow academic writing conventions:
- Preferred: "Convolutional Neural Networks for Real-Time Polymer Identification in Material Recovery Facilities"
- Avoided: "Smart AI-Powered Circular Economy Solutions"
Generic terms such as "innovative," "cutting-edge," and "smart" are explicitly avoided to maintain scientific precision.
Best-Practice Curation (Task 2.2)
To ensure that researchers can fully benefit from the identified best practices, Task 2.2 focuses on curating these findings in accessible and engaging formats. This multimodal approach caters to diverse learning preferences and enhances the accessibility of scientific information for the broader research community.
Multimodal Content Formats
Audio/Video Summaries
Concise multimedia overviews of key best practices, suitable for:
- Podcast-style research summaries
- Short video explainers
- Webinar presentations
- Accessible audio descriptions
Blog Posts
Written content for broader dissemination:
- Detailed practice explanations
- Implementation case studies
- Comparative analyses
- Topic-focused deep dives
Newsletters
Regular updates for ongoing engagement:
- Monthly practice highlights
- New additions announcements
- Topic trend summaries
- Partner research features
Technical Guide for Content Production
Deliverable D2.2 provides a comprehensive technical guide enabling content creators to transform structured best practice data into various output formats.
Structured Data for Content Generation
Each best practice includes structured metadata that supports multimodal content creation:
- LLM-synthesized content
- Classified techniques
- Domain categorization
- Multi-dimensional metrics
- Full provenance
- Location metadata
- Maturity assessment
- Topic descriptors
Learning Preference Accommodation
The multimodal curation approach recognizes that researchers have diverse learning preferences and time constraints:
- Visual learners: Infographics, video summaries, and interactive visualizations
- Auditory learners: Podcast summaries and audio descriptions
- Reading-focused: Detailed blog posts and written case studies
- Time-constrained: Newsletter digests and quick reference cards
Impact Assessment
Each best practice is assessed across four complementary dimensions to provide a comprehensive view of its potential value. All scores are normalized to a 0-1 scale.
Citation Impact
Age-normalized citation analysis accounting for publication recency and document type.
Formula: min(citations / (age × type_factor), 1.0) where type_factor is 10 for papers and 5 for patents.
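The formula translates directly into code. In this sketch, age is assumed to be measured in years and is clamped to a minimum of 1 so that documents published in the current year do not cause a division by zero; both choices, and the function name, are assumptions not stated above.

```python
TYPE_FACTORS = {"paper": 10, "patent": 5}


def citation_impact(citations: int, age_years: int, doc_type: str) -> float:
    """Age-normalized citation score, capped at 1.0."""
    type_factor = TYPE_FACTORS[doc_type]
    age = max(age_years, 1)  # assumption: treat documents under one year old as age 1
    return min(citations / (age * type_factor), 1.0)
```

For example, a five-year-old paper with 30 citations scores 30 / (5 × 10) = 0.6, while a two-year-old patent with the same count reaches 30 / (2 × 5) = 3.0 and is capped at 1.0.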
Innovation Impact
Based on patent family size, geographic coverage, and publication recency.
Components: Family size score (40%), jurisdiction coverage (40%), recency bonus (20%).
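The weighting can be illustrated as below. How each component is derived from patent data is not described here, so the sketch assumes the three sub-scores are already available on a 0-1 scale; the dictionary keys and function name are hypothetical.

```python
INNOVATION_WEIGHTS = {
    "family_size": 0.40,
    "jurisdiction_coverage": 0.40,
    "recency": 0.20,
}


def innovation_impact(components: dict[str, float]) -> float:
    """Weighted sum of sub-scores, each clipped to [0, 1] before weighting."""
    return sum(
        weight * min(max(components[name], 0.0), 1.0)
        for name, weight in INNOVATION_WEIGHTS.items()
    )
```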
Commercial Viability
Industry involvement signals, patent legal status, and applicant diversity.
Components: Industry authorship, patent grant status, multi-applicant collaboration.
Replication Potential
Accessibility and reproducibility based on open access status and methodology detail.
Components: Open access availability (40%), implementation complexity (35%), performance metrics presence (25%).
The Overall Impact Score is computed as the arithmetic mean of all four dimensions, providing a balanced assessment that does not unduly favor any single aspect.
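A short sketch of the aggregation, assuming each dimension score is already normalized to 0-1. The dimension keys and function name are illustrative.

```python
from statistics import mean

DIMENSIONS = (
    "citation_impact",
    "innovation_impact",
    "commercial_viability",
    "replication_potential",
)


def overall_impact(scores: dict[str, float]) -> float:
    """Unweighted arithmetic mean of the four dimension scores."""
    return mean(scores[name] for name in DIMENSIONS)
```

Because the mean is unweighted, a practice that is strong in one dimension only (for instance, highly cited but closed-access and unpatented) cannot exceed a modest overall score.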
AI Method Classification
AI methods are classified into six primary categories using a hybrid approach that combines keyword-based matching with language model classification for ambiguous cases.
Machine Learning
Neural networks, random forests, SVMs, clustering, ensemble methods
Computer Vision
CNNs, object detection, image segmentation, pattern recognition
Natural Language Processing
Text mining, sentiment analysis, NER, document classification
Optimization
Genetic algorithms, linear programming, metaheuristics, scheduling
Prediction & Forecasting
Time series, demand forecasting, anomaly detection, trend analysis
Robotics
Autonomous systems, robotic manipulation, navigation, motion planning
Each best practice is assigned a primary AI method and may include secondary methods when multiple techniques are employed. The classification uses an expanded keyword taxonomy with weighted matching, falling back to language model classification when keyword matches are inconclusive.
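The hybrid scheme might be sketched as follows. The keyword weights, the margin that defines an "inconclusive" match, and the `fallback` callable (standing in for the language model classifier) are all assumptions made for illustration.

```python
from typing import Callable, Optional

# Hypothetical weighted keywords for the six categories.
METHOD_KEYWORDS = {
    "Machine Learning": {"random forest": 2.0, "neural network": 1.0, "clustering": 1.0},
    "Computer Vision": {"object detection": 2.0, "image segmentation": 2.0},
    "Natural Language Processing": {"text mining": 2.0, "sentiment analysis": 2.0},
    "Optimization": {"genetic algorithm": 2.0, "linear programming": 2.0},
    "Prediction & Forecasting": {"demand forecasting": 2.0, "time series": 1.5},
    "Robotics": {"robotic": 2.0, "motion planning": 2.0},
}


def classify_method(text: str, fallback: Optional[Callable[[str], str]] = None,
                    min_margin: float = 1.0) -> tuple[Optional[str], list[str]]:
    """Return (primary method, secondary methods) for a document text."""
    lowered = text.lower()
    scores = {
        method: sum(w for kw, w in keywords.items() if kw in lowered)
        for method, keywords in METHOD_KEYWORDS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    # Inconclusive: nothing matched, or the top two categories are too close.
    inconclusive = best_score == 0 or best_score - runner_up_score < min_margin
    if inconclusive:
        if fallback is not None:
            return fallback(text), []
        if best_score == 0:
            return None, []
    secondary = [method for method, score in ranked[1:] if score > 0]
    return best, secondary
```

Passing the language model call in as a parameter keeps the keyword path testable without network access.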
Platform Features
The web-based platform provides multiple complementary views for exploring best practices, designed to support researchers, educators, and practitioners at participating HEIs.
Best Practices
Search, filter, and browse synthesized practices
Topic Clusters
Explore by circular economy domain
Geographic Map
Locate practices by country and region
Source Documents
Access original papers and patents
Authors
Discover researchers in the field
Analytics
Statistics and publication trends
Filtering Dimensions
The platform supports multi-dimensional filtering to enable focused exploration:
- AI Method: Machine Learning, Computer Vision, NLP, Optimization, Prediction/Forecasting, Robotics
- CE Topic: 16 predefined circular economy domains
- Document Type: Paper-based, Patent-based, or Hybrid practices
- Technology Readiness Level: Estimated TRL (1-9)
- Publication Year: Temporal filtering
- Impact Score: Minimum threshold filtering
- Geographic Location: Country and jurisdiction-based filtering
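Combining these dimensions amounts to applying every active criterion with a logical AND. The sketch below assumes a flat record per practice; the `Practice` fields, the function name, and the use of country codes are hypothetical and do not reflect the platform's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Practice:
    title: str
    ai_method: str
    ce_topic: str
    doc_type: str   # "paper", "patent", or "hybrid"
    trl: int        # estimated Technology Readiness Level, 1-9
    year: int
    impact: float   # overall impact score in [0, 1]
    country: str    # country code (assumed representation)


def filter_practices(practices, *, ai_method=None, ce_topic=None, doc_type=None,
                     min_trl=None, year_range=None, min_impact=None, country=None):
    """Return the practices that satisfy every criterion that is not None."""
    def keep(p: Practice) -> bool:
        return all([
            ai_method is None or p.ai_method == ai_method,
            ce_topic is None or p.ce_topic == ce_topic,
            doc_type is None or p.doc_type == doc_type,
            min_trl is None or p.trl >= min_trl,
            year_range is None or year_range[0] <= p.year <= year_range[1],
            min_impact is None or p.impact >= min_impact,
            country is None or p.country == country,
        ])
    return [p for p in practices if keep(p)]
```

Leaving a criterion as `None` disables it, so calling the function with no keyword arguments returns the full catalogue.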
HEI Discovery & Geographic Map
The interactive map visualization is designed to enable HEIs and individual researchers to easily locate relevant best practices based on their specific research focus and geographic proximity. This supports the project goal of establishing a digital best-research-practice sharing platform at each participating HEI.
Proximity-Based Discovery
Researchers can identify best practices from nearby institutions, facilitating potential collaborations and knowledge exchange with geographically proximate partners.
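One way to implement proximity-based discovery is to compute great-circle distances between an institution and each geocoded practice. The sketch below uses the haversine formula with a mean Earth radius of 6371 km; the function names and record fields are hypothetical, and the platform may use a different distance measure.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def practices_within(origin: tuple[float, float], practices: list[dict],
                     radius_km: float) -> list[str]:
    """Return titles of practices inside the radius, nearest first."""
    measured = [
        (haversine_km(origin[0], origin[1], p["lat"], p["lon"]), p["title"])
        for p in practices
    ]
    return [title for distance, title in sorted(measured) if distance <= radius_km]
```

As an illustration with approximate coordinates, a search from Hamburg (53.55, 9.99) with a 500 km radius would include a practice located in Berlin (52.52, 13.405) and exclude one in Madrid (40.4168, -3.7038).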
HEI Partner Value
Research Focus Alignment
Filter practices by CE topics and AI methods that align with your institution's research priorities and expertise areas.
Geographic Filtering
Discover practices from your region or country to identify local expertise and potential collaboration opportunities.
Collaboration Networks
Visualize international collaboration patterns and identify research networks in your areas of interest.
Export & Integration
Export filtered practice sets for institutional use, integration with internal knowledge management systems, or further analysis.
Map Features
- Practice locations: Geocoded author affiliations and patent jurisdictions
- Cluster visualization: Grouped markers for areas with high practice concentration
- Collaboration edges: Lines connecting institutions with joint publications
- Country filtering: Click to filter practices from specific countries
- Practice preview: Hover for quick practice summaries, click for full details
Reproducibility & Repository
A core requirement of Deliverable D2.1 is providing a documented repository with the reproducible data and analysis pipeline. The project maintains full reproducibility through version-controlled code, documented dependencies, and standardized data formats.
Repository Components
Pipeline Modules
- enhanced_pipeline.py: Main orchestrator
- extraction_pipeline.py: Best practice identification
- embeddings_module_v2.py: SPECTER2 embeddings
- clustering_module.py: UMAP clustering
- geocoding_module.py: Geographic enrichment
- openalex_enrichment.py: Author affiliation data
Configuration & Data
- config.py: Pipeline configuration
- requirements.txt: Python dependencies
- data/ce_topics.json: CE taxonomy definition
- best_practices.db: SQLite database
- public/data/*.json: Exported static data
Reproducibility Features
Version-Controlled Codebase
Complete Python and TypeScript source code with git history for full traceability
Documented Dependencies
Pinned package versions in requirements.txt and package.json for exact reproduction
Documented Schema
Complete database schema documentation with field descriptions and relationships
Pipeline Documentation
Comprehensive README files with usage instructions, examples, and configuration options
Running the Pipeline
The pipeline can be executed with a single command and supports various configuration options:
# Full pipeline execution
python enhanced_pipeline.py

# Sample run for testing
python enhanced_pipeline.py --sample 100

# Export data for static deployment
python scripts/export_for_static.py
Limitations & Future Work
Current Limitations
- Topic coverage: The predefined taxonomy may not capture emerging or niche CE topics not represented in the current 16 categories
- Language bias: The corpus is primarily English-language, potentially missing relevant research published in other languages
- TRL estimation: Technology Readiness Levels are estimated from text analysis rather than direct assessment, introducing uncertainty
- Citation lag: Recently published documents may have artificially low citation impact scores due to citation accumulation delay
- Synthesis quality: Language model-generated content may occasionally introduce factual errors or overstate findings
Future Directions
- Dynamic taxonomy: Automatic discovery of emerging topics from document clusters
- Cross-lingual analysis: Extension to non-English literature
- Validation framework: Expert review and quality scoring of synthesized practices
- Real-time updates: Continuous ingestion of new publications
- Citation network analysis: Identifying influential research trajectories
- SciSciNet integration: Additional data source for enhanced bibliometric analysis
AI-InnoScEnCe Project • Work Package 2
Task 2.1: Data-driven Best-Practice Identification (M1-M6) • Task 2.2: Best-Practice Curation (M4-M9)
For technical implementation details, see the project repository documentation.