AI-InnoScEnCe Project • Work Package 2

Data-Driven Best Practice Identification

A scientific knowledge extraction methodology for identifying, synthesizing, and curating best practices at the intersection of Artificial Intelligence and Circular Economy from comprehensive scientific databases.

Project Context

This platform implements Work Package 2 of the AI-InnoScEnCe project, comprising two interconnected tasks: Task 2.1: Data-driven Best-Practice Identification and Task 2.2: Best-Practice Curation. The objective is to systematically analyze the intersection of artificial intelligence and circular economy research, extracting and curating actionable best practices from scientific literature to support higher education institutions (HEIs) and individual researchers across Europe.

Research Objectives

  • Comprehensive coverage through scientific database integration (OpenAlex, Lens.org)
  • Systematic filtering of the global knowledge base for AI applications in circular economy
  • Knowledge synthesis through semantic analysis and topic-based clustering
  • Interactive map visualization enabling HEIs to locate best practices by research focus and proximity
  • Multimodal content curation for diverse learning preferences and accessibility

Work Package 2: Tasks & Deliverables

Work Package 2 establishes a data-driven approach to identifying and curating best practices in AI for Circular Economy, delivered through a digital sharing platform accessible to all participating HEIs.

gantt
    title Work Package 2 Timeline
    dateFormat YYYY-MM
    section Tasks
    T2.1 Best-Practice Identification :t1, 2025-03, 6M
    T2.2 Best-Practice Curation :t2, 2025-07, 5M
    section Deliverables
    D2.1 Practices Identified (M9) :milestone, d1, 2025-12, 0d
    D2.2 Practices Curated (M9) :milestone, d2, 2025-12, 0d

Task 2.1: Data-driven Identification

M1-M6 (March 2025 - September 2025)

Leveraging comprehensive scientific databases such as OpenAlex and Lens.org, the project filters the global knowledge base in natural sciences and engineering to identify research that effectively uses AI for circular economy purposes.

  • Scientific database integration
  • Semantic filtering and clustering
  • Interactive map visualization

Task 2.2: Best-Practice Curation

M4-M9 (July 2025 - December 2025)

Curation of best practices in accessible and engaging formats. To ensure researchers can fully benefit from identified best practices, these are made available in multiple formats including audio/video summaries, blog posts, and newsletters.

  • Multimodal content formats
  • Diverse learning preference support
  • Enhanced accessibility

Best-Research-Practice Sharing Platform

A digital best-research-practice sharing platform is established and deployed for each participating HEI project partner. The platform provides unified access to identified best practices, enabling researchers to discover relevant research based on their specific focus areas and geographic proximity.

Project Deliverables

Work Package 2 produces two major deliverables that together provide a comprehensive best-practice resource for all participating HEI project partners.

D2.1

Best-Research Practices Identified

M9 (December 2025) • Lead: TUHH

A documented repository containing the reproducible data and analysis pipeline, an interactive map, and a set of best-practice research papers relevant to each participating HEI project partner.

Reproducible Pipeline

Version-controlled Python codebase with documented dependencies and configuration

Interactive Map

Geographic visualization enabling location-based discovery of relevant practices

HEI-Relevant Practices

Best practices filtered and categorized for each partner institution's research focus

D2.2

Best-Research Practices Curated

M9 (December 2025) • Lead: USC

A technical guide to producing best research practices in different formats, enabling multimodal content creation for diverse learning preferences.

Audio/Video Summaries

Technical guide for producing multimedia practice overviews

Blog Posts

Templates and workflows for accessible written content

Newsletters

Periodic digest formats for ongoing practice dissemination

Methodology

The methodology employs a topic-driven knowledge synthesis approach that combines semantic document embeddings, predefined topic taxonomies, and language model-powered text generation to extract structured best practices from scientific literature.

Core Approach

Unlike unsupervised clustering methods that may produce semantically incoherent groupings, this approach uses a predefined circular economy taxonomy to ensure interpretable and domain-relevant categorization. The methodology consists of four key innovations:

  1. Predefined topic taxonomy: Documents are matched to 16 expert-defined circular economy topics using semantic keyword analysis, ensuring consistent categorization aligned with established CE frameworks
  2. Anchor-context synthesis model: Within each topic, high-impact documents are selected as "anchors" and combined with semantically similar supporting documents to provide context for synthesis
  3. Diversity-aware sampling: Anchor selection uses embedding-based diversity sampling to ensure coverage of different approaches within each topic
  4. Multi-dimensional impact scoring: Each best practice is assessed across four dimensions (citation impact, innovation impact, commercial viability, replication potential)

Processing Pipeline

flowchart LR
    A[Scientific<br/>Literature] --> B[Semantic<br/>Embedding]
    B --> C[Topic<br/>Matching]
    C --> D[Anchor<br/>Selection]
    D --> E[Knowledge<br/>Synthesis]
    E --> F[Impact<br/>Scoring]
    style A fill:#e0f2fe
    style C fill:#fef3c7
    style E fill:#dcfce7
    style F fill:#f3e8ff

Figure 1: High-level overview of the best practice extraction pipeline

Data Sources & Processing

The platform leverages comprehensive scientific databases to filter the global knowledge base in natural sciences and engineering, identifying research that effectively applies AI methodologies to circular economy challenges. This multi-source approach ensures broad coverage of both academic publications and technological innovations.

Global Knowledge Base Coverage

The integrated database approach provides access to over 250 million scholarly works, enabling systematic identification of AI+CE research across disciplines. Documents are filtered using semantic matching against AI methodology keywords and circular economy topic taxonomies.

OpenAlex

Primary source for scholarly metadata and author affiliations, providing comprehensive coverage equivalent to SciSciNet for bibliometric analysis.

  • 250M+ scholarly works with rich metadata
  • Author institutional affiliations and locations
  • Geographic coordinates for research mapping
  • Citation networks and collaboration data
  • Open access status and funding information

Lens.org

Unified platform for scholarly articles and patent documents with cross-references between research and innovation.

  • Scholarly articles with full abstracts and citations
  • Patent documents with claims and jurisdictions
  • Cross-references between papers and patents
  • Patent family and legal status information
  • Applicant and inventor metadata

Knowledge Base Filtering

The filtering process systematically narrows the global corpus to identify research at the AI-CE intersection:

  1. AI methodology detection: Documents are scanned for AI/ML method keywords across six primary categories (machine learning, computer vision, NLP, optimization, prediction/forecasting, robotics)
  2. CE topic matching: Semantic matching against 16 predefined circular economy topics ensures domain relevance
  3. Quality filtering: Citation metrics, publication venue, and completeness criteria ensure high-quality source documents
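The three filtering stages above can be sketched as a simple funnel. This is a minimal illustration, not the project's implementation: the keyword lists, the `min_citations` threshold, and the dict field names (`title`, `abstract`, `citations`) are assumptions, and plain substring matching stands in for the semantic matching the pipeline applies to CE topics.

```python
from typing import Iterable

# Hypothetical, heavily abbreviated keyword lists (assumptions for illustration).
AI_KEYWORDS = {
    "machine_learning": ["machine learning", "neural network", "random forest"],
    "computer_vision": ["computer vision", "object detection", "image segmentation"],
    "nlp": ["natural language processing", "text mining"],
    "optimization": ["genetic algorithm", "linear programming"],
    "prediction_forecasting": ["forecasting", "time series"],
    "robotics": ["robot", "motion planning"],
}
CE_KEYWORDS = ["recycling", "circular economy", "remanufacturing", "waste sorting"]


def detect_ai_categories(text: str) -> list[str]:
    """Return the AI categories whose keywords appear in the text."""
    lowered = text.lower()
    return [cat for cat, kws in AI_KEYWORDS.items() if any(kw in lowered for kw in kws)]


def passes_filter(doc: dict, min_citations: int = 1) -> bool:
    """Apply the three stages in order: AI detection, CE match, quality."""
    text = f"{doc.get('title', '')} {doc.get('abstract', '')}"
    if not detect_ai_categories(text):
        return False
    if not any(kw in text.lower() for kw in CE_KEYWORDS):
        return False
    # Completeness: require an abstract; quality: a minimum citation count.
    return bool(doc.get("abstract")) and doc.get("citations", 0) >= min_citations


def filter_corpus(docs: Iterable[dict]) -> list[dict]:
    """Keep only documents that survive all three stages."""
    return [d for d in docs if passes_filter(d)]
```

Ordering the cheap keyword checks before the quality check keeps the funnel inexpensive on a large corpus; a real run would replace the CE substring test with embedding-based matching.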

Semantic Embedding

Documents are transformed into 768-dimensional semantic representations using SPECTER2, a transformer-based embedding model trained specifically on scientific documents. SPECTER2 captures semantic meaning beyond keyword matching, enabling:

  • Computation of semantic similarity between documents
  • Identification of supporting documents for context enrichment
  • Diversity-aware sampling to ensure representative anchor selection

Embeddings are generated from the concatenation of document titles and abstracts, with results cached for computational efficiency across pipeline runs.
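A minimal sketch of the title-plus-abstract input and the caching step follows. The `EmbeddingCache` class, the JSON cache file, the `" [SEP] "` separator string, and the `fake_embed` placeholder are all assumptions made for illustration; in the actual pipeline the placeholder would be replaced by a call to the SPECTER2 model.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

EMBEDDING_DIM = 768  # dimensionality stated in the text


def build_input_text(title: str, abstract: str, sep: str = " [SEP] ") -> str:
    """Concatenate title and abstract; the separator string is an assumption."""
    return f"{title.strip()}{sep}{abstract.strip()}"


def cache_key(text: str) -> str:
    """Stable key derived from the exact input text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class EmbeddingCache:
    """Tiny JSON-file cache so unchanged documents are not re-embedded."""

    def __init__(self, path: Path, embed_fn: Callable[[str], list[float]]):
        self.path = path
        self.embed_fn = embed_fn
        self.store = json.loads(path.read_text()) if path.exists() else {}
        self.misses = 0  # number of times the model had to be called

    def get(self, title: str, abstract: str) -> list[float]:
        text = build_input_text(title, abstract)
        key = cache_key(text)
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

    def save(self) -> None:
        self.path.write_text(json.dumps(self.store))


def fake_embed(text: str) -> list[float]:
    """Placeholder for the real model call; returns a deterministic vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(EMBEDDING_DIM)]
```

Keying the cache on a hash of the input text means an edited abstract is re-embedded automatically, while unchanged documents are served from disk on later runs.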

Circular Economy Topic Taxonomy

The platform uses a predefined taxonomy of 16 circular economy topics, each defined with associated keywords for semantic matching. This taxonomy-driven approach ensures consistent, interpretable categorization aligned with established circular economy frameworks and enables meaningful comparison across research areas.

  • ♻️ Waste Sorting & Classification: Automated identification and sorting of waste materials
  • 🧴 Plastic Recycling & Recovery: Polymer identification and plastic waste processing
  • 🔋 Battery & E-Waste Recycling: Recovery of batteries and electronic waste
  • 🍎 Food Waste Reduction: Demand forecasting and food byproduct valorization
  • 🏗️ Construction & Demolition Waste: Building material recovery and reuse
  • 👕 Textile & Fashion Circularity: Fabric sorting and clothing resale optimization
  • 🔧 Remanufacturing & Refurbishment: Product recovery and predictive maintenance
  • 📦 Reverse Logistics: Optimization of product returns and collection
  • 🏭 Industrial Symbiosis: Byproduct exchange networks between industries
  • 🎯 Product Lifecycle & Design: Design for circularity and lifecycle assessment
  • 🌾 Circular Agriculture: Crop residue utilization and bioeconomy
  • 💧 Water & Resource Efficiency: Water recycling and process optimization
  • ⚙️ Metal Recovery & Recycling: Ferrous and non-ferrous metal sorting
  • 📦 Sustainable Packaging: Recyclable and reusable packaging systems
  • 🤝 Sharing Economy: Product-as-a-service and sharing platforms
  • 🌍 Carbon Footprint Reduction: Emissions monitoring and decarbonization

Topic Matching Algorithm

Documents are assigned to topics using a weighted keyword matching algorithm that considers both direct keyword matches (higher weight) and semantic overlap with topic descriptions (lower weight). Each document is assigned to the topic with the highest matching score.
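The weighted matching described above might look roughly like the sketch below. The weight values, the two sample topics with their keyword lists, and the use of word overlap as a stand-in for "semantic overlap" are assumptions; the source only states that direct keyword matches weigh more than overlap with the topic description.

```python
import re
from typing import Optional

# Hypothetical weights: the text only says direct keyword hits weigh more.
KEYWORD_WEIGHT = 2.0
DESCRIPTION_WEIGHT = 1.0

# Two of the 16 topics, with illustrative keywords (not the project's lists).
TOPICS = {
    "Plastic Recycling & Recovery": {
        "keywords": ["plastic", "polymer", "pet bottle"],
        "description": "Polymer identification and plastic waste processing",
    },
    "Food Waste Reduction": {
        "keywords": ["food waste", "food loss"],
        "description": "Demand forecasting and food byproduct valorization",
    },
}


def tokenize(text: str) -> set[str]:
    """Lowercase alphabetic tokens as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def topic_score(text: str, topic: dict) -> float:
    """Weighted sum of keyword hits and description word overlap."""
    lowered = text.lower()
    keyword_hits = sum(1 for kw in topic["keywords"] if kw in lowered)
    desc_tokens = tokenize(topic["description"])
    overlap = len(tokenize(text) & desc_tokens) / max(len(desc_tokens), 1)
    return KEYWORD_WEIGHT * keyword_hits + DESCRIPTION_WEIGHT * overlap


def assign_topic(text: str) -> tuple[Optional[str], float]:
    """Return the highest-scoring topic, or (None, 0.0) without any evidence."""
    scores = {name: topic_score(text, t) for name, t in TOPICS.items()}
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] > 0 else (None, 0.0)
```

Returning `None` for documents with no evidence is a design choice of this sketch; the text does not say how unmatched documents are handled.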

Best Practice Synthesis

The synthesis process distills knowledge from topic-grouped documents into structured best practices using an anchor-context model with language model-powered text generation.

Synthesis Process

1. High-Impact Filtering

Within each topic cluster, documents are ranked by citation count and other impact metrics. The top 30% are retained as candidates for anchor selection.

2. Diversity Sampling

From high-impact candidates, "anchor" documents are selected using embedding-based diversity sampling to ensure coverage of different approaches within the topic. Up to 5 anchors are selected per topic.

3. Context Collection

For each anchor, supporting documents are identified by semantic similarity (cosine similarity ≥ 0.70). Up to 5 supporting documents provide additional context for synthesis.

4. Knowledge Synthesis

A language model generates structured titles and descriptions from the anchor and supporting documents. Titles follow the format: "[AI Method] for [Specific CE Application]". Descriptions emphasize the circular economy problem, AI/ML solution, and CE outcomes.
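The first three steps of the synthesis process (high-impact filtering, diversity sampling, context collection) can be sketched in plain Python. This is an illustration under assumptions: ranking uses citation count only, the 30% cut is rounded up, diversity sampling is implemented as greedy farthest-point selection, and the dict fields `citations` and `embedding` are hypothetical names. The source does not specify which diversity sampling algorithm is used.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def high_impact(docs: list[dict], fraction: float = 0.30) -> list[dict]:
    """Step 1: keep the top `fraction` of documents by citation count."""
    ranked = sorted(docs, key=lambda d: d["citations"], reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


def select_anchors(candidates: list[dict], max_anchors: int = 5) -> list[dict]:
    """Step 2: greedy farthest-point sampling over embeddings."""
    if not candidates:
        return []
    anchors = [candidates[0]]  # start from the first (most-cited) candidate
    remaining = candidates[1:]
    while remaining and len(anchors) < max_anchors:
        # Pick the candidate whose closest anchor is least similar.
        nxt = min(
            remaining,
            key=lambda d: max(cosine(d["embedding"], a["embedding"]) for a in anchors),
        )
        anchors.append(nxt)
        remaining = [d for d in remaining if d is not nxt]
    return anchors


def supporting_docs(anchor: dict, docs: list[dict],
                    threshold: float = 0.70, max_docs: int = 5) -> list[dict]:
    """Step 3: most similar documents at or above the similarity threshold."""
    scored = [
        (cosine(anchor["embedding"], d["embedding"]), d)
        for d in docs
        if d is not anchor
    ]
    scored = [(s, d) for s, d in scored if s >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:max_docs]]
```

The anchors and their supporting documents would then be passed to the language model in step 4; that call is omitted here because the source does not describe the model or prompt.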

Synthesis Quality

Generated content is designed to follow academic writing conventions:

✓ Title Example

"Convolutional Neural Networks for Real-Time Polymer Identification in Material Recovery Facilities"

✗ Avoided

"Smart AI-Powered Circular Economy Solutions"

Generic terms such as "innovative," "cutting-edge," and "smart" are explicitly avoided to maintain scientific precision.

Best-Practice Curation (Task 2.2)

To ensure that researchers can fully benefit from the identified best practices, Task 2.2 focuses on curating these findings in accessible and engaging formats. This multimodal approach caters to diverse learning preferences and enhances the accessibility of scientific information for the broader research community.

Multimodal Content Formats

Audio/Video Summaries

Concise multimedia overviews of key best practices, suitable for:

  • Podcast-style research summaries
  • Short video explainers
  • Webinar presentations
  • Accessible audio descriptions

Blog Posts

Written content for broader dissemination:

  • Detailed practice explanations
  • Implementation case studies
  • Comparative analyses
  • Topic-focused deep dives

Newsletters

Regular updates for ongoing engagement:

  • Monthly practice highlights
  • New additions announcements
  • Topic trend summaries
  • Partner research features

Technical Guide for Content Production

Deliverable D2.2 provides a comprehensive technical guide enabling content creators to transform structured best practice data into various output formats:

Structured Data for Content Generation

Each best practice includes structured metadata that supports multimodal content creation:

Title & Description

LLM-synthesized content

AI Methods

Classified techniques

CE Topics

Domain categorization

Impact Scores

Multi-dimensional metrics

Source Documents

Full provenance

Geographic Data

Location metadata

TRL Level

Maturity assessment

Keywords

Topic descriptors
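As a rough illustration of how such a structured record could feed content generation, the sketch below models one best practice as a dataclass. The class name, every field name, and the `newsletter_blurb` helper are hypothetical and are not taken from the project's database schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class BestPractice:
    """Illustrative record; field names are assumptions, not the real schema."""

    title: str
    description: str
    ai_methods: list[str]
    ce_topic: str
    impact_scores: dict[str, float]
    source_documents: list[str]
    countries: list[str] = field(default_factory=list)
    trl: Optional[int] = None
    keywords: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record, for example for a static JSON export."""
        return json.dumps(asdict(self), ensure_ascii=False)

    def newsletter_blurb(self, max_chars: int = 160) -> str:
        """Short text form: title plus a truncated description."""
        text = self.description.strip()
        if len(text) > max_chars:
            text = text[: max_chars - 3].rstrip() + "..."
        return f"{self.title}: {text}"
```

Keeping one structured record per practice lets each output format (blog post, newsletter item, audio script) be derived from the same fields rather than written independently.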

Learning Preference Accommodation

The multimodal curation approach recognizes that researchers have diverse learning preferences and time constraints:

  • Visual learners: Infographics, video summaries, and interactive visualizations
  • Auditory learners: Podcast summaries and audio descriptions
  • Reading-focused: Detailed blog posts and written case studies
  • Time-constrained: Newsletter digests and quick reference cards

Impact Assessment

Each best practice is assessed across four complementary dimensions to provide a comprehensive view of its potential value. All scores are normalized to a 0-1 scale.

Citation Impact

Age-normalized citation analysis accounting for publication recency and document type.

Formula: min(citations / (age × type_factor), 1.0) where type_factor is 10 for papers and 5 for patents.
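The citation formula translates directly into code. In this sketch, measuring age in years and flooring it at one year (to avoid division by zero for very recent documents) are assumptions; the source does not state the unit or how a zero age is handled.

```python
TYPE_FACTORS = {"paper": 10, "patent": 5}  # values from the formula above


def citation_impact(citations: int, age_years: float, doc_type: str) -> float:
    """Compute min(citations / (age * type_factor), 1.0)."""
    type_factor = TYPE_FACTORS[doc_type]
    # Assumption: documents younger than one year are treated as one year old
    # to avoid division by zero; the source does not specify this case.
    age = max(age_years, 1.0)
    return min(citations / (age * type_factor), 1.0)
```

For example, a four-year-old paper with 20 citations scores 20 / (4 × 10) = 0.5, while a patent needs half as many citations per year to reach the same score.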

Innovation Impact

Based on patent family size, geographic coverage, and publication recency.

Components: Family size score (40%), jurisdiction coverage (40%), recency bonus (20%).

Commercial Viability

Industry involvement signals, patent legal status, and applicant diversity.

Components: Industry authorship, patent grant status, multi-applicant collaboration.

Replication Potential

Accessibility and reproducibility based on open access status and methodology detail.

Components: Open access availability (40%), implementation complexity (35%), performance metrics presence (25%).

The Overall Impact Score is computed as the arithmetic mean of all four dimensions, providing a balanced assessment that does not unduly favor any single aspect.
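A minimal sketch of the weighted components and the overall mean is shown below. The component key names are hypothetical, each component score is assumed to be already normalized to the 0-1 range, and the complexity component is assumed to be oriented so that higher means easier to replicate.

```python
from statistics import mean

# Component weights as listed in the text; key names are assumptions.
INNOVATION_WEIGHTS = {"family_size": 0.40, "jurisdictions": 0.40, "recency": 0.20}
REPLICATION_WEIGHTS = {"open_access": 0.40, "complexity": 0.35, "metrics": 0.25}


def weighted_score(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    return sum(weights[name] * components[name] for name in weights)


def overall_impact(citation: float, innovation: float,
                   commercial: float, replication: float) -> float:
    """Arithmetic mean of the four dimension scores."""
    return mean([citation, innovation, commercial, replication])
```

Because the weights in each dimension sum to 1, a weighted score stays within the 0-1 range whenever its components do, so the overall mean does as well.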

AI Method Classification

AI methods are classified into six primary categories using a hybrid approach that combines keyword-based matching with language model classification for ambiguous cases.

Machine Learning

Neural networks, random forests, SVMs, clustering, ensemble methods

Computer Vision

CNNs, object detection, image segmentation, pattern recognition

Natural Language Processing

Text mining, sentiment analysis, NER, document classification

Optimization

Genetic algorithms, linear programming, metaheuristics, scheduling

Prediction & Forecasting

Time series, demand forecasting, anomaly detection, trend analysis

Robotics

Autonomous systems, robotic manipulation, navigation, motion planning

Each best practice is assigned a primary AI method and may include secondary methods when multiple techniques are employed. The classification uses an expanded keyword taxonomy with weighted matching, falling back to language model classification when keyword matches are inconclusive.
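The hybrid classification with a fallback could be organized as in the sketch below. The keyword weights, the `min_score` threshold, and the `llm_fallback` callable are assumptions; the fallback is passed in as a plain function so that no particular language model API is implied.

```python
from typing import Callable, Optional

# Illustrative weighted keywords per category (weights are assumptions).
METHOD_KEYWORDS = {
    "Machine Learning": {"random forest": 2.0, "neural network": 1.5, "clustering": 1.0},
    "Computer Vision": {"object detection": 2.0, "image segmentation": 2.0},
    "Natural Language Processing": {"text mining": 2.0, "sentiment analysis": 2.0},
    "Optimization": {"genetic algorithm": 2.0, "linear programming": 2.0},
    "Prediction & Forecasting": {"demand forecasting": 2.0, "time series": 1.5},
    "Robotics": {"robotic manipulation": 2.0, "motion planning": 2.0},
}


def score_methods(text: str) -> dict[str, float]:
    """Sum keyword weights per category; categories without hits are omitted."""
    lowered = text.lower()
    scores = {}
    for category, keywords in METHOD_KEYWORDS.items():
        total = sum(w for kw, w in keywords.items() if kw in lowered)
        if total > 0:
            scores[category] = total
    return scores


def classify(
    text: str,
    llm_fallback: Optional[Callable[[str], str]] = None,
    min_score: float = 1.5,
) -> tuple[Optional[str], list[str]]:
    """Return (primary method, secondary methods)."""
    ranked = sorted(score_methods(text).items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] < min_score:
        # Inconclusive keyword evidence: defer to a language model if provided.
        primary = llm_fallback(text) if llm_fallback else None
        return primary, []
    primary = ranked[0][0]
    secondary = [name for name, score in ranked[1:] if score >= min_score]
    return primary, secondary
```

Running the keyword pass first keeps language model calls limited to the ambiguous minority of documents.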

Platform Features

The web-based platform provides multiple complementary views for exploring best practices, designed to support researchers, educators, and practitioners at participating HEIs.

Best Practices

Search, filter, and browse synthesized practices

Topic Clusters

Explore by circular economy domain

Geographic Map

Locate practices by country and region

Source Documents

Access original papers and patents

Authors

Discover researchers in the field

Analytics

Statistics and publication trends

Filtering Dimensions

The platform supports multi-dimensional filtering to enable focused exploration:

  • AI Method: Machine Learning, Computer Vision, NLP, Optimization, Prediction/Forecasting, Robotics
  • CE Topic: 16 predefined circular economy domains
  • Document Type: Paper-based, Patent-based, or Hybrid practices
  • Technology Readiness Level: Estimated TRL (1-9)
  • Publication Year: Temporal filtering
  • Impact Score: Minimum threshold filtering
  • Geographic Location: Country and jurisdiction-based filtering
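Combining these dimensions amounts to applying every active criterion in turn. The sketch below uses plain dicts; the field names (`ai_methods`, `ce_topic`, `doc_type`, `trl`, `year`, `impact`, `countries`) and the function signature are assumptions for illustration, not the platform's actual data model.

```python
from typing import Optional


def filter_practices(
    practices: list[dict],
    ai_method: Optional[str] = None,
    ce_topic: Optional[str] = None,
    doc_type: Optional[str] = None,
    trl_range: Optional[tuple[int, int]] = None,
    year_range: Optional[tuple[int, int]] = None,
    min_impact: Optional[float] = None,
    country: Optional[str] = None,
) -> list[dict]:
    """Return practices matching every criterion that is not None."""

    def keep(p: dict) -> bool:
        if ai_method is not None and ai_method not in p["ai_methods"]:
            return False
        if ce_topic is not None and p["ce_topic"] != ce_topic:
            return False
        if doc_type is not None and p["doc_type"] != doc_type:
            return False
        if trl_range is not None and not trl_range[0] <= p["trl"] <= trl_range[1]:
            return False
        if year_range is not None and not year_range[0] <= p["year"] <= year_range[1]:
            return False
        if min_impact is not None and p["impact"] < min_impact:
            return False
        if country is not None and country not in p["countries"]:
            return False
        return True

    return [p for p in practices if keep(p)]
```

Leaving a criterion as `None` disables it, so the same function covers both a single-dimension search and a narrow combined query.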

HEI Discovery & Geographic Map

The interactive map visualization is designed to enable HEIs and individual researchers to easily locate relevant best practices based on their specific research focus and geographic proximity. This supports the project goal of establishing a digital best-research-practice sharing platform at each participating HEI.

Proximity-Based Discovery

Researchers can identify best practices from nearby institutions, facilitating potential collaborations and knowledge exchange with geographically proximate partners.
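One common way to implement proximity-based discovery is a great-circle distance filter using the haversine formula. The sketch below assumes each practice carries `lat` and `lon` fields in decimal degrees; those field names and the `practices_within` helper are hypothetical, and the source does not state how the platform computes proximity.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def practices_within(practices: list[dict], lat: float, lon: float,
                     radius_km: float) -> list[dict]:
    """Practices within radius_km of (lat, lon), nearest first."""
    with_dist = [(haversine_km(lat, lon, p["lat"], p["lon"]), p) for p in practices]
    nearby = [(d, p) for d, p in with_dist if d <= radius_km]
    nearby.sort(key=lambda pair: pair[0])
    return [p for _, p in nearby]
```

A researcher's institution coordinates and a search radius are then enough to produce a nearest-first list of practices for the map view.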

HEI Partner Value

Research Focus Alignment

Filter practices by CE topics and AI methods that align with your institution's research priorities and expertise areas.

Geographic Filtering

Discover practices from your region or country to identify local expertise and potential collaboration opportunities.

Collaboration Networks

Visualize international collaboration patterns and identify research networks in your areas of interest.

Export & Integration

Export filtered practice sets for institutional use, integration with internal knowledge management systems, or further analysis.

Map Features

  • Practice locations: Geocoded author affiliations and patent jurisdictions
  • Cluster visualization: Grouped markers for areas with high practice concentration
  • Collaboration edges: Lines connecting institutions with joint publications
  • Country filtering: Click to filter practices from specific countries
  • Practice preview: Hover for quick practice summaries, click for full details

Reproducibility & Repository

A core requirement of Deliverable D2.1 is providing a documented repository with the reproducible data and analysis pipeline. The project maintains full reproducibility through version-controlled code, documented dependencies, and standardized data formats.

Repository Components

Pipeline Modules
  • enhanced_pipeline.py - Main orchestrator
  • extraction_pipeline.py - Best practice identification
  • embeddings_module_v2.py - SPECTER2 embeddings
  • clustering_module.py - UMAP clustering
  • geocoding_module.py - Geographic enrichment
  • openalex_enrichment.py - Author affiliation data
Configuration & Data
  • config.py - Pipeline configuration
  • requirements.txt - Python dependencies
  • data/ce_topics.json - CE taxonomy definition
  • best_practices.db - SQLite database
  • public/data/*.json - Exported static data

Reproducibility Features

Version-Controlled Codebase

Complete Python and TypeScript source code with Git history for full traceability

Documented Dependencies

Pinned package versions in requirements.txt and package.json for exact reproduction

Documented Schema

Complete database schema documentation with field descriptions and relationships

Pipeline Documentation

Comprehensive README files with usage instructions, examples, and configuration options

Running the Pipeline

The pipeline can be executed with a single command and supports various configuration options:

# Full pipeline execution
python enhanced_pipeline.py

# Sample run for testing
python enhanced_pipeline.py --sample 100

# Export data for static deployment
python scripts/export_for_static.py

Limitations & Future Work

Current Limitations

  • Topic coverage: The predefined taxonomy may not capture emerging or niche CE topics not represented in the current 16 categories
  • Language bias: The corpus is primarily English-language, potentially missing relevant research published in other languages
  • TRL estimation: Technology Readiness Levels are estimated from text analysis rather than direct assessment, introducing uncertainty
  • Citation lag: Recently published documents may have artificially low citation impact scores due to citation accumulation delay
  • Synthesis quality: Language model-generated content may occasionally introduce factual errors or overstate findings

Future Directions

  • Dynamic taxonomy: Automatic discovery of emerging topics from document clusters
  • Cross-lingual analysis: Extension to non-English literature
  • Validation framework: Expert review and quality scoring of synthesized practices
  • Real-time updates: Continuous ingestion of new publications
  • Citation network analysis: Identifying influential research trajectories
  • SciSciNet integration: Additional data source for enhanced bibliometric analysis


For technical implementation details, see the project repository documentation.
