Scope

The basic aim of this PhD is to research and develop intelligent methods for automated patent classification at multiple levels of an intellectual classification scheme. Specifically, it aims to 1) research and develop novel data analysis, feature engineering and ML/DL methods for the automated patent classification task; and 2) devise a methodological framework so that the proposed methods can be a) evaluated with system-oriented experiments and validated with user-centered studies, b) used to support patent retrieval tasks (e.g., prior art search, patentability, technology landscape), and c) adapted and transferred to other text classification problems, such as automatic detection of text genre, sentiment analysis and fake news detection.

Objectives

To achieve these aims, the PhD involves four intermediate Research Objectives (ROs):

RO1: Research on data engineering techniques for automated patent classification. Patent documents are lengthy and full of technical details. They contain textual and multimodal/multimedia content and various structural elements, such as the intellectual hierarchy of codes (the classification scheme), the internal and external links to other patents and publications, and the structural facets of the patent text (e.g., title, abstract, claims). RO1 will conduct in-depth document analysis aiming to produce summative representations of patent documents. It will investigate various feature sets and representation models, and it will explore how structural features can be used to improve patent classification.
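As a rough illustration of what a summative representation built from structural facets might look like, the sketch below concatenates patent sections in a fixed priority order until a word budget is reached. The `PatentDocument` class, the `build_representation` function, the section order and the default budget are all hypothetical choices made for this example, not the methods RO1 will propose. Whitespace-separated words are used as a stand-in for model tokens.

```python
from dataclasses import dataclass


@dataclass
class PatentDocument:
    """Hypothetical container for the textual facets of a patent."""
    title: str
    abstract: str
    claims: str
    description: str


def build_representation(doc, budget=512,
                         order=("title", "abstract", "claims", "description")):
    """Concatenate sections in priority order, stopping at `budget` words.

    Words are split on whitespace, which only approximates the subword
    tokens a real language model would count.
    """
    words = []
    for name in order:
        remaining = budget - len(words)
        if remaining <= 0:
            break
        # Take as many words from this section as the budget still allows.
        words.extend(getattr(doc, name).split()[:remaining])
    return " ".join(words)


doc = PatentDocument(
    title="Rotary valve",
    abstract="A valve with a rotating seat.",
    claims="1. A valve comprising a seat.",
    description="The invention relates to valves used in pipelines.",
)
short_input = build_representation(doc, budget=10)
```

Changing `order` is a cheap way to compare, for instance, claims-first against abstract-first inputs under the same budget.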

RO2: Research on machine learning (ML) and ensemble (fusion) techniques for automated patent classification. RO2 will investigate recent ML algorithms that have been applied to patent classification and similar text classification tasks to predict the most representative classification code(s) for a patent. It will investigate ensemble architectures and techniques that can enhance the performance of standalone classifiers. RO2 will also perform an in-depth analysis of the ML parameters that affect the training and testing process and suggest mitigation actions to alleviate performance losses.
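To make the idea of fusing standalone classifiers concrete, here is a minimal sketch of weighted soft voting, one generic ensemble technique among many. It is not the architecture studied in this PhD or in the cited work. The function names `average_probabilities` and `top_k`, and the example classification codes and probabilities, are assumptions made for illustration.

```python
def average_probabilities(predictions, weights=None):
    """Combine per-classifier label probabilities by weighted averaging.

    `predictions` is a list with one dict per classifier, each mapping a
    classification code to the probability that classifier assigns to it.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("one weight per classifier is required")
    total = sum(weights)
    combined = {}
    for probs, weight in zip(predictions, weights):
        for label, p in probs.items():
            combined[label] = combined.get(label, 0.0) + weight * p / total
    return combined


def top_k(combined, k=1):
    """Return the k codes with the highest combined score (ties by code)."""
    return sorted(combined, key=lambda label: (-combined[label], label))[:k]


# Two hypothetical classifiers that disagree on the most likely subclass.
clf_a = {"A61K": 0.6, "C07D": 0.4}
clf_b = {"A61K": 0.3, "C07D": 0.7}
fused = average_probabilities([clf_a, clf_b])
best = top_k(fused, k=1)
```

Passing `weights` lets a stronger classifier count for more, for example weights derived from validation accuracy.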

RO3: Development of a prototype system to validate RO1 and RO2 methods. RO3 will develop a web-based system for automated patent classification across multiple levels based on the techniques explored within RO1-RO2. The system will be evaluated against system- and user-oriented criteria with the involvement of patent searchers, experts and examiners. An assessment report will be prepared summarizing the evaluation results, including comparative studies with existing tools.

RO4: Generalize RO1 and RO2 methods. RO4 will investigate how far the research methods and results can be exploited in, and extended to, other important and closely related research fields.

Innovation

This PhD goes beyond the state-of-the-art on automated patent classification in the following aspects:

  • It will provide a deep document analysis for identifying representative summaries of the patent document. Currently, only the first words of a patent document are used as input data to ML algorithms. Moreover, language models pre-trained without supervision on large corpora, such as BERT, can process inputs of at most 512 tokens, whereas a typical patent document runs to several thousand words.
  • It will provide a systematic analysis of feature engineering aspects (feature selection, reduction, representation, etc.) for automated patent classification, which is missing from the current SotA.
  • It will provide innovative ensemble architectures and techniques for automated patent classification. Few studies in the literature address the combination of different classifiers in the patent field. Among them, Kamateri et al. [1-2] show that an ensemble architecture of classifiers significantly outperforms current SotA techniques that use the same classifiers standalone, and that such ensemble methods need further investigation.
  • It will provide a novel methodological framework for the validation, integration and transferability of the proposed methods that is not supported by existing patent classification tools.
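The gap between a 512-token input limit and a document of several thousand words can be seen with a simple sliding-window split, a common workaround that is shown here only to make the size mismatch tangible. It is not the summary-based approach this PhD proposes. The function name `sliding_windows`, the stride of 256 and the use of a plain list as a stand-in for tokenizer output are assumptions for this sketch.

```python
def sliding_windows(tokens, size=512, stride=256):
    """Split a token sequence into overlapping windows of at most `size`."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + size])
        # Stop once the current window reaches the end of the sequence.
        if start + size >= len(tokens):
            break
        start += stride
    return windows


# A stand-in for a 3000-token patent text.
tokens = list(range(3000))
windows = sliding_windows(tokens)
n_windows = len(windows)
```

Every window then has to be classified separately and the results merged, which is part of what motivates looking for a compact representative summary instead.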
 

Patent Classification

Test Collection

CLEFIP-0.54M - A patent classification dataset (subpart of CLEFIP 2011)

Classification system

International Patent Classification (IPC)
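Classification "at multiple levels" refers to the IPC hierarchy of section, class, subclass, main group and subgroup. The sketch below splits a full IPC symbol into the label it implies at each level. The function name `ipc_levels` and the regular expression are assumptions for illustration and cover only the common `A61K 31/12` layout, not every formatting variant found in patent data.

```python
import re

# Section letter A-H, two-digit class, subclass letter, main group number,
# then the subgroup number after the slash.
_IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def ipc_levels(symbol):
    """Return the label at each hierarchy level for a full IPC symbol."""
    match = _IPC_PATTERN.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized IPC symbol: {symbol!r}")
    section, cls, subclass, group, subgroup = match.groups()
    subclass_code = f"{section}{cls}{subclass}"
    return {
        "section": section,
        "class": f"{section}{cls}",
        "subclass": subclass_code,
        # By convention a main group is written with "/00" after the slash.
        "main_group": f"{subclass_code} {group}/00",
        "subgroup": f"{subclass_code} {group}/{subgroup}",
    }


levels = ipc_levels("A61K 31/12")
```

A classifier that targets the subclass level would use `levels["subclass"]` as its label, while a finer-grained one would use `levels["main_group"]` or `levels["subgroup"]`.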




This website was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 10695).