Patent Author Unification

Overview

The US patent database is an extremely important resource, but much of its data is unstructured. In particular, patent authors have no unique ID, and their identity must be resolved solely from their names as written in the patent and other contextual factors. A database of authors and their publication networks is maintained by Prof. Lee Fleming from IEOR.

Question

The key question is how to resolve the authors of patents, that is, to decide which author names on different patents refer to the same person. The task is in part featurization and in part a very large-scale clustering problem. It would also be good to be able to visualize and browse the network of authors and patents.
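As a rough illustration of the featurization half of the task, the sketch below normalizes author name strings and compares them by character trigram overlap. This is only one possible feature, not a method prescribed by the project; the function names (normalize_name, char_ngrams, jaccard, name_similarity) and the sample names are hypothetical, and real features would likely also draw on contextual fields from the patent record.

```python
import re
import unicodedata


def normalize_name(raw):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything that is not a-z or whitespace becomes a space.
    cleaned = re.sub(r"[^a-z\s]", " ", no_accents.lower())
    return " ".join(cleaned.split())


def char_ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with spaces."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def name_similarity(name_a, name_b):
    """Trigram Jaccard similarity of two raw name strings."""
    return jaccard(char_ngrams(normalize_name(name_a)),
                   char_ngrams(normalize_name(name_b)))


if __name__ == "__main__":
    # Hypothetical name variants, not taken from the patent data.
    print(normalize_name("Müller, Hans-Peter"))
    print(name_similarity("Jane Doe", "JANE  DOE."))
    print(name_similarity("Jane Doe", "Ann Smith"))
```

A string similarity like this does not handle reordered names ("Doe, Jane" versus "Jane Doe") and cannot separate two different people who share a name, which is why contextual features matter.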

Dataset

A dataset of 7 million digitized patents, with some metadata, is available.

References

A preprint of a paper on the topic may be available.

Tools

The main challenge is clustering at large scale, with a target of 1,000,000 clusters. There are a couple of possibilities, depending on the feature dimension. Discuss this with the staff.
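One common way to make clustering feasible at this scale, offered here only as a sketch and not as the approach the staff has in mind, is blocking: records are first grouped by a cheap key, pairwise comparisons are made only within each group, and matching pairs are merged with a union-find structure. The record fields (first, last, city), the blocking key, and the match rule below are all hypothetical placeholders.

```python
from collections import defaultdict
from itertools import combinations


class UnionFind:
    """Disjoint-set forest with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_records(records, block_key, is_match):
    """Cluster record indices, comparing only records that share a block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    uf = UnionFind(len(records))
    for members in blocks.values():
        # Pairwise comparison is quadratic, but only within one block.
        for i, j in combinations(members, 2):
            if is_match(records[i], records[j]):
                uf.union(i, j)

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[uf.find(idx)].append(idx)
    return sorted(clusters.values())


# Hypothetical records and rules, for illustration only.
records = [
    {"first": "Jane", "last": "Doe", "city": "Springfield"},
    {"first": "J.", "last": "Doe", "city": "Springfield"},
    {"first": "Jane", "last": "Doe", "city": "Riverside"},
    {"first": "Ann", "last": "Smith", "city": "Springfield"},
]


def block_key(rec):
    # Block on last name plus first initial.
    return (rec["last"].lower(), rec["first"][0].lower())


def is_match(a, b):
    # Placeholder rule: same block and same city counts as a match.
    return a["city"] == b["city"]


if __name__ == "__main__":
    print(cluster_records(records, block_key, is_match))
```

The cost is driven by the size of the largest block, so the choice of blocking key matters more than the match rule for scalability. Whether this or a vector-space method is more suitable depends on the feature dimension.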