Your Own Initial Embeddings
Cleora generally handles embedding initialization on its own. However, your relational data may have additional properties that you want to use to enhance the Cleora embeddings. For example, your products could have images or photos that are represented by embedding vectors obtained from methods such as SimCLR or CLIP. You can initialize Cleora with these embeddings so that the resulting Cleora embeddings express not only behavioral relations but also knowledge about image similarities.
Note 1: Initial embeddings can be given only to entities in Column 2 in the Input File.
Note 2: You can use any embedding dimension. Cleora will adjust.
Note 3: You can give initial embeddings only to SOME entities. Cleora will utilize the provided embeddings when available, and for other entities, it will initialize embeddings using its standard initialization method.
Option 1: .tsv file
Prepare a .tsv file in which the first column contains entity identifiers (corresponding to Column 2 in the Input File), and other columns contain embeddings.
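>>> import numpy as np
>>> your_embedding_matrix.shape
(171002, 129)
# there are 171002 entity ids
# column 0 holds the entity id; the remaining 128 columns hold the embedding
# note: np.savetxt's default format writes every column, including the id, as a float
>>> np.savetxt("your_initial_embeddings.tsv", your_embedding_matrix, delimiter="\t")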
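With its default format, np.savetxt writes every column as a floating-point number, so the call above only works when the entity IDs are numeric. If your IDs are strings, one option is to write the file by hand. The following is a minimal sketch under that assumption; the helper name write_initial_embeddings_tsv and the demo data are hypothetical and not part of Cleora.

```python
import os
import tempfile

import numpy as np


def write_initial_embeddings_tsv(path, entity_ids, embeddings):
    """Write one line per entity: <entity_id> TAB <v1> TAB ... TAB <vN>."""
    embeddings = np.asarray(embeddings, dtype=float)
    if embeddings.ndim != 2:
        raise ValueError("embeddings must be a 2D matrix")
    if len(entity_ids) != embeddings.shape[0]:
        raise ValueError("need exactly one entity id per embedding row")
    with open(path, "w", encoding="utf-8") as f:
        for entity_id, row in zip(entity_ids, embeddings):
            # The id is written verbatim; only the vector values are formatted as floats.
            values = "\t".join(repr(float(v)) for v in row)
            f.write(f"{entity_id}\t{values}\n")


# Tiny illustrative data: 3 products, embedding size 4 (real data would be larger).
demo_ids = ["product_a", "product_b", "product_c"]
demo_matrix = np.arange(12, dtype=float).reshape(3, 4)
demo_path = os.path.join(tempfile.mkdtemp(), "your_initial_embeddings.tsv")
write_initial_embeddings_tsv(demo_path, demo_ids, demo_matrix)
```

Keeping the IDs as text avoids any mismatch between an identifier in this file and the same identifier in Column 2 of the Input File.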
Option 2: .npz file
- Pack your entity identifiers (corresponding to Column 2 in the Input File) into a 1D NumPy array.
>>> your_entity_ids_arr.shape
(171002,)
# there are 171002 entity ids
- Pack your embeddings into a 2D NumPy matrix.
>>> your_embedding_matrix.shape
(171002, 128)
# the embedding size is 128
- Save the .npz file with appropriate array identifiers:
>>> np.savez("your_initial_embeddings.npz", embeddings=your_embedding_matrix, entity_ids=your_entity_ids_arr)
IMPORTANT: your_entity_ids_arr must correspond to your_embedding_matrix and keep the same ordering: the first entity ID in your_entity_ids_arr matches the first row of your_embedding_matrix, the second ID matches the second row, and so on.
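Because the two arrays are matched purely by position, it can help to check their shapes before saving and to reload the file afterwards. The sketch below shows one way to do this with plain NumPy; the helper name save_initial_embeddings_npz and the demo data are hypothetical, and only the array keys embeddings and entity_ids come from the instructions above.

```python
import os
import tempfile

import numpy as np


def save_initial_embeddings_npz(path, entity_ids, embeddings):
    """Save ids and vectors under the keys 'entity_ids' and 'embeddings'."""
    entity_ids = np.asarray(entity_ids)
    embeddings = np.asarray(embeddings)
    if entity_ids.ndim != 1:
        raise ValueError("entity_ids must be a 1D array")
    if embeddings.ndim != 2:
        raise ValueError("embeddings must be a 2D matrix")
    if entity_ids.shape[0] != embeddings.shape[0]:
        raise ValueError("entity_ids and embeddings must have the same number of rows")
    np.savez(path, embeddings=embeddings, entity_ids=entity_ids)


# Tiny illustrative data: 3 products with 128-dimensional random vectors.
demo_ids = np.array(["product_a", "product_b", "product_c"])
demo_matrix = np.random.default_rng(0).normal(size=(3, 128))
demo_path = os.path.join(tempfile.mkdtemp(), "your_initial_embeddings.npz")
save_initial_embeddings_npz(demo_path, demo_ids, demo_matrix)

# Reload and confirm that row i still belongs to id i.
with np.load(demo_path) as loaded:
    loaded_ids = loaded["entity_ids"]
    loaded_matrix = loaded["embeddings"]
print(loaded_ids.shape, loaded_matrix.shape)
```

A shape check cannot detect rows that were shuffled relative to the IDs, so the safest approach is to build both arrays in a single pass over the same source.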
Examples of Bad Initial Embeddings
Here are examples of some bad practices regarding initial embeddings:
- Embeddings with identifiers from Column 1 instead of Column 2. For example, if your Column 1 expresses users and Column 2 expresses products, then do not give embeddings for users. Give embeddings for products.
- Very low-dimensional embeddings. This can happen when you have only a few parameters for your entities. For optimal performance, the embedding dimension should generally be at least 128.
- Embeddings built from a poor set of properties or from a poor text/image embedding method. Poor-quality initial embeddings will degrade the quality of the final embeddings.
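The first two pitfalls above can be caught with a quick check before running Cleora. The sketch below is only an illustration: the function check_initial_embeddings is hypothetical, and it assumes you have already collected the set of identifiers that appear in Column 2 of your Input File. Embedding quality, the third pitfall, cannot be verified this way.

```python
import numpy as np


def check_initial_embeddings(entity_ids, embeddings, column2_ids, min_dim=128):
    """Return a list of warnings; the list is empty if nothing was flagged."""
    warnings = []
    embeddings = np.asarray(embeddings)
    known = set(column2_ids)
    # Pitfall 1: ids that do not occur in Column 2 (for example, user ids from Column 1).
    unknown = [e for e in entity_ids if e not in known]
    if unknown:
        warnings.append(
            f"{len(unknown)} id(s) not found in Column 2, e.g. {unknown[0]!r}"
        )
    # Pitfall 2: embedding dimension below the recommended minimum.
    if embeddings.shape[1] < min_dim:
        warnings.append(
            f"embedding dimension {embeddings.shape[1]} is below {min_dim}"
        )
    return warnings


# Illustrative data: one id belongs to Column 1, and the vectors are only 16-dimensional.
column2_ids = {"product_a", "product_b"}
for message in check_initial_embeddings(
    ["product_a", "user_1"], np.zeros((2, 16)), column2_ids
):
    print(message)
```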