Structure
The information in the vector store is divided into five collections: Documentation, DDL, SQL, Python, and Plot. Each collection stores a specific type of information and is used for different purposes. They are also populated at different stages of the analysis process:
- The Documentation collection is filled at the time of data ingestion,
- The DDL collection is empty by default and left to the user to fill (if needed), and
- The SQL, Python and Plot collections are filled during analysis.
Documentation collection
The documentation collection contains the following information about each of the input datasets:
- Column names, data types, and number or fraction of null values.
- A snapshot of the dataset - the first five rows.
For example, for datasets named population and real_estate, the dataset details would each include these two sections: the column summary and the five-row snapshot.
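As a rough illustration of those two sections, the sketch below builds a documentation string for a small in-memory table using only the standard library. The helper name describe_dataset, the sample columns, and the output layout are assumptions made for illustration; the exact text that DataAnalyzr stores is not shown in this document.

```python
def describe_dataset(name, rows, n_preview=5):
    """Build a plain-text summary: column details, then the first rows."""
    columns = list(rows[0].keys())
    lines = [f"Dataset: {name}", "Columns:"]
    for col in columns:
        values = [row[col] for row in rows]
        non_null = [v for v in values if v is not None]
        # Infer a type name from the first non-null value, if there is one.
        dtype = type(non_null[0]).__name__ if non_null else "unknown"
        null_fraction = (len(values) - len(non_null)) / len(values)
        lines.append(f"- {col}: {dtype}, null fraction {null_fraction:.2f}")
    lines.append(f"First {min(n_preview, len(rows))} rows:")
    for row in rows[:n_preview]:
        lines.append(", ".join(str(row[col]) for col in columns))
    return "\n".join(lines)


# Hypothetical sample data; the real population dataset is not shown here.
population_rows = [
    {"city": "A", "population": 1200},
    {"city": "B", "population": None},
]
documentation = describe_dataset("population", population_rows)
print(documentation)
```

The resulting string holds both sections in one piece of text, which is the form a documentation entry would need in order to be stored as a single item.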
DDL collection
Data Definition Language (DDL) examples are SQL queries that define the structure of a table. These are relevant for SQL-based analysis only. The DDL collection includes:
- The SQL query used to create the table.
- The SQL query used to insert data into the table.
For example, a table named population would have a DDL entry containing its CREATE TABLE and INSERT statements.
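A minimal sketch of what such a DDL entry could look like is given below. The column names and values are hypothetical, since the actual population table is not shown in this document, and SQLite is used only as a convenient way to check that both statements are valid.

```python
import sqlite3

# Hypothetical schema and values; the actual population table is not shown here.
create_table_sql = (
    "CREATE TABLE population (city TEXT PRIMARY KEY, residents INTEGER)"
)
insert_sql = (
    "INSERT INTO population (city, residents) VALUES ('A', 1200), ('B', 3400)"
)

# Run both statements against an in-memory database to confirm they are valid.
conn = sqlite3.connect(":memory:")
conn.execute(create_table_sql)
conn.execute(insert_sql)
row_count = conn.execute("SELECT COUNT(*) FROM population").fetchone()[0]
conn.close()

# A DDL entry is a single string, here the two statements joined together.
ddl_entry = f"{create_table_sql};\n{insert_sql};"
print(ddl_entry)
```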
SQL collection
The SQL collection includes examples of SQL queries that have been generated during analysis in response to user input. They are used as examples to answer similar questions in the future. These examples are relevant only for SQL-based analysis and are only added if the SQL query executes successfully. Each such example includes:
- Input question.
- SQL query generated during analysis.
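The rule that a pair is only added if the query executes successfully can be sketched as follows. This is an illustration of the idea rather than DataAnalyzr's own implementation: the helper record_if_valid, the table contents, and the use of a plain list in place of the SQL collection are all assumptions.

```python
import sqlite3


def record_if_valid(conn, question, sql_query, examples):
    """Append the question and query to examples only if the query runs."""
    try:
        conn.execute(sql_query).fetchall()
    except sqlite3.Error:
        # The query failed, so it is not kept as an example.
        return False
    examples.append({"question": question, "sql": sql_query})
    return True


# Hypothetical table used to try the queries against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (city TEXT, residents INTEGER)")
conn.execute("INSERT INTO population VALUES ('A', 1200), ('B', 3400)")

sql_examples = []  # stand-in for the SQL collection
stored = record_if_valid(
    conn,
    "Which city has the most residents?",
    "SELECT city FROM population ORDER BY residents DESC LIMIT 1",
    sql_examples,
)
rejected = record_if_valid(
    conn,
    "How many rows are there?",
    "SELECT COUNT(*) FROM missing_table",  # fails: the table does not exist
    sql_examples,
)
conn.close()
print(sql_examples)
```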
Python collection
The Python collection includes examples of Python code that have been generated during analysis in response to user input. They are used as examples to answer similar questions in the future. These examples are relevant only for Pythonic analysis and are only added if the code executes successfully. Each of these examples includes:
- Input question.
- Python code snippet generated during analysis.
For example, for a dataset named real_estate, an entry would pair an input question with the Python code generated to answer it.
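A question and code pair of this kind, together with the check that the code runs before it is kept, might look like the sketch below. The real_estate rows, the question, and the helper record_if_runs are hypothetical, and a plain list stands in for the Python collection.

```python
def record_if_runs(question, code, namespace, examples):
    """Append the question and code to examples only if the code runs."""
    try:
        # exec runs arbitrary code, so only use it on code you trust.
        exec(code, namespace)
    except Exception:
        return False
    examples.append({"question": question, "python": code})
    return True


# Hypothetical data; the real real_estate dataset is not shown here.
real_estate = [
    {"city": "A", "price": 250000},
    {"city": "B", "price": 410000},
]

question = "What is the average price?"
code = "result = sum(row['price'] for row in real_estate) / len(real_estate)"

namespace = {"real_estate": real_estate}
python_examples = []  # stand-in for the Python collection
stored = record_if_runs(question, code, namespace, python_examples)
print(stored, namespace.get("result"))
```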
Plot collection
The plot collection includes code that has been used to generate visualisations in response to user input. This code is used as an example when generating similar visualisations in the future. These examples are relevant for both SQL and Pythonic analysis and are only added if the code executes successfully. Each of these examples includes:
- Input question.
- Plotting code snippet generated during analysis.
For example, for SQL analysis over a database connection conn with a table population, or for Pythonic analysis over a dataset real_estate, an entry would pair the input question with the plotting code that produced the visualisation.
Operations
The information in the vector store can be augmented and retrieved using a set of operations. By adding extra information to the vector store, you can improve the quality of responses generated by DataAnalyzr. This is especially useful in the following scenarios:
- The system has difficulty intuiting the information in your dataset (e.g. when the column names are not descriptive).
- Your dataset pertains to a specific domain and you want to improve the system’s understanding of that domain.
- You want to change the way in which the system responds to a specific type of query.
- You want to encourage the system to generate specific types of responses.
- You want to improve overall performance.
These operations are performed on a DataAnalyzr instance, referred to here as data_analyzr.
The following attributes are then available:
- Vector Store: data_analyzr.vector_store
- Documentation Collection: data_analyzr.vector_store.documentation_collection
- DDL Collection: data_analyzr.vector_store.ddl_collection
- SQL Collection: data_analyzr.vector_store.sql_collection
- Python Collection: data_analyzr.vector_store.python_collection
- Plot Collection: data_analyzr.vector_store.plot_collection
Adding information
To add information to the vector store, follow these steps:
- Identify the type of information you want to add (documentation, DDL, SQL, Python, or plot) and, correspondingly, the collection you want to add it to.
- Ensure that the information is in the format of a Python string.
- Use the add_training_data method of the vector store to add the information to the relevant collection.
Retrieval
Information retrieval from the vector store depends on the use case.
Retrieving relevant information
To retrieve information relevant to a specific query, the following methods are available:
- Documentation: get_related_documentation
- DDL: get_related_ddl
- SQL: get_related_sql_queries
- Python: get_related_python_code
- Plot: get_related_plotting_code
Retrieving all information
To retrieve all the information in a collection, use the get method on the collection. The returned keys include ids, embeddings, metadatas, documents, uris, and data.
The ids key contains the unique identifiers of the stored information, while the documents key contains the actual information.
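Reading the result of get can be sketched as below. The dictionary is a hand-written stand-in with the keys listed above and made-up values, and the sketch assumes that ids and documents are parallel lists of equal length, which this document does not state explicitly.

```python
# In practice the result would come from a collection, for example:
#   result = data_analyzr.vector_store.documentation_collection.get()
# Here a hypothetical result with the same keys is used instead.
result = {
    "ids": ["doc-1", "doc-2"],
    "embeddings": None,
    "metadatas": [None, None],
    "documents": ["first stored text", "second stored text"],
    "uris": None,
    "data": None,
}

# Pair each identifier with its stored text (assumes parallel lists).
documents_by_id = dict(zip(result["ids"], result["documents"]))
for doc_id, text in documents_by_id.items():
    print(doc_id, "->", text)
```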