There are some great guides out there on how to create long-term memory for AI applications using embedding-based vector stores like ChromaDB or Pinecone. These vector stores are well suited to storing unstructured text data. But what if you want to query data that's already in a SQL database - or what if you have tabular data that doesn't make sense to write into a dedicated vector store? For example, what if we want to ask arbitrary historical questions, such as how many GitHub issues have been created in the Airbyte repo, how many PRs have been merged, and who has been the most active contributor of all time? Pre-calculated embeddings cannot answer these questions, since they rely on aggregations that are dynamic in nature and whose answers change constantly. It would be nearly impossible - and terribly inefficient - to try to answer them with pre-formed text documents and vector-based document retrieval.

In this post, we'll show how to chat with your SQL data warehouse using LlamaIndex. We'll ask questions in plain English, and the AI model will write and run SQL queries to answer them. Unlike similar "bring-your-own-data" projects, we won't need to calculate embeddings or use an external vector store database. We'll use Airbyte to set up a production data pipeline, which feeds data incrementally from GitHub into Snowflake on our chosen frequency: daily, weekly, or even hourly. Airbyte makes this fast, easy, and secure by streamlining authentication to GitHub and Snowflake and storing tokens securely, so we don't have to worry about a separate credential management system.

Why Chat-to-SQL interfaces are so powerful

Querying databases and data warehouses has traditionally required the questioner to have a high proficiency in SQL. Even for practitioners well-versed in SQL, there's always a learning curve associated with new tables and data models: you can't write queries until you've learned the data model, and learning each new data model takes up valuable time.

Now, with LLMs and new libraries like LlamaIndex, we can use natural language to interact with our data - without SQL expertise and without needing to learn or memorize database schemas ourselves. The LLM will automatically scan the table metadata to get column and table names so that it can create valid SQL for our specific data model. It will then generate SQL to answer our natural language question, run the SQL, and return the result. We'll show how this new and novel approach can be used to answer real-world questions like "Who has opened the largest number of issues this month?" without writing any SQL and without requiring a dedicated vector store. As an added bonus, we'll be able to see exactly what SQL the LLM is creating - so we can better understand what actions it is performing, and we can learn a little SQL in the process.

Step 1 - Populate GitHub source data in Snowflake using Airbyte

We'll start by using Airbyte's GitHub connector to pull data into Snowflake. With a few clicks, we'll create a connection from GitHub to Snowflake, and Airbyte will take care of keeping the pipeline running and always up-to-date. Ready to go? Let's get started!
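Before diving in, it helps to see the three-step flow described above (scan table metadata, generate SQL, run it and return the result) in miniature. The sketch below is an illustrative stand-in, not the actual pipeline: it uses an in-memory SQLite database instead of Snowflake, a hard-coded "generated" query where the LLM call would go, and assumed table and column names (`issues`, `author`, `created_at`).

```python
# A self-contained stand-in for the chat-to-SQL flow. In the real pipeline,
# LlamaIndex inspects the Snowflake schema and an LLM writes the SQL; here the
# database is in-memory SQLite and the "generated" query is hard-coded so the
# example runs anywhere. Table/column names are assumptions for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE issues (id INTEGER PRIMARY KEY, author TEXT, created_at TEXT);
INSERT INTO issues (author, created_at) VALUES
  ('alice', '2023-07-01'), ('alice', '2023-07-10'), ('bob', '2023-07-05');
""")

# Step 1: scan the table metadata (as the LLM would) to learn the data model.
columns = [row[1] for row in conn.execute("PRAGMA table_info(issues)")]
print(columns)  # ['id', 'author', 'created_at']

# Step 2: given the question "Who has opened the largest number of issues
# this month?" plus the metadata above, the LLM would emit SQL like this.
generated_sql = """
SELECT author, COUNT(*) AS n
FROM issues
WHERE created_at >= '2023-07-01'
GROUP BY author
ORDER BY n DESC
LIMIT 1;
"""

# Step 3: run the generated SQL and return the result to the user.
print(conn.execute(generated_sql).fetchone())  # ('alice', 2)
```

Note that the answer is computed by an aggregation at query time - exactly the kind of dynamic result that pre-calculated embeddings cannot provide.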