Let us attempt to explain a cartesian join in the simplest of terms, so that even a beginner could understand.
First, it is helpful to understand how databases store data. They use tables, with rows and columns. Each row is a unique record, containing the same types of data, each stored in a column. For example, a table of customer data might contain columns for the first name, the last name, and a customer ID (unique to each customer, to help separate two John Smiths, for example). In that table, then, when a new customer is added to the database, a new row is added with a new ID, and the first and last name in the appropriate column.
Next, it is useful to know that data can be retrieved from the database using something called Structured Query Language, or SQL (see-kwul) for short. Different types of databases (Oracle, Teradata, SQL Server, etc.) each have their own individual SQL syntax, but they are generally similar.
All forms of SQL allow you to pull data from the table(s) of a database into a set of results, and they all allow you to JOIN more than one table.
For example, in addition to the customer table mentioned above, there could also be a Customer Address table. That table could contain a column for address, one for city, one for state, one for zip, and an Address ID (unique number for each address). And, for the sake of keeping this simple, even though you would rarely, if ever, design a database this way, it would also contain the Customer ID of the customer whose address is in that record.
That way, you could write a query that selects the name information from the Customer table, and JOIN the Customer table to the Customer Address table, using some kind of syntax similar to this:
select * from customer, customer_address
where customer.cust_id = customer_address.cust_id
The (*) means return all records. The records are coming from the customer table and the customer_address table. So, each row returned will have all the columns from the customer table AND from the customer_address table:
first name, last name, cust_id, address, city, state, zip, address_id, cust_id
Note the ‘where’ clause in the query above. That specifies to ONLY join the record in the customer_address table to the record in the customer table where the cust_ids match. That is a correct way to join.
A Cartesian join, also called a product join, results when all of the records from one table are joined to all of the records in another. In the example above, we can create a Cartesian join by removing the ‘where’ clause.
Let’s assume that there are 200 records in the customer table, and that each customer has a record in the customer_address table (so, another 200 records in the customer_address table). With the query uses the correct syntax to join the two tables, the result set should be 200 rows, with one row per customer also containing the address information.
However, if we remove the where clause, then the SQL will return EACH customer attached to EACH address. Customer 1 will have a record for address 1, and address 2, and address 3, etc., up to address 200. So will customer 2, 3, etc. all the way through customer 200.
The result set would be 40,000 records long, instead of 200. It would display each of the 200 customers for 200 times, and 200*200=40,000. That is why a cartesian join is also called a product join…because it is the multiplication of the records instead of just the joining of records.
To avoid cartesian joins, always specify the criteria for joining tables so that any given record joins ONLY to the proper matching records in the second table.