Data Practices
Database Best Practices
Database Best Practices
Please consider the following before you start building your MariaDB database.
What are your goals for the database?
Keep in mind for what purposes your database will be used. Once built, will the database be static or will someone be updating it? Note any potential security restrictions, including who will have access to the data and editing permissions.
Planning your database
Knowing your data well is a crucial first step. Ensure that you are familiar with
datatypes, value ranges, missing data, row counts, and designated primary keys. Before
importing, determine whether cleanup is required. If it is, save interim files as
you clean in case later on you need to modify any changes. You may decide that the
delimiter is appropriate for loading the data (see Delimiters). When you are designing your database, your table architecture should reflect a
balance between the efficiency of relational data and the usability of your database
for analysis.
Fully document the preparation and building of the database
Draft up a design plan. You may find it helpful to create an entity-relationship diagram. As you construct your database, keep track of the steps taken in cleaning and preparing raw data for importation. Column naming is also an important part of documentation. Names should be descriptive and usable as well as consistent. Avoid using spaces and quoted identifiers in names. Consider how the names will be used in queries.
Data definition language
When choosing the appropriate data types, pay special attention to the following:
- type, length and sign of numeric columns
- features of fixed or variable length character, text, and binary columns
- row enumeration and auto-increment requirements
- dates and time columns
- text field usability
- how will non-categorical text fields be used?
- is it more efficient to create a separate (linked) table for large text fields so that text may be analyzed differently?
- try to avoid string literals
Refer to the CREATE TABLE statement for additional details.
Loading data
During your data load, determine whether you need to turn off logging and make sure to analyze any warnings. It is highly recommended that you first do test loads of tables. These import tests are very useful for assessing the appropriateness of data type choices. They are also excellent timesavers, especially if the full tables will be large, and they may uncover any unforeseen data problems.
Analysis of your data using SQL
Implement index key columns to maximize efficiency during joins and sorts. It is suggested that you query your data and build analysis tables in your ‘personal’ database rather than in your research (production) database. Consider if it will save time to use temporary tables when doing complex joins. While your query is running, use SHOW EXPLAIN to fine-tune the SQL. Use SELECT * and ORDER BY only if necessary, as it is more efficient to sort in a front-end program. Data analysis can be run via SQL scripts in the client, on the command line, or through code written in R, Python, SAS, and other applications.