Distributed Database

From CS Wiki
Revision as of 12:35, 15 December 2024 by Betripping

A distributed database is a collection of databases spread across multiple physical locations that together function as a single logical database. Each site can operate independently while participating in the unified system by communicating over a network.

Key Concepts

  • Data Distribution: Data is distributed across multiple sites based on factors like performance, reliability, and locality.
  • Transparency: Users interact with the distributed database as if it were a single database, regardless of the underlying distribution.
  • Replication: Data is duplicated across multiple sites to improve fault tolerance and availability.
  • Partitioning: Data is divided into subsets, each stored at a specific location.

Characteristics

Distributed databases are defined by the following characteristics:

  • Distributed Data Storage: Data is stored on multiple nodes or sites.
  • Autonomy: Each node can function independently and manage its local database.
  • Transparency:
    • Location Transparency: Users do not need to know where data is physically stored.
    • Replication Transparency: Users are unaware of data being replicated across sites.
    • Fragmentation Transparency: Users do not need to know how data is partitioned.
  • Scalability: The system can grow by adding more nodes.
  • Fault Tolerance: Replication and redundancy provide resilience to failures.
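The transparency and fault-tolerance properties above can be illustrated from the caller's side. The following is a minimal sketch, not a real DBMS interface: the `Node` and `DistributedTable` classes, the modulo placement rule, and the `up` flag are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical site that stores part of a table in memory."""
    name: str
    rows: dict = field(default_factory=dict)  # key -> record
    up: bool = True  # set to False to simulate a failed site


class DistributedTable:
    """Facade: callers use put/get and never name a node."""

    def __init__(self, nodes, copies=2):
        self.nodes = nodes
        self.copies = copies

    def _sites_for(self, key):
        # Fragmentation: the key picks a home node (assumed modulo rule).
        # Replication: the following nodes hold the extra copies.
        start = key % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.copies)]

    def put(self, key, record):
        # Simplification: writes ignore the `up` flag.
        for node in self._sites_for(key):
            node.rows[key] = record

    def get(self, key):
        # Try each replica in turn and skip sites that are down.
        for node in self._sites_for(key):
            if node.up and key in node.rows:
                return node.rows[key]
        raise KeyError(key)


nodes = [Node("A"), Node("B"), Node("C")]
table = DistributedTable(nodes)
table.put(7, {"name": "Ada"})   # lands on B and C; the caller never sees this
first_read = table.get(7)
nodes[1].up = False             # simulate B failing
second_read = table.get(7)      # still answered, now by C
```

The caller only ever uses `put` and `get`: where the record lives, how many copies exist, and how keys are split across nodes stay hidden behind the facade.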

Types of Distributed Databases

Distributed databases can be classified based on their architecture:

  1. Homogeneous Distributed Database:
    • All nodes use the same database management system (DBMS).
    • Example: A deployment in which every node runs PostgreSQL.
  2. Heterogeneous Distributed Database:
    • Nodes may use different DBMSs but are integrated into a single system.
    • Example: A system integrating MySQL and Oracle databases.
  3. Federated Database:
    • Autonomous databases are integrated through a middleware layer.
    • Example: A research database integrating multiple institutional datasets.

Advantages

  • Improved Performance: Data is stored closer to where it is needed, reducing access time.
  • Fault Tolerance: Data replication helps keep the system available when individual nodes fail.
  • Scalability: The system can handle growing amounts of data by adding more nodes.
  • Resource Sharing: Enables sharing of hardware, software, and data resources.

Limitations

  • Complexity: Managing a distributed database is more complex than managing a centralized one.
  • Consistency: Maintaining consistency across nodes in a distributed system can be challenging.
  • Communication Overhead: Data synchronization and query execution across nodes incur network overhead.
  • Latency: Network delays can affect query response times.

Example: Distributed Query in a Distributed Database

Consider a distributed database with two nodes:

  • Node 1 stores employee data.
  • Node 2 stores department data.

Query: Retrieve the names of employees in the "Sales" department.

Steps

Each step lists the action and, in parentheses, where it is performed:

  1. Parse the query (Query Coordinator):
    • SELECT employees.name FROM employees JOIN departments ON employees.dept_id = departments.dept_id WHERE departments.name = 'Sales'
  2. Decompose the query into sub-queries (Query Coordinator):
    • Query 1: Retrieve department IDs for "Sales" from Node 2.
    • Query 2: Retrieve employee names for the matching department IDs from Node 1.
  3. Execute the sub-queries on the respective nodes (Node 1, Node 2):
    • Node 2 returns department IDs for "Sales".
    • Node 1 returns employee names for the matching department IDs.
  4. Combine the results and return the final output (Query Coordinator).
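The steps above can be sketched in Python. This is only an illustration under stated assumptions: two in-memory SQLite databases stand in for Node 1 and Node 2, the sample rows are made up, and `employees_in_department` is a hypothetical coordinator function. A real system would reach the nodes over a network and use a query optimizer to choose the decomposition.

```python
import sqlite3

# Assumption: in-memory SQLite databases play the role of the two nodes.
node1 = sqlite3.connect(":memory:")  # Node 1: employee data
node2 = sqlite3.connect(":memory:")  # Node 2: department data

node1.execute("CREATE TABLE employees (name TEXT, dept_id INTEGER)")
node1.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Alice", 1), ("Bob", 2), ("Carol", 1)],  # made-up sample rows
)
node2.execute("CREATE TABLE departments (dept_id INTEGER, name TEXT)")
node2.executemany(
    "INSERT INTO departments VALUES (?, ?)",
    [(1, "Sales"), (2, "Engineering")],  # made-up sample rows
)


def employees_in_department(dept_name):
    """Coordinator: run one sub-query per node, then combine the results."""
    # Sub-query 1 (Node 2): department IDs for the requested name.
    dept_ids = [
        row[0]
        for row in node2.execute(
            "SELECT dept_id FROM departments WHERE name = ?", (dept_name,)
        )
    ]
    if not dept_ids:
        return []
    # Sub-query 2 (Node 1): employee names for the matching department IDs.
    placeholders = ", ".join("?" for _ in dept_ids)
    rows = node1.execute(
        "SELECT name FROM employees "
        f"WHERE dept_id IN ({placeholders}) ORDER BY name",
        dept_ids,
    )
    # Combine: here the final step only flattens the result rows.
    return [row[0] for row in rows]


sales = employees_in_department("Sales")
print(sales)
```

Only the department IDs travel between the nodes, not the full `departments` table, which is one reason a coordinator might prefer this decomposition to shipping whole tables.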

Data Distribution Techniques

Distributed databases use the following techniques to distribute data:

  • Replication:
    • Duplicates data across multiple sites.
    • Improves fault tolerance and read performance but requires synchronization.
  • Fragmentation:
    • Divides data into fragments, stored at different sites.
    • Types:
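      • Horizontal Fragmentation: Divides a table by rows; each fragment holds a subset of the rows.
      • Vertical Fragmentation: Divides a table by columns; each fragment holds a subset of the columns, typically along with the key so that rows can be reassembled.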
      • Hybrid Fragmentation: Combines horizontal and vertical fragmentation.
  • Hybrid Distribution:
    • Combines replication and fragmentation to optimize performance and fault tolerance.
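The fragmentation and hybrid distribution techniques listed above can be sketched on a small in-memory table. Everything here is hypothetical: the sample `employees` rows, the helper names, and the site names `site_a` to `site_c` are assumptions made for illustration only.

```python
employees = [  # made-up sample table
    {"id": 1, "name": "Alice", "region": "EU", "salary": 50000},
    {"id": 2, "name": "Bob", "region": "US", "salary": 60000},
    {"id": 3, "name": "Carol", "region": "EU", "salary": 55000},
]


def horizontal_fragments(rows, column):
    """Group whole rows into fragments by the value of one column."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments


def vertical_fragments(rows, key, column_groups):
    """Split columns into fragments; every fragment keeps the key column."""
    return [
        [{c: row[c] for c in (key, *cols)} for row in rows]
        for cols in column_groups
    ]


def rejoin_vertical(fragments, key):
    """Rebuild full rows by joining vertical fragments on the key."""
    merged = {}
    for fragment in fragments:
        for part in fragment:
            merged.setdefault(part[key], {}).update(part)
    return [merged[k] for k in sorted(merged)]


# Horizontal: one fragment per region.
by_region = horizontal_fragments(employees, "region")

# Vertical: general columns in one fragment, salary in another.
public, private = vertical_fragments(
    employees, "id", [("name", "region"), ("salary",)]
)

# Hybrid distribution: each horizontal fragment is replicated on two sites.
sites = ["site_a", "site_b", "site_c"]
placement = {
    region: [sites[i % len(sites)], sites[(i + 1) % len(sites)]]
    for i, region in enumerate(sorted(by_region))
}
```

Keeping the key column in every vertical fragment is what makes `rejoin_vertical` possible; without it the original rows could not be reconstructed.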

Applications

Distributed databases are widely used in:

  • Global Enterprises: Managing geographically dispersed data.
  • Cloud Databases: Underpinning cloud database services such as Google Spanner and Amazon Aurora.
  • IoT Systems: Managing data from distributed devices.
  • Big Data Analytics: Processing large-scale distributed datasets.

Challenges

Distributed databases face several challenges:

  • Data Consistency: Ensuring consistency across replicas while maintaining performance.
  • Network Partitioning: Handling situations where communication between nodes is disrupted.
  • Query Optimization: Efficiently executing queries across distributed nodes.
  • Security: Securing data transmission and storage across multiple locations.
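To make the consistency and network-partitioning challenges concrete, the sketch below uses quorum reads and writes. This is one well-known approach, chosen here for illustration; the article does not prescribe it. The `Replica` and `QuorumStore` classes, the `reachable` flag, and the parameters `n`, `w`, and `r` are hypothetical, and the code ignores concurrency and real networking.

```python
class Replica:
    """Hypothetical replica holding one versioned value."""

    def __init__(self):
        self.version = 0
        self.value = None
        self.reachable = True  # False simulates a network partition


class QuorumStore:
    """Reads and writes succeed only if enough replicas are reachable."""

    def __init__(self, n=3, w=2, r=2):
        # r + w > n makes every read set overlap the latest write set;
        # 2 * w > n makes every write see the latest version number.
        if r + w <= n or 2 * w <= n:
            raise ValueError("need r + w > n and 2 * w > n")
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r

    def _reachable(self):
        return [rep for rep in self.replicas if rep.reachable]

    def write(self, value):
        live = self._reachable()
        if len(live) < self.w:
            raise RuntimeError("write quorum unavailable")
        version = max(rep.version for rep in live) + 1
        for rep in live:
            rep.version, rep.value = version, value

    def read(self):
        live = self._reachable()
        if len(live) < self.r:
            raise RuntimeError("read quorum unavailable")
        # Ask r replicas and trust the one with the highest version.
        newest = max(live[: self.r], key=lambda rep: rep.version)
        return newest.value


store = QuorumStore()
store.write("v1")
store.replicas[0].reachable = False  # replica 0 misses the next write
store.write("v2")
store.replicas[0].reachable = True   # replica 0 returns with stale data
store.replicas[2].reachable = False  # a different replica drops out
latest = store.read()                # the overlap still yields the new value
```

The trade-off mentioned in the list shows up directly: larger quorums give stronger consistency, but each operation must wait for more replicas and fails outright when a partition leaves too few of them reachable.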

See Also